While many methods have been proposed to ensure data quality for objective tasks (in which a single correct response is presumed to exist for each item), estimating data quality for subjective tasks remains largely unexplored. Consider the popular task of collecting instance ratings from human judges: while agreement tends to be high for instances having extremely good or bad properties, instances with more middling properties naturally elicit a wider variance in opinion. In addition, because such subjectivity permits a valid diversity of responses, it can be difficult to detect whether a judge is undertaking the task in good faith. To address this, we propose a probabilistic, heteroskedastic model in which the means and variances of worker responses are modeled as functions of instance attributes. We derive efficient Expectation Maximization (EM) learning and variational inference algorithms for parameter estimation. We apply our model to a large dataset of 24,132 Mechanical Turk ratings of user experience in viewing videos on smartphones with varying hardware capabilities. Results show that our method is effective both at predicting user ratings and at detecting unreliable respondents.
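To make the heteroskedastic idea concrete, the following is a minimal sketch under stated assumptions, not the model or the EM and variational algorithms of the paper. It assumes a Gaussian response whose mean and log-variance are both linear in instance attributes, and it swaps in a simple two-step least-squares fit (regress ratings on attributes, then regress log squared residuals on the same attributes). The attribute `q`, the feature map, the function names, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

# -E[log chi^2_1]: added to the intercept because log squared Gaussian
# residuals underestimate the log-variance by this constant on average.
LOG_CHI2_BIAS = 1.2704


def features(q):
    """Hypothetical instance attributes: intercept, quality score q, and q^2."""
    q = np.asarray(q, dtype=float)
    return np.column_stack([np.ones_like(q), q, q ** 2])


def fit_two_step(q, y):
    """Two-step least squares (an assumption, not the paper's EM procedure)."""
    X = features(q)
    # Step 1: mean of the rating as a linear function of the attributes.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    # Step 2: log-variance as a linear function of the same attributes.
    v, *_ = np.linalg.lstsq(X, np.log(resid ** 2 + 1e-12), rcond=None)
    v[0] += LOG_CHI2_BIAS
    return w, v


def unreliability(q, y, w, v):
    """Mean squared standardized residual of one worker's ratings.

    Close to 1 when responses match the fitted mean and variance;
    much larger when responses ignore the instance.
    """
    X = features(q)
    z2 = (y - X @ w) ** 2 / np.exp(X @ v)
    return float(np.mean(z2))


# Synthetic data: opinions vary most for middling instances (q near 0).
rng = np.random.default_rng(0)
q = rng.uniform(-1, 1, 5000)
true_sd = np.exp(0.5 * (-1.0 - 2.0 * q ** 2))
y = 3.0 + 1.5 * q + true_sd * rng.normal(size=q.size)
w, v = fit_two_step(q, y)

# Compare a good-faith worker with one who rates uniformly at random.
q_new = rng.uniform(-1, 1, 200)
sd_new = np.exp(0.5 * (-1.0 - 2.0 * q_new ** 2))
honest = 3.0 + 1.5 * q_new + sd_new * rng.normal(size=q_new.size)
random_clicks = rng.uniform(1, 5, q_new.size)
honest_score = unreliability(q_new, honest, w, v)
spam_score = unreliability(q_new, random_clicks, w, v)
```

Because the variance is modeled per instance, a wide spread of honest ratings on a middling instance is not penalized, while ratings that are far from the mean on a clear-cut instance are. In this sketch that is why `spam_score` comes out well above `honest_score`.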
Published Date: 2016-11-03
Registration: ISBN 978-1-57735-774-2