Getting a clean dataset from multiple labelers
A brief overview of the current scientific literature on the problem of unifying noisy labels in the case of classification.
Recent advances in supervised learning owe, in part, to the availability of large annotated datasets. For instance, the performance of modern image classifiers saturates only with millions of labeled examples. Assembling such datasets typically requires the extensive labor of human annotators. Platforms like the one offered by Kili Technology allow efficient annotation of data points (or assets) by multiple people at the same time. The goal is to output a dataset with a single label for each asset. However, for efficiency and robustness, the annotation process is often carried out simultaneously by several labelers, who might label the same asset differently and thereby introduce noise into the dataset. This noise can arise for various reasons, such as unclear instructions, naturally ambiguous labels, skill differences among labelers, or simple human mistakes.
Commonly, a super-annotator (or reviewer) is tasked with checking whether the data has been labeled correctly and with selecting the best label to use in the final dataset. However, this method has its limitations, as it adds another layer of costly manual labor when creating a labeled dataset. Often, the super-annotator only sends feedback to the annotators, creating a correction loop that takes time and slows down the annotation process.
This problem has drawn some attention in the scientific literature, as it relates to the problem of crowdsourcing knowledge: how to retrieve truth from many noisy and unreliable estimators. For the problem of unifying annotations from concurrent labelers, the scientific literature has extensively studied the case of binary and multiclass classification. The problem is tackled through several algorithms, each with its own advantages and drawbacks. They can be simple and computationally cheap, like the majority vote [Whitehill et al., 2009], or more complex but yielding better results, like the Dawid & Skene estimator [Dawid and Skene, 1979].
Unifying labels for a classification task
Classification is a supervised learning problem, where the goal is to predict the label of a data point among a pool of predefined labels.
Recall that when building a training dataset, we do not have access to the ground truth of the labels. We only have access to the labels given by labelers, which might contradict each other. Not all labelers annotate all data: the number of annotations given to each asset is heterogeneous. This corresponds to the reality of crowdsourced annotators, who will not necessarily annotate the whole dataset, or who may lack the knowledge to annotate some of its elements. We only assume that every data point (or asset) is labeled at least once, possibly by different labelers.
In this review, we will study methods that output a unified label per data point from several labels for the case of a classification task. First, we will describe the unification problem more precisely. Then, we will examine three different algorithms that solve the task: the majority vote, the Dawid & Skene estimator, and finally, the GLAD algorithm.
In the case of a classification task, the datasets are often aggregated by applying a simple majority vote. Academics have proposed many efficient algorithms for estimating the ground truth from noisy annotations. Research addressing the crowdsourcing problem goes back to the early 1970s, when it was used to compile patient records. Dawid & Skene [Dawid and Skene, 1979] proposed a probabilistic model to jointly estimate worker skills and ground truth labels and used expectation maximization (EM) to estimate the parameters. Whitehill et al. [Whitehill et al., 2009] and Welinder et al. [Welinder et al., 2010] proposed generalizations of the Dawid-Skene model, e.g., by estimating the difficulty of each example.
We can try to better understand the problem with an example. Let’s suppose we attempt to classify images in a binary classification problem with categories 0 and 1. Let’s further suppose that we have 5 labelers working on the project simultaneously. The output of their work might look like the following table:
|-||Labeler 1||Labeler 2||Labeler 3||Labeler 4||Labeler 5|
For this to be useful to train a model, we want to output a single answer (0 or 1) for each image. In this setup, we could output a dataset that looks like this:
|-||Estimated ground truth|
This output dataset was not produced at random, but by following an intuition: for each image, we would like to choose the label that gathers the largest consensus. Since all labelers seem to agree to classify image 1 as a ‘1’, it would make sense to output a ‘1’ in the final dataset. In the case of binary classification, we can sum up that intuition by this formula:
with l(X_i) the label associated with the asset X_i, m the total number of assets, and l_j the function that maps an asset to the label decided by labeler j.
That is the majority vote.
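For reference, one way to write the binary majority vote explicitly is given below. This is a reconstruction under stated assumptions rather than the original formula: the sum is assumed to run over the J labelers who annotated asset X_i, labels are assumed to take values in {0, 1}, and both the symbol J and the tie-breaking rule (ties go to 0) are choices made here for illustration.

```latex
l(X_i) =
\begin{cases}
1 & \text{if } \displaystyle\sum_{j=1}^{J} l_j(X_i) > \frac{J}{2}, \\[6pt]
0 & \text{otherwise,}
\end{cases}
\qquad l_j(X_i) \in \{0, 1\}.
```

With five labelers, an asset therefore receives the label 1 as soon as at least three of them voted 1.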
The majority vote
Majority vote is the easiest way to aggregate labels in the case of a classification task. The idea behind the majority vote is that for a given asset, the final label chosen will be the label chosen by the largest number of annotators.
fig 1. Dependence diagram of the majority vote.
This is depicted in figure 1. In this figure, we want to annotate images 1,...,n without having access to the ground-truth labels Z1,...,Zn. Each labeler gives a label to an image: Lij is the label that labeler i gave to image j. Then, we combine the given labels to output the predicted labels, i.e., estimates of Z1,...,Zn.
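A minimal sketch of this aggregation in Python may help make it concrete. The function name `majority_vote`, the dictionary layout, and the example votes are all hypothetical (they are not the values of the table above), and the tie-breaking rule is an arbitrary choice.

```python
from collections import Counter


def majority_vote(labels_per_asset):
    """Return one label per asset: the most frequent label among its annotations.

    labels_per_asset maps an asset id to the list of labels it received.
    Labelers who skipped an asset simply do not appear in its list.
    Ties are broken in favour of the smallest label (an arbitrary choice).
    Every asset is assumed to have at least one annotation.
    """
    unified = {}
    for asset, labels in labels_per_asset.items():
        counts = Counter(labels)
        best = max(counts.values())
        unified[asset] = min(label for label, c in counts.items() if c == best)
    return unified


# Hypothetical votes; image_2 was annotated by only four labelers.
votes = {
    "image_1": [1, 1, 1, 1, 1],
    "image_2": [0, 1, 0, 0],
    "image_3": [1, 0, 1, 0, 1],
}
print(majority_vote(votes))  # {'image_1': 1, 'image_2': 0, 'image_3': 1}
```

Because it only counts votes, the same function works unchanged for multiclass labels.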
This very simple method has several drawbacks that limit its real-world application. Ideally, an aggregation technique should estimate the reliability of ad-hoc anonymous labelers, the difficulty of the different items in the dataset, and the probability of the true labels given the currently available data. The majority vote estimates none of these. For example, some labelers might misunderstand the labeling guidelines and consistently mislabel the data; their votes should not weigh as much as those of others. More complex methods have been proposed to address these issues.
Dawid and Skene estimator
Another classical method for estimating the ground-truth labels from the noisy labels is described by Dawid and Skene [Dawid and Skene, 1979]. In its simplified form, each labeler i is assumed to have a number ɑi that characterizes their intrinsic labeling ability: with probability ɑi, the label from the i-th labeler is correct, and with probability 1-ɑi, it is wrong. The labelers’ abilities ɑi can then be estimated by maximizing the marginal likelihood, and an estimate of the true labels follows by plugging the estimated abilities into Bayes’s rule.
fig 2. Dependence diagram of Dawid & Skene estimator.
Moreover, by regarding the true labels as latent variables, this two-stage estimation can be solved iteratively through the Expectation-Maximization (EM) algorithm [Dempster et al., 1977]. Several variants of this algorithm have emerged, for instance by incorporating Bayesian inference into the Dawid-Skene estimator through a prior over confusion matrices [Raykar et al., 2010]. This makes it possible, for instance, to incorporate the sensitivity and specificity of the different labelers in the model.
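The simplified model with one accuracy ɑi per labeler can be sketched with a few lines of NumPy. This is an illustrative sketch for binary labels, not the full Dawid & Skene estimator (which uses one confusion matrix per labeler). The function name, the `-1` encoding for missing annotations, the vote-fraction initialisation, and the shared class prior re-estimated at each iteration are all assumptions made here.

```python
import numpy as np


def one_coin_dawid_skene(labels, n_iter=50, eps=1e-6):
    """EM for a simplified ("one-coin") binary Dawid & Skene model.

    labels: array of shape (n_labelers, n_images) with entries 0, 1,
            or -1 where a labeler did not annotate an image.
    Returns (q, alpha): q[j] is the posterior probability that image j
    has true label 1, alpha[i] is the estimated accuracy of labeler i.
    """
    labels = np.asarray(labels)
    seen = labels >= 0                               # existing annotations
    is_one = ((labels == 1) & seen).astype(float)
    is_zero = ((labels == 0) & seen).astype(float)

    # Initialise the posteriors with the fraction of "1" votes per image.
    q = is_one.sum(axis=0) / np.maximum(seen.sum(axis=0), 1)

    for _ in range(n_iter):
        # M-step: expected fraction of correct answers for each labeler.
        expected_correct = is_one @ q + is_zero @ (1.0 - q)
        alpha = expected_correct / np.maximum(seen.sum(axis=1), 1)
        alpha = np.clip(alpha, eps, 1.0 - eps)       # keep the logs finite
        prior = np.clip(q.mean(), eps, 1.0 - eps)

        # E-step: Bayes' rule for each image, computed in log space.
        log_a, log_na = np.log(alpha), np.log(1.0 - alpha)
        log_p1 = np.log(prior) + log_a @ is_one + log_na @ is_zero
        log_p0 = np.log(1.0 - prior) + log_a @ is_zero + log_na @ is_one
        q = 1.0 / (1.0 + np.exp(np.clip(log_p0 - log_p1, -500, 500)))

    return q, alpha


# Hypothetical annotations: labelers 0-2 are reliable, labeler 3 always inverts.
votes = np.array([
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, -1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
])
q, alpha = one_coin_dawid_skene(votes)
print((q > 0.5).astype(int), alpha.round(2))
```

On this toy input the hard labels coincide with the majority vote, but the estimated accuracy of the inverting labeler falls well below 0.5, which is the kind of information the majority vote cannot provide.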
fig 3. Dependence diagram of the GLAD estimator.
Some more advanced estimators, like the GLAD (Generative model of Labels, Abilities, and Difficulties) estimator proposed by Whitehill et al. [Whitehill et al., 2009], also take into account the difficulty of each image j, through a parameter 𝛽j, in addition to the labeler accuracies ɑi of the Dawid & Skene estimator. This setup handles the case where not all labelers have the same experience level or the same quality of output, for example when one of the labelers is an automatic classifier (or any other source of pre-annotation). The utility of modeling the image difficulty is easy to understand: if there is an ambiguous data point in the dataset (like a blurry image), then giving a ‘good’ label will be harder, and labelers might even agree on a wrong label, not because of a lack of knowledge, but purely because of the difficulty of that particular piece of data.
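For readers who want the model in symbols, the probability that a label is correct in GLAD can be written as below, following the parametrization of [Whitehill et al., 2009] as we understand it. The indices are chosen to match the Lij notation used earlier (labeler i, image j).

```latex
p\left(L_{ij} = Z_j \mid \alpha_i, \beta_j\right)
  = \frac{1}{1 + e^{-\alpha_i \beta_j}},
\qquad \alpha_i \in \mathbb{R}, \quad \beta_j > 0 .
```

In this form, 1/𝛽j plays the role of the difficulty: as 𝛽j approaches 0, the probability of a correct label tends to 1/2 whatever the labeler's ability, while a negative ɑi describes an adversarial labeler who is wrong more often than right.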
As before, the maximum likelihood estimate of the parameters of interest is obtained using the Expectation-Maximization (EM) approach.
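As an illustration, the E-step of such a procedure for binary labels could look like the sketch below. It assumes the GLAD model in which labeler i labels image j correctly with probability sigmoid(ɑi · 𝛽j), with 𝛽j > 0. The function name, the `-1` encoding for missing annotations, and the uniform default prior are hypothetical choices. The M-step, which has no closed form and requires numerical optimization, is deliberately left out.

```python
import numpy as np


def glad_posterior(labels, alpha, beta, prior_one=0.5):
    """Posterior P(Z_j = 1 | labels, alpha, beta) for every image j.

    labels: (n_labelers, n_images) array with entries 0, 1, or -1 (missing).
    alpha:  (n_labelers,) labeler abilities, any real number.
    beta:   (n_images,) positive values; small beta means a hard image.
    """
    labels = np.asarray(labels)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)

    # Probability that labeler i is correct on image j: sigmoid(alpha_i * beta_j).
    p_correct = 1.0 / (1.0 + np.exp(-np.outer(alpha, beta)))

    # Likelihood of each observed label if the true label is 1, resp. 0.
    lik_if_one = np.where(labels == 1, p_correct, 1.0 - p_correct)
    lik_if_zero = np.where(labels == 0, p_correct, 1.0 - p_correct)

    # Missing annotations contribute a neutral factor of 1.
    seen = labels >= 0
    lik_if_one = np.where(seen, lik_if_one, 1.0)
    lik_if_zero = np.where(seen, lik_if_zero, 1.0)

    joint_one = prior_one * lik_if_one.prod(axis=0)
    joint_zero = (1.0 - prior_one) * lik_if_zero.prod(axis=0)
    return joint_one / (joint_one + joint_zero)


# Two labelers disagree on one image; the more able one (alpha = 2) says 1.
print(glad_posterior([[1], [0]], alpha=[2.0, 0.5], beta=[1.0]))
```

The posterior leans towards the label of the more able labeler, and it moves back towards 0.5 as 𝛽j shrinks, i.e. as the image becomes harder.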
GLAD keeps the advantages of the majority vote (small computational cost) but can recover the true data labels more accurately than the majority vote heuristic. According to [Whitehill et al., 2009], it is highly robust to both noisy and adversarial labelers, and their experiments show that it is more accurate than the Dawid & Skene estimator.
To explore the importance of estimating image difficulty, a simple simulation is described in [Whitehill et al., 2009]: image labels (0 or 1) were assigned randomly (with equal probability) to 1000 images. Half of the images were “hard” and half were “easy”. Fifty simulated labelers labeled all 1000 images. The proportion of “good” to “bad” labelers was 25:1. The probability of correctness for each combination of image difficulty and labeler quality is given in the table below.
table 1. Error rates of different algorithms for the unification task, extracted from [Whitehill et al., 2009].
In Table 1, we can see that the GLAD algorithm achieves a lower error rate than both the Dawid & Skene estimator and the majority vote on the experiment described above, showing the importance of taking image difficulty into account.
There are various ways to resolve conflicts when labeling data with multiple annotators. We reviewed three ways of choosing labels in the case of a classification task, in order of increasing complexity. These methods all serve the same purpose: to output a single annotation for each data point when given multiple (possibly conflicting) annotations from several labelers. They differ in complexity and accuracy, but all yield useful results when unifying annotations for a classification task. For more complex ML tasks (Named Entity Recognition, Object detection), to our knowledge, no unification technique has been proposed in the academic literature yet.
Alexander Philip Dawid and Allan M. Skene. “Maximum likelihood estimation of observer error rates using the EM algorithm”. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 28.1 (1979), pp. 20–28. URL: http://crowdsourcing-class.org/readings/downloads/ml/EM.pdf
Jacob Whitehill et al. “Whose vote should count more: Optimal integration of labels from labelers of unknown expertise”. In: Advances in Neural Information Processing Systems 22 (2009). URL: https://proceedings.neurips.cc/paper/2009/file/f899139df5e1059396431415e770c6dd-Paper.pdf
Peter Welinder et al. “The multidimensional wisdom of crowds”. In: Advances in Neural Information Processing Systems 23 (2010). URL: https://proceedings.neurips.cc/paper/2010/file/0f9cafd014db7a619ddb4276af0d692c-Paper.pdf
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. “Maximum likelihood from incomplete data via the EM algorithm”. In: Journal of the Royal Statistical Society: Series B (Methodological) 39.1 (1977), pp. 1–22. URL: https://www.eng.auburn.edu/~roppeth/courses/00sum13/7970%202013A%20ADvMobRob%20sp13/literature/paper%20W%20refs/dempster%20EM%201977.pdf
Vikas C. Raykar et al. “Learning from crowds”. In: Journal of Machine Learning Research 11.4 (2010). URL: https://www.jmlr.org/papers/volume11/raykar10a/raykar10a.pdf