How to Annotate the Right Data?
When starting an annotation project, it is important to select the best data to annotate. Learn how to annotate the right data.
When starting an annotation project, it is important to select the best data to annotate, in other words, the data that would most improve your model’s accuracy. Most of the time, you will have more data than you can annotate. So what is the best strategy to maximize your model’s performance?
Explore your dataset
The first thing to do is to explore your dataset. If your project is about classification, you will probably have an idea of how the classes are distributed. However, you may have assumed that two classes were equally frequent in the dataset when in fact one of them occurs far more often. Since you will annotate a sample of your dataset, you need to know how often each class really occurs. Otherwise your sample will not reflect reality, which will probably lead to performance issues in your model.
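As a minimal sketch of this exploration step, the Python snippet below measures the class distribution and then draws a sample that keeps roughly the same proportions. It assumes you already have a rough label for each asset (for instance from metadata or a quick heuristic), which the article does not specify. The function names class_distribution and stratified_sample are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def class_distribution(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def stratified_sample(items, labels, sample_size, seed=0):
    """Pick about sample_size items while preserving the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    sample = []
    for members in by_class.values():
        # Keep at least one item per class so rare classes are not lost.
        k = max(1, round(sample_size * len(members) / len(items)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical data: rough labels (an assumption), 90% "cat" and 10% "dog".
labels = ["cat"] * 90 + ["dog"] * 10
items = list(range(len(labels)))

print(class_distribution(labels))
sample = stratified_sample(items, labels, sample_size=20)
print(len(sample))
```

Because each class is rounded separately and kept to at least one item, the sample size can differ slightly from the requested one when there are many small classes.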
Quality vs Quantity
Once you know your dataset, you can sample it and start a first annotation prototype. In most annotation projects, several people annotate. This raises a question: should you use your annotators to annotate as much data as possible (i.e. one annotator per asset), or should you assign several annotators per asset in order to avoid noisy labels? You probably assume that the more annotated data you give to your model, the more accurate it will be. That is correct, but a model also needs clean labels to perform well. Unfortunately, human annotators give wrong labels from time to time, which is what we call noise. The most direct way to reduce that noise is to assign multiple annotators to a given asset. If annotator A gives label x to an asset, and annotators B and C give label y to the same asset, we can detect that the annotators disagree and check which label is the correct one. This way, the labels we give to the model are much more likely to be clean. Moreover, it has been shown that training on a cleaned dataset leads to better performance (cf. https://jair.org/index.php/jair/article/view/12125 and https://l7.curtisnorthcutt.com/confident-learning).
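The example with annotators A, B and C can be sketched in a few lines of Python. This is only an illustration, assuming each asset comes with a plain list of votes. The helper name aggregate_labels and the rule of returning no label on a tie are assumptions, not something the article prescribes.

```python
from collections import Counter


def aggregate_labels(votes):
    """Return (majority_label, needs_review) for one asset's list of votes.

    majority_label is None when the top labels are tied.
    needs_review is True as soon as the annotators do not all agree.
    """
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    tie = len(ranked) > 1 and ranked[1][1] == top_count
    label = None if tie else top_label
    needs_review = len(ranked) > 1
    return label, needs_review


# Annotator A says "x", annotators B and C say "y".
print(aggregate_labels(["x", "y", "y"]))  # ('y', True)
# All three annotators agree.
print(aggregate_labels(["x", "x", "x"]))  # ('x', False)
```

Assets flagged with needs_review can then be sent to a reviewer, who decides which label is correct.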
First prototype and performance monitoring
The next step is to build a prototype: it is better to have a working pipeline quickly. After annotating a few assets, you can train your model on them. By monitoring your model’s performance, you will probably find that some categories are predicted less accurately than others. Several actions can counter this. First, you can provide your model with more data from the weak category, on the assumption that the more data your model has, the more accurate it will be. Secondly, you can monitor the agreement between annotators. Some annotation jobs lead to disagreement between annotators, either because the task is poorly formulated or simply because it is complex. You can assign more annotators to these specific jobs in order to reduce label noise. Thirdly, in the case of object detection, you can choose a more precise tool. For example, you can switch from bounding boxes to polygons (cf. our article on bounding box vs polygon), keeping in mind that a more precise tool takes an annotator more time, which in the end means less annotated data.
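To see which categories lag behind, one simple option is to compute accuracy per class on a held-out set of annotated assets. The sketch below assumes you have the reference labels and the model’s predictions as two lists of equal length. The function name per_class_accuracy is hypothetical.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """For each true class, return the fraction of its assets predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


# Hypothetical reference labels and model predictions.
y_true = ["a", "a", "b", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "b", "b"]

scores = per_class_accuracy(y_true, y_pred)
# List the weakest classes first: these are candidates for more data or more annotators.
for cls, score in sorted(scores.items(), key=lambda pair: pair[1]):
    print(cls, score)
```

Note that this measures, for each true class, how often the model is right (per-class recall). It says nothing about how often a predicted class is wrong.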
Fine-tuning and continuous improvement
At this stage, you have a first proof of concept and your model is ready to be deployed to production. It already performs decently, but you want to keep improving it with new annotations. As said above, noisy labels are one of the biggest obstacles for a machine learning model. By monitoring the agreement metrics between annotators on a given label, you can detect where the noisy labels are and concentrate your workforce on these assets to reduce the noise and therefore improve your model’s performance.
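The article does not name a specific agreement metric. One common choice for two annotators is Cohen’s kappa, which corrects the raw agreement rate for the agreement expected by chance. The sketch below is a plain Python illustration under that assumption, with two annotators labelling the same assets in the same order.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators who labelled the same assets."""
    n = len(labels_a)
    # Fraction of assets on which both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if each annotator labelled at random with their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label everywhere
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on four assets.
annotator_a = ["x", "x", "y", "y"]
annotator_b = ["x", "x", "y", "x"]
print(cohen_kappa(annotator_a, annotator_b))  # 0.5
```

A value close to 1 indicates strong agreement, while a value close to 0 means the annotators agree no more often than chance would predict. Low values on a subset of assets suggest where extra annotators are worth assigning.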
Another way to improve your model’s accuracy is active learning. The idea is to find the predictions your model is least sure of and annotate the corresponding assets first. Your model is then trained on the examples it finds hardest, which tends to improve accuracy faster than annotating assets at random.
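A minimal sketch of this idea is least-confidence sampling, which is only one of several active learning strategies. It assumes your model outputs one list of class probabilities per unlabelled asset. The function name least_confident is hypothetical.

```python
def least_confident(probabilities, k):
    """Return the indices of the k assets whose top predicted probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical model outputs for four unlabelled assets (two classes).
probabilities = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.60, 0.40],
    [0.99, 0.01],
]
print(least_confident(probabilities, k=2))  # [1, 2]
```

The returned indices point to the assets to send to annotators next. After they are labelled, the model is retrained and the selection is repeated.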