How to make annotation painless?
The need for data annotation grows as the use of supervised machine learning grows. Only humans can provide the ground truth labels needed to train a model. As an annotator, the job of annotating is thankless, repetitive and sometimes complex. As an annotation project manager, managing a project with thousands, sometimes hundreds of thousands, sometimes millions of assets can also be difficult. Furthermore, the problem only gets more complex as the amount of data gets bigger and bigger.
Annotation can very easily become a pain. In this article, we will look at some tools and good practices that will keep you from going mad in an annotation project.
Do not underestimate the time needed to define the project
When working on an annotation project, having to change the annotation strategy in the middle of the project can be catastrophic. Labelling a whole dataset is very time-consuming. The smallest mistake or omission in the definition of the project can cost you a lot of time, as you may need to restart the project from scratch.
While defining the project, think things through carefully before the labelling phase. What type of annotations do I need? How will I use them? Does my workforce need specific expertise? These are examples of the questions you need to ask yourself while defining the annotation project. Spending time answering them at the beginning will save you from relabelling assets later.
Spend time defining good guidelines
Guidelines are very important. As an annotation project manager, you might have a clear idea of what you want, but if you have 100 annotators, they won't necessarily all produce the same labels. Spend as much time as needed defining good labelling instructions, so that all annotators share your vision.
Imagine that 3,000 assets of your dataset have already been annotated. While supervising the project and reviewing annotators, you notice that all assets have very poor consensus between annotators. By exploring the labels, you see that the annotators don't label objects the same way. On the image below, for example, you can see different ways to annotate the same image. You could draw one label that encompasses all objects, or a separate label for each kite, with or without the kitesurfers. Every annotator could annotate this image differently, and the same could be true for every asset in your dataset.
This happened because you didn't give guidelines to your annotators, or they were not clear enough. If this happens, you need to clarify the labelling instructions and relabel all the faulty assets. This is a huge loss of time! Giving good instructions to annotators is a key step in an annotation project.
In the Kili platform, you can upload guidelines for every project, and these guidelines are accessible to annotators in the labelling interface.
Spend time exploring the dataset
While labelling the dataset, you notice that you need a new category because many assets share a common feature that could be a good discriminative feature for your classification task. You didn't spend enough time exploring the dataset, so you weren't able to define your categories correctly. You find yourself in the same situation as before: you need to add this new category and ask your annotators to relabel the first assets. Once again, you didn't spend enough time at the beginning of the project thinking it through and designing it.
Exploring your dataset is a very important step at the beginning of an annotation project, to spot common features, limitations and corner cases. Only with a global vision of your dataset can you properly define the annotations for the functionality you want.
Spend time scoping your resources
Remember to spend time thinking about your timeline and budget constraints. Estimating the time cost of the project will tell you whether you can annotate the whole dataset or not. If not, it is a good idea to select the assets that matter most. Don't forget to count the training of the workforce, the review phase, the time to label new assets that may arrive, and so on. The number of assets to annotate can quickly make it painful to deliver annotations on time.
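To make this concrete, a quick back-of-the-envelope estimate already tells you whether the timeline is realistic. All the numbers below (dataset size, seconds per asset, workforce size, rates) are made-up assumptions to illustrate the calculation:

```python
# Rough capacity estimate -- every number here is an illustrative assumption.
n_assets = 50_000            # assets to label
seconds_per_asset = 45       # average labelling time per asset
n_annotators = 10            # available workforce
hours_per_day = 6            # productive hours per annotator per day
hourly_rate = 15             # cost of one annotator-hour

total_hours = n_assets * seconds_per_asset / 3600
days_needed = total_hours / (n_annotators * hours_per_day)
cost = total_hours * hourly_rate

print(f"~{total_hours:.0f} annotator-hours, ~{days_needed:.1f} working days, ~{cost:.0f} in labour cost")
```

If the result does not fit your deadline, you know before labelling starts that you have to shrink the dataset, grow the workforce or simplify the task.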
Your budget will also guide the choice of workforce. Selecting a workforce with domain expertise saves training time and produces better-quality labels. Thinking about the budget and the choice of workforce is a key step at the beginning of the project to avoid surprises later.
Use smart tools if possible
Smart tools are a very good way to speed up your annotation project. They are machine-learning-powered tools that help you during annotation by inferring what you might want to do.
In image segmentation
Image segmentation can be one of the longest annotation tasks. It is relevant when you need the very precise position of an object in an image, pixel by pixel. Drawing the border of an object with a mouse is both very slow and very imprecise. Two possible smart tools to address this issue are superpixels and interactive segmentation.
Tools based on superpixels display clusters of pixels pre-computed according to the pixels' colours. Pixels are the most elementary objects of an image, and annotating at the pixel level is often overkill. Superpixels let you easily merge clusters of pixels into the shape that you want to label.
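As an illustration, superpixels can be pre-computed with the SLIC algorithm from scikit-image; the number of segments and the compactness below are arbitrary example values:

```python
from skimage import data
from skimage.segmentation import slic, mark_boundaries
import matplotlib.pyplot as plt

image = data.astronaut()                # any RGB image as a NumPy array
# Cluster the pixels into roughly 200 superpixels based on colour and proximity.
segments = slic(image, n_segments=200, compactness=10, start_label=1)

# Show the superpixel boundaries that an annotator would merge into a shape.
plt.imshow(mark_boundaries(image, segments))
plt.axis("off")
plt.show()
```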
Tools based on interactive segmentation let you draw a very simple shape on the object you want to label, and the model automatically infers the shape you want to select with pixel precision. There are different user experiences for interactive segmentation, such as drawing a bounding box that contains the object you want to label, or putting a point on the shape.
In Kili, we have a very fast interactive segmentation tool. It allows you to put positive points on the shape that you want to label and negative points outside the shape to refine the selection. This is a very powerful tool that will save you a lot of time in your annotation project and will make the task less thankless for annotators.
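To give an idea of the interaction pattern (not of Kili's actual model), here is a toy stand-in where each pixel is assigned to the mask if its nearest positive click is closer than its nearest negative click; a real tool would replace this heuristic with a neural network:

```python
import numpy as np

def toy_predict_mask(shape, positive, negative):
    """Toy stand-in for an interactive-segmentation model.
    positive / negative are lists of (x, y) clicks; returns a boolean mask."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]

    def min_dist(points):
        if not points:
            return np.full(shape, np.inf)
        return np.min([np.hypot(ys - y, xs - x) for (x, y) in points], axis=0)

    return min_dist(positive) < min_dist(negative)

# Simulated annotator clicks: two positive points on the object, one negative outside.
mask = toy_predict_mask((100, 100), positive=[(30, 40), (35, 45)], negative=[(80, 80)])
print(mask.sum(), "pixels selected")
```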
In image object detection
Image object detection tools automatically find the presence and the position of objects in an image. These annotations can be given as predictions and validated by an annotator. While annotating, the human eye can spot every occurrence of a class within a second; it is the task of drawing the position of every object that is time-consuming.
Object detection tools achieve two tasks, as sketched below:
- a classification task, to say whether an object is present in the image or not;
- a detection task, to predict the most accurate position of the object when the classification task says it is present.
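As an illustration, a generic pre-trained detector such as torchvision's Faster R-CNN (one possible choice among many; the image path is made up) can produce these pre-annotations, which annotators then only validate or correct:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Generic pre-trained detector used to generate pre-annotations.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("kitesurfers.jpg").convert("RGB")   # illustrative path
with torch.no_grad():
    prediction = model([to_tensor(image)])[0]

# Keep confident detections only; the annotator validates or corrects them.
for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if score > 0.5:
        print(f"class {label.item()} at {box.tolist()} (confidence {score:.2f})")
```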
In video annotation
Videos can be extremely long to annotate, since object detection in a video needs to be done on every frame. Fortunately, tracking tools exist, performing different tasks:
- The task of Multi Object Tracking (MOT) extends the image object detection of the previous paragraph to videos. It locates the position of objects and adds a temporal dimension to track them through frames.
- The task of Single Object Tracking (SOT) lets the user locate the object on the first frame and infers the position of the object in the following frames.
Tracking an object in every frame is one of the most painful tasks in annotation, and such smart tools will save your life as an annotator as well as a project manager.
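To make the idea concrete, here is a toy sketch of the linking step behind multi-object tracking: per-frame detections are stitched into tracks by greedy IoU matching (real trackers add motion models and appearance features on top of this):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if inter else 0.0

def link_tracks(frames, iou_threshold=0.3):
    """frames: one list of boxes per frame. Returns a track id for every box,
    obtained by greedily matching each box to the closest track of the previous frame."""
    next_id, tracks, assignments = 0, [], []   # tracks: list of (track_id, last_box)
    for boxes in frames:
        frame_ids, new_tracks = [], []
        for box in boxes:
            best = max(tracks, key=lambda t: iou(t[1], box), default=None)
            if best and iou(best[1], box) >= iou_threshold:
                track_id = best[0]
                tracks.remove(best)            # each track matches at most one box
            else:
                track_id, next_id = next_id, next_id + 1
            new_tracks.append((track_id, box))
            frame_ids.append(track_id)
        tracks = new_tracks
        assignments.append(frame_ids)
    return assignments

# One object moving slightly to the right over two frames: both boxes get track id 0.
print(link_tracks([[(10, 10, 50, 50)], [(14, 10, 54, 50)]]))
```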
These are only a few examples of tools that can make your life easier in an annotation project. They save a lot of time, produce more precise annotations and make the annotation task less painful. At Kili, we have implemented many smart tools for several data types. We have state-of-the-art models for superpixels and interactive segmentation. We have just released a new version of Single Object Tracking that is faster and more accurate than the previous one. All these smart tools will save you a colossal amount of time in your annotation task.
These tools are based on AI models, but keep in mind that the goal of the project is to let humans produce ground truth annotations. Letting models annotate data to train other models is nonsense. These tools are here to accelerate human annotation, not to replace it. If they fail to predict the annotation you want to produce, the human always has to correct it, or to annotate without the smart tool.
Make sure that you can easily supervise the project
Another weakness of a data annotation solution that can make annotation painful is the lack of tools allowing you to easily supervise the project and the annotators.
Audit the Quality
Humans define the ground truth annotations, but this ground truth is subjective and not all humans will give the same labels to the same asset. You might have given instructions, but how can you be sure they are followed? Annotation is a very repetitive task and the quality of annotations can degrade over time, but how do you spot it? Not being able to audit the quality of annotations is a real pain point.
Review assets and give feedback
Being able to review annotations is a very helpful feature. It helps you, as an annotation project manager, to keep an eye on the quality of the annotations. If some labels do not follow the guidelines that you defined, it gives you the opportunity to give feedback to annotators. It also gives you a space to answer annotators' questions or look at issues they may have raised during annotation.
In a project where labelling is complex and different people might give different labels, it is a good idea to have assets labelled by several annotators. Having a metric to measure consensus among annotators lets you spot the assets that are ambiguous. It helps you delete assets where no one agrees and where it is difficult to give a label, and it lets you review the ambiguous assets to remove the uncertainty. Moreover, it gives you an idea of the complexity of your dataset in terms of annotation.
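A very simple consensus measure for a classification task is the share of annotator pairs that agree on an asset; this is only a toy example (platforms typically use richer measures such as IoU for bounding boxes or Krippendorff's alpha), and the labels below are made up:

```python
from itertools import combinations

def consensus(labels):
    """Fraction of annotator pairs that give the same class for one asset."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

labels_per_asset = {
    "asset_001": ["kite", "kite", "kite"],          # perfect agreement
    "asset_002": ["kite", "kitesurfer", "kite"],    # partial agreement
    "asset_003": ["kite", "kitesurfer", "bird"],    # no agreement -> ambiguous asset
}
for asset, labels in labels_per_asset.items():
    print(asset, f"consensus={consensus(labels):.2f}")
```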
Honeypot and Golden Standard
Honeypot and golden standard are two names for the same idea. It lets you set a reference label that you define as the ground truth. Then it lets you measure disagreement, this time not between annotators, but between this ground truth label and the annotators' labels. Thus, honeypot measures the quality of the annotations and is a good tool to audit it.
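In its simplest form, a honeypot score is the share of reference assets where an annotator's label matches the golden label; the data below is made up for illustration:

```python
def honeypot_score(golden, produced):
    """Share of honeypot assets where the annotator's label matches the golden label."""
    common = set(golden) & set(produced)
    if not common:
        return None
    return sum(golden[a] == produced[a] for a in common) / len(common)

# Reference labels you set, and one annotator's labels on the same assets.
golden    = {"asset_010": "kite", "asset_042": "kitesurfer", "asset_077": "kite"}
annotator = {"asset_010": "kite", "asset_042": "kite",       "asset_077": "kite"}
print(f"honeypot score: {honeypot_score(golden, annotator):.2f}")   # 0.67
```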
Supervising a project without such features feels like moving forward in the fog. Not being able to audit the quality of annotations is one major cause of difficulty in an annotation project.
At Kili, all the features described above are integrated into the application, along with even more advanced functionalities. For more information on what we offer, you can have a look at the documentation of our platform.
Control queue and order
Priorities exist in every project. In an annotation project, there are many criteria that can make one asset feel higher priority than others.
At the beginning of your annotation project, you may want to prioritize the assets to label according to the impact they could have on your model training. When training a model, a new training asset won't necessarily increase what the model learns; to do so, the asset needs to carry new information.
When reviewing labels, it is a good idea to focus on assets with a low consensus. If you have plenty of assets to review, you may want to put the assets with the lowest consensus at the top of the review queue.
Being able to order assets or labels in the queue and to assign priorities will make your life easier.
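Concretely, ordering a review queue by consensus is a one-liner; the queue below reuses the made-up consensus scores from the earlier example:

```python
# Illustrative review queue: lowest-consensus assets are reviewed first.
queue = [
    {"id": "asset_001", "consensus": 1.00},
    {"id": "asset_002", "consensus": 0.33},
    {"id": "asset_003", "consensus": 0.00},
]
for rank, asset in enumerate(sorted(queue, key=lambda a: a["consensus"]), start=1):
    print(rank, asset["id"], asset["consensus"])
```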
Track the project's progress
Project management is very important to stay within your constraints of time, budget and available resources. Being able to keep track of the evolution of the project is an absolute need, and not being able to get a quick overview of it is very painful.
KPIs and metrics are very important in an annotation tool. You need to know: how many assets have been labeled so far? How many are still left to do? How many labels are there in each category? How many assets are labeled per day? What is the mean time spent per annotator per asset?
You cannot spend time every day computing statistics to get this information. You cannot waste your time in spreadsheets updating KPIs to report on the evolution of the project. All this information needs to be easily accessible and continuously updated.
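To illustrate the kind of KPIs involved, here is a sketch computing them from a toy label log with pandas; in practice a platform exposes them directly, and the columns and numbers below are assumptions:

```python
import pandas as pd

# Toy label log; a real platform would expose this through its interface or API.
labels = pd.DataFrame({
    "asset_id":  ["a1", "a2", "a3", "a4", "a5"],
    "category":  ["kite", "kite", "kitesurfer", "kite", "kitesurfer"],
    "annotator": ["alice", "bob", "alice", "bob", "alice"],
    "date":      pd.to_datetime(["2023-05-01", "2023-05-01", "2023-05-02",
                                 "2023-05-02", "2023-05-02"]),
    "seconds":   [40, 55, 62, 35, 48],
})

total_assets = 10_000                                  # assumed dataset size
labeled = labels["asset_id"].nunique()
print("labeled so far:", labeled, "- still to do:", total_assets - labeled)
print("labels per category:\n", labels["category"].value_counts())
print("assets per day:\n", labels.groupby(labels["date"].dt.date)["asset_id"].nunique())
print("mean seconds per asset per annotator:\n", labels.groupby("annotator")["seconds"].mean())
```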
At Kili, you can find a comprehensive overview of the progress of the project, with metrics and KPIs.
Working in the annotation domain, I know how painful annotation can be. That is why Kili exists. We are working hard to provide the best annotation platform and make your day-to-day annotation life easier. For further reading, I recommend having a look at Best Practices for Managing Data Annotation Projects by Bloomberg, an article that synthesizes good practices for handling an annotation project.
Best Practices for Managing Data Annotation Projects: Tina Tseng, Amanda Stent, Domenic Maida, https://arxiv.org/abs/2009.11654