How to build an efficient data labeling plan
Sometimes, you need to label the data you'll use for your AI model yourself. Find out how to build an efficient labeling plan here.
You need a lot of labeled data to train your ML models. Although it's not always necessary, you will sometimes have to label the data you use to train your model yourself. Annotating data in an end-to-end ML project is both important and time-consuming. To do it correctly, you will have to make decisions about how you will label your data. Into which classes do you want to separate your data, for example? It is the set of all these decisions, made in order to properly annotate your data, that we will call the “data labeling plan”.
Unfortunately, although it plays a major role in the successful training of an AI model, creating an efficient data labeling plan is neither a well-defined nor a well-documented process. The goal of this article is to give you insight into the process of building a labeling plan. For the sake of simplicity, we will focus only on the case of a multi-class classification problem. Each machine learning problem has its own particularities regarding labeling plans, so what we present here is the general approach, aimed at answering the main questions you should ask yourself when building your own labeling plan.
We will divide this article into two parts. First, we will look at the decision process relative to your labeling plan, based on machine learning principles. Then, we will focus on the decision process based on expertise in annotation.
I. Labeling plan: Decision process based on machine learning
As we consider only multi-class classification problems, you must identify the classes into which you want to separate your data. The question may appear trivial, but it is not. For example, let's say you initially decided to divide your data into 10 different classes, but the 10th one is not represented well enough to train your model. Should you still train your model with so few examples? Do you really need to correctly classify the data from this class? And if you don't, could you just ignore it? These are the kinds of questions we want to ask ourselves when building a labeling plan.
1. Choosing the number of classes: things to consider
What does your problem tell you?
The information given by your problem is the first thing to consider. To be more explicit: does the definition of the problem you want to solve directly give you the number of classes to consider? For example, in the case of the MNIST dataset, we have a clearly defined problem: to correctly classify each handwritten digit. We have ten digits, so we divide the data into ten classes. Make sure you have a good understanding of your problem when choosing the number of classes.
This is your first insight into how many classes you will need. Although it's always a good indicator, it may happen that you cannot separate your data according to the definition of the problem, for example when you don't have enough data on a specific class for your model to train properly. This leads us to our second point.
How much data do you have?
If you want your model to separate each class correctly, it needs a sufficient quantity of data for each one. If not, even if you choose the right number of classes, you will not be able to classify some of your data: this is known as the problem of imbalanced data. Instead of trying data augmentation techniques (which we will not cover in this article), you may check whether the under-represented classes share similarities that would let you regroup them; the grouped classes can later be separated again using a nested representation. You may also consider retrieving and labeling more data if you're able to, an option that is often disregarded in many AI projects.
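As a minimal sketch of this check, with made-up labels and an arbitrary 2% rarity threshold (both are assumptions for illustration, not fixed rules), you could flag under-represented classes and regroup them like this:

```python
from collections import Counter

# Hypothetical labels: class "j" is under-represented.
labels = ["a"] * 500 + ["b"] * 480 + ["c"] * 450 + ["j"] * 12

counts = Counter(labels)
total = sum(counts.values())

# Flag any class holding less than 2% of the data (threshold chosen
# arbitrarily for this sketch; tune it to your own project).
rare = {cls for cls, n in counts.items() if n / total < 0.02}
print(rare)  # {'j'}

# One option: merge all rare classes into a single "other" class.
merged = [lbl if lbl not in rare else "other" for lbl in labels]
print(Counter(merged))
```

If "other" later turns out to matter for the business, its members can still be separated again in a second, nested classification step.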
The constraint of transfer learning
One particular case that needs to be addressed is when you are provided with an already-trained model and must use it to solve your problem. This is known as transfer learning and, even though it is widely used, it has some limitations. Transfer learning only works if the initial and target problems are similar enough; if not, it ends up decreasing the performance or accuracy of the new model. You will then have to adapt your classes to the pre-trained model in order not to degrade its performance. This constraint can give you an insight into the number of classes into which you want to separate your data.
What are your business needs?
The last, and in many cases the most important, consideration is the question of business needs. Let's take another example. You are working on a problem for your company: classifying different species of fish. Your data contains many species, but what your company really wants is to distinguish tuna perfectly from all the rest. Therefore, you only need two classes: one for tuna, and a second that regroups every other species. Be careful to identify what the business really needs, and do not try to solve a problem that has not been defined by your constraints.
2. Choosing the number of classes: Heuristic approaches
When you have to decide on the number of classes, one interesting approach is to visualize your data on a plane using a projection. You will then be able to see how statistical methods separate your data. Dimension reduction techniques can be pretty useful in this case.
The first one we recommend is the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm. It helps in visualizing high dimensional data (such as images) by giving each data point a location in a two- or three-dimensional map.
Although it is a powerful algorithm, t-SNE is no longer the state of the art. The state of the art now is Uniform Manifold Approximation and Projection (UMAP), which computes much faster than t-SNE. The goal here is to use dimension reduction to visualize your data and see how optimized algorithms separate it. It is important to note that these statistical methods serve only as decision aids, not as solutions.
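A minimal sketch of this workflow, using scikit-learn's t-SNE on synthetic data standing in for your real dataset (umap-learn's `UMAP` class exposes a similar `fit_transform` interface). The blob data, cluster parameters, and perplexity value are all assumptions made for the example:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for your data: two well-separated blobs in 16 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 16)),  # cluster centered at 0
    rng.normal(5.0, 0.5, size=(50, 16)),  # cluster centered at 5
])

# Project to 2D; perplexity must be smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2)

# To inspect the separation visually (matplotlib assumed available):
# import matplotlib.pyplot as plt
# plt.scatter(emb[:, 0], emb[:, 1])
# plt.show()
```

If the projection shows fewer clearly separated clusters than the classes you planned, that is a hint (not a verdict) that some classes may be worth merging.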
After all these steps, you should have correctly chosen the number of classes you need. You can then decide what those classes are for your multi-class classification problem.
Those are the decisions made before you start labeling your data. But you can, if you feel the need, adjust them while you are in the midst of the labeling process, depending on the problems you encounter as you label. Our next section focuses on this aspect of the labeling plan.
II. Labeling plan: Practice based on expertise
Let's say you have finished deciding what your classes will be, and how many, for your multi-class classification problem. Now you have to start labeling your data. Many labeling projects involve more than one labeler. Expertise-based labeling methods promote the idea of having several labelers annotate the same data. The goal is simple: even if it is time-consuming, we want to improve the quality of the labels. By doing this, we make sure that the labels given to the data are correct.
A way to better understand this is to look at ML models as overspecialized humans. If humans disagree among themselves, a model won't do better. And even if it did, it's a sign that your labeling plan (the definition of the classes) is not clear enough, or that you split two classes when you shouldn't have.
To manage the fact that many people will annotate the same data, we use two notions. The first is consensus, which measures the agreement between the different annotations of a given piece of data. The second is the honeypot, which consists of setting up a ground truth and comparing it to the annotations given by the labelers. This lets us audit the annotators' work quality throughout the annotation process.
Let's have a look at how consensus and the honeypot work with some examples. Say you have two annotators working on an image detection project, annotating the same image. We then compute the consensus, measuring the agreement between the two labelers, and show the result below:
And for the honeypot, which helps you measure the accuracy of annotations, you can see the result of its computation in the example below:
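To make these two notions concrete, here is a minimal sketch in plain Python. The annotations, asset names, and ground-truth values are all invented for the example, and consensus is simplified to "share of labelers agreeing with the majority label" (real platforms use more refined metrics, especially for detection tasks):

```python
from collections import Counter

# Hypothetical annotations: 3 labelers classifying 5 assets.
annotations = {
    "asset_1": ["cat", "cat", "cat"],
    "asset_2": ["cat", "dog", "cat"],
    "asset_3": ["dog", "dog", "dog"],
    "asset_4": ["dog", "cat", "bird"],  # total disagreement
    "asset_5": ["bird", "bird", "bird"],
}

def consensus(labels):
    """Fraction of labelers agreeing with the majority label."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

scores = {asset: consensus(lbls) for asset, lbls in annotations.items()}
print(scores)  # asset_4 gets 1/3: a signal the classes may be unclear

# Honeypot: compare one labeler's answers against a known ground truth
# planted on a subset of the assets.
ground_truth = {"asset_1": "cat", "asset_3": "dog", "asset_5": "bird"}
labeler_0 = {asset: lbls[0] for asset, lbls in annotations.items()}
hits = sum(labeler_0[a] == gt for a, gt in ground_truth.items())
honeypot_accuracy = hits / len(ground_truth)
print(honeypot_accuracy)  # 1.0
```

Low consensus on an asset flags an ambiguous class definition; low honeypot accuracy flags a labeler whose work needs review.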
Labeling expertise then gives us a way to validate the chosen labeling plan: try annotating 10% of your data using consensus between your labelers. Depending on the result, you can continue with your labeling plan or refine it. If some classes are over-represented in your dataset, it may make sense to separate them. Conversely, if some classes are not represented enough, or if the annotators disagree on some classes (the consensus will be low), you should group those classes. You can also look at the consensus per category to decide which classes to group.
The labeling plan is a crucial part of many machine learning projects. In the case of a multi-class classification problem:
Make sure to clearly identify the number of classes you need. Things to take into account are the quantity of your data, the specificity of any provided AI model, and your business needs.
Once you decide how many classes you'll use, test them on a small amount of data with different labelers. Use the honeypot and consensus to adjust your plan if it doesn't fit.
In this article, we learned how to build an efficient labeling plan by taking the case of a multi-class classification problem. Every machine learning problem has its particularities with respect to how you design your labeling plan. However, the general remarks outlined here on validating your annotation plan should prove relevant and useful.
Here at Kili, we offer a platform to annotate your data with consensus and thus create both a successful labeling plan and process. Learn more about Kili by checking our website.