My State-Of-The-Art Machine Learning Model does not reach its promised accuracy: What can I do?
Data Quality as a first response
The ultimate goal of every data scientist or company building ML models is to create the best model, with the highest predictive accuracy in production. Usually, we start from state-of-the-art algorithms available online, which come with astounding performance. Yet, once trained and deployed, accuracy on our own dataset is rarely similar. So where can we start to reach the accuracy our application desperately needs?
A Machine Learning application needs two ingredients: a Machine Learning algorithm and data. If the algorithm used is state-of-the-art, we can conclude that the data is actually the culprit. And if the data is of poor quality, regardless of how good the model is, the results will always be subpar at best.
We have to keep in mind that ‘garbage in, garbage out’.
Our goal is to increase model accuracy by focusing on good data quality; to that end, we can improve the quality of data labels or increase dataset size.
To improve label quality, we first need to understand how labels are generated. Labeled datasets are frequently built through crowdsourcing or with some automatic system, and these methods are error-prone according to a study conducted by Google. Recently, another paper from MIT researchers estimated an average of 3.4% label errors in the test sets of 10 commonly used datasets. This is all the more alarming as, in some applications, even 99% accuracy is not enough: fully autonomous self-driving cars and medical applications are just two examples.
Annotating with crowdsourcing and automatic systems can therefore bring organizations results that are unexpected and, above all, unsatisfying. One of them is that the accuracy of a state-of-the-art supervised machine learning model will not be as good as advertised.
We will compare the advantages of having a clean dataset with those of having a lot of noisy data. Then, since data quality limits the accuracy of supervised machine learning models, how do we clean datasets? To answer this question, we will explore some techniques and tools.
In the end, we will show how crucial data quality is to getting the maximum potential out of our Machine Learning models.
Data quality makes a real impact on models
How accuracy decreases with noisy labels
An experiment done inside the Kili team shows how accuracy decreases when data is not correctly labeled. In this experiment, we took the image dataset Cifar-10 to train the well-known ResNet50 machine learning model. Cifar-10 is a dataset of sixty thousand images with 10 uniformly distributed classes, meaning there are six thousand images per class. We assumed for our experiments that the original Cifar-10 dataset has no noise.
We define noise as the percentage of correct labels that have been replaced by incorrect labels. For example, to turn a correctly labeled asset into a noisy one, a horse image labeled horse can be relabeled as, say, ship.
The image below shows images taken from Cifar-10 in which noise has been added.
In the experiment, we changed the labels of a subset of the whole image dataset. The images were chosen at random, and the size of the subset ranged from 0% to 30% of the dataset. We then merged the clean and noisy subsets back into a unified dataset and used it to train the ResNet50 model.
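As a minimal sketch of this noise-injection step (using only the standard library; `add_label_noise` and the placeholder label list are hypothetical names, not the code we actually ran):

```python
import random

def add_label_noise(labels, noise_rate, num_classes=10, seed=0):
    """Randomly replace a fraction `noise_rate` of labels with a different class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_noisy = int(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_noisy):
        # Pick a wrong class uniformly among the other num_classes - 1 classes,
        # skipping the current label so the flipped label is always incorrect.
        wrong = rng.randrange(num_classes - 1)
        noisy[i] = wrong if wrong < noisy[i] else wrong + 1
    return noisy

# Example: flip 10% of 60,000 CIFAR-10-style labels (hypothetical label list).
clean = [i % 10 for i in range(60000)]
noisy = add_label_noise(clean, noise_rate=0.10)
```

Because the replacement class is drawn from the nine other classes, exactly the chosen fraction of labels ends up incorrect, which is what our noise percentage measures.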
After training the model and measuring accuracy on the test set for each noise level, we can highlight that as data quality decreases, the asymptotic value of the maximum accuracy on the test set decreases.
The maximal accuracy of the model is 0.72 with 0% noisy labels, 0.66 with 5% noise, and 0.63 with 10% noise. From this, we find that accuracy can drop by up to 8% when data quality loses 5%, and by 13% when data quality loses 10%.
Accuracy can drop by up to 8% when data quality loses 5%, and by 13% when data quality loses 10%.
Furthermore, accuracy seems to be less stable: local minima appear in the 5% and 10% noisy-label runs. These local minima could yield an unsatisfactory trained model, which can be threatening for our production application.
We confirmed that our machine learning algorithm is genuinely learning, since it performs far better than random choice; yet, with the same model, accuracy decreases as data quality decreases.
To go deeper into the analysis, we wanted to explore the impact of data quality on accuracy by varying dataset size at several noise levels.
For this second experiment, we extracted a percentage of the dataset (20%, 40%, 60%, 80%, or 100%), added label noise, and finally trained the ResNet50 neural network on the reduced dataset.
After training and comparing the results, it should be emphasized that you only need to correctly label two-thirds of your dataset to reach the same performance as a full dataset containing 5% incorrect labels. And when a dataset contains 10% incorrect labels, correctly labeling just over half of the dataset gives the same performance.
You only need to correctly label two-thirds of your dataset to reach the same performance as a full dataset containing 5% incorrect labels.
Besides, sensitivity to the first 10% of noisy labels is high: our results show that with ResNet50, each 1% of noise added to the dataset costs about 0.85% of accuracy.
So with less data, cleaned and curated, we get better results than with much more data carrying noisy labels.
What we have seen so far shows that data quality is crucial to obtaining better results efficiently. This leads us to the following question: how do we ensure quality data for the models?
Improving data quality through efficient workflows and tools
To tackle the problem of data quality, there are many tools, methods, and algorithms that can be leveraged. We will explore how to improve the dataset annotation process and how to detect mislabeled assets.
Value of working with a professional workforce.
The application of Machine Learning in areas such as biology, chemistry, physics, medicine, and many others needs specialized datasets. To build these datasets and the associated Machine Learning models, data scientists must collaborate with professionals of the dataset’s field, whose insights lead to better annotations. That approach is used in best-in-class laboratories, both public and private. Teams that have created breakthroughs have followed this pattern; one example is AlphaFold 2, which uses PDB-annotated data to predict a protein’s structure from its sequence of amino acids.
A professional workforce is critical, since a machine learning model’s real “intelligence” comes from a properly labeled dataset. Therefore, the right people to label specialized data are the specialists in the subject: doctors, engineers, scientists, farmers, among others. But is a professional workforce enough to get perfect labels?
Well-defined instructions are essential for labelers.
Does the work of all of your labelers look the same? This question is relevant whether you have 2, 5, or 89 data labelers working on the same dataset, because after model training, predictions may otherwise take unexpected forms.
In the image below, we can see three different ways to label the single class kitesurf. Labelers can group all kites and their surfers in a single annotation, annotate each kite separately, or annotate each kite together with its surfer. More generally, labelers can adopt different ways of annotating.
This example shows that agreement in annotations cannot be reached without well-specified instructions on how to label a class. And if there is no agreement, model accuracy decreases, because the machine learning model gets confused learning different patterns for the same object.
Moreover, professional labelers are not exempt from these errors; on the contrary, it can be even worse because of an overconfidence bias that gives them the impression they are doing the only right job. Our example uses an image, but this is true for any type of asset: image, video, text, audio, etc.
Once instructions are well-defined and a professional workforce is annotating, what if a labeler has understood the instructions but still makes several mistakes anyway? The following methods help to mitigate labelers’ errors.
Consensus and gold standard as methods to measure agreement in annotations.
We have already seen that instructions are very important so that labelers follow a single, unambiguous labeling pattern. However, we have not yet detailed how to detect or measure discrepancies.
We will explore two complementary methods to make consistent data annotations: consensus and the gold standard.
Consensus involves having the same data labeled multiple times and only using assets on which all the labelers agree. With multiple annotations, we can compute a metric that represents the agreement between them; for image assets, the well-known intersection-over-union (IoU) method can be used.
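For bounding boxes, IoU can be sketched in a few lines (a minimal illustration with hypothetical boxes given as `(x1, y1, x2, y2)` corner coordinates, not a labeling platform’s actual implementation):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two labelers annotate the same kite with slightly shifted boxes:
# high overlap means high consensus.
print(iou((0, 0, 10, 10), (1, 1, 11, 11)))  # ≈ 0.68
```

A score near 1 means the two annotations almost coincide; a score near 0 means the labelers drew boxes around different things.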
To better understand the method, consider an example: imagine you are training an object detection model to recognize kitesurfing objects. Unfortunately, in some annotations only kites were labeled, whereas in others kites with their surfers were labeled. To improve the correctness of annotations, we can include a second labeler who annotates the kites, then correct or remove the examples on which the labelers disagree.
Here is an example of consensus on an image asset. The first image has a higher consensus score, and thus better agreement, than the second. More precisely, the consensus score for the first image is almost perfect, whereas the lower score on the second image signals a disagreement.
If consensus is high, we can say that labelers are following the same pattern; if it is low, we should clarify the labeling instructions or correct the labelers who are not following the rules.
Now imagine a case where all of your labelers agree but do not correctly follow all the instructions. The gold standard allows monitoring that annotations follow a pattern we defined a priori as correctly labeled.
In the same way that consensus compares the similarity of annotations made by different people, the gold standard also compares annotations, but this time it compares labelers’ annotations with a set of annotations we designate as ground truth, i.e. labels that we define as correct.
Ground truth is a set of annotations made very carefully, following all the guidelines, to be as close to perfect as possible. Once the ground truth is made, labelers can start labeling the dataset. When all assets are labeled, we compare the ground-truth annotations with the labelers’ annotations. High agreement between ground truth and labelers means the labelers are doing a good job of following the instructions; anything less means the labelers have to correct their annotations.
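For classification labels, this gold-standard check can be as simple as the fraction of assets on which a labeler matches the ground truth (a minimal sketch; the function name and the toy labels below are hypothetical):

```python
def gold_standard_agreement(labeler_annotations, ground_truth):
    """Fraction of assets where the labeler's annotation matches the ground truth."""
    assert len(labeler_annotations) == len(ground_truth)
    matches = sum(a == g for a, g in zip(labeler_annotations, ground_truth))
    return matches / len(ground_truth)

# Toy example: the labeler disagrees with the ground truth on one asset out of five.
ground_truth = ["kite", "kite", "boat", "kite", "boat"]
labeler = ["kite", "boat", "boat", "kite", "boat"]
print(gold_standard_agreement(labeler, ground_truth))  # 0.8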
For example, taking again the kitesurf classification example, we set an instruction that says:
- The kitesurf class must cover only the kite; it must not include the surfer.
With the gold standard, we have a reference against which we can measure whether assets are being correctly annotated.
In the end, the secret sauce is to iterate on instructions and consensus, refining the instructions until all labelers annotate consistently.
Leveraging best practices from software development to annotation: Review Workflow + Questions + Issues.
We have just introduced the tools needed to start labeling correctly: giving precise instructions, and detecting errors and measuring the differences between annotations through consensus and the gold standard. To continue with our process, the next step is to describe the communication flow between labelers and reviewers.
Inspired by best practices in software development, reviewing is crucial to building better products. Code review in software development transfers knowledge from more experienced developers to more junior ones; it improves code quality, prevents edge-case errors, and can improve performance, among other benefits.
To adapt these good software development practices to annotation platforms, the platforms must implement a series of tools that make collaboration between labelers and reviewers fast and effective. To this end, labelers and reviewers should each have an interface adapted to their role. Within the labeler interface, it is useful to be able to ask reviewers questions through a chat-like or forum-like interface.
Reviewers, on the other hand, besides having access to labelers’ questions, should be able to create issues when they detect faulty labeling. Their interface should also display the consensus and gold-standard metrics described above, which, as we have mentioned, are of great importance.
Combined, all the guidelines described above will take you a long way toward much better data quality, which in turn secures the performance of your model. These practices have proven their worth among industry leaders with a decade of experience in the field, and they keep proving it in the applications those leaders develop.
A very active field under experimentation.
The best way today to create a labeled dataset of excellent quality is to have a professional workforce annotate it through a proper workflow. However, technology can help: dedicated algorithms and automated tools can be of great assistance to professionals in the task of labeling datasets.
Research on the problem of noisy datasets is carried out by big players from both academia and the private sector, such as Google, Amazon, and MIT: the same players that started creating labeled data when most companies were still struggling with their digital transition. Today, they are still actively improving how we get the best out of algorithms by fueling them with the best data. The research described here was published in 2021, which shows how relevant data quality is today.
Last year, MIT researchers created Cleanlab, a tool to find label errors in image, text, and audio datasets. The scholars tested their tool on 10 of the most popular image, text, and audio datasets, including Cifar-10, ImageNet, IMDB, and AudioSet.
In one of their publications, they show that these datasets, which are widely used to benchmark new and improved Machine Learning algorithms, contain errors. The image below shows some of the label errors found in the MNIST dataset.
These labeling-error detection algorithms are recent, and their effectiveness still has to be tested in environments outside academia.
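Setting Cleanlab’s actual API aside, the core idea behind confident learning can be sketched in a few lines: flag examples whose given label receives a predicted probability below the model’s average confidence for that class. This is a deliberately simplified illustration with toy data, not Cleanlab’s implementation:

```python
def find_likely_label_errors(labels, pred_probs):
    """Flag examples whose given label's predicted probability falls below the
    average confidence the model shows for that class. A crude per-class
    threshold inspired by confident learning; real tools like Cleanlab do more."""
    num_classes = len(pred_probs[0])
    # Per-class threshold: mean predicted probability of class c
    # over the examples that are labeled c.
    thresholds = []
    for c in range(num_classes):
        confs = [p[c] for p, y in zip(pred_probs, labels) if y == c]
        thresholds.append(sum(confs) / len(confs))
    return [i for i, (p, y) in enumerate(zip(pred_probs, labels))
            if p[y] < thresholds[y]]

# Toy example: the third example is labeled 0, but the model is confident it is class 1.
labels = [0, 0, 0, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(find_likely_label_errors(labels, probs))  # [2]
```

Flagged indices are candidates for human review rather than automatic relabeling, which matches how such tools are used in practice.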
Conclusion and next steps
In this work, we provided empirical evidence of the impact of data quality on model accuracy, and also showed that improving data quality is more efficient than increasing dataset size when it comes to raising model accuracy. Then, to improve data quality, we adopted best practices from the software development workflow and showed how valuable labeling platforms are for improving data quality and getting the most out of your model.
To go further, it remains to test Cleanlab on real-world data outside of academic fields, to investigate the speed of its algorithm, and to check its usability on large datasets.
At Kili Technology, we hope to raise awareness of the value that can come from building excellent data. By improving data quality, supervised models can reach their promised value and be taken to production at scale; conversely, it allows discarding models that are the best in the laboratory but only average in production, and thus focusing on the models and data that really matter.
- He, T., Yu, S., Wang, Z., Li, J. & Chen, Z. (2019, June 10). From Data Quality to Model Quality: an Exploratory Study on Deep Learning. arXiv.org. https://arxiv.org/abs/1906.11882.
- Northcutt, C., Athalye, A. & Mueller, J. (2021, April 8). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. arXiv.org. https://arxiv.org/abs/2103.14749.
- Northcutt, C., Jiang, L. & Chuang, I. (2021, April 14). Confident Learning: Estimating Uncertainty in Dataset Labels. Journal of Artificial Intelligence Research. https://doi.org/10.1613/jair.1.12125
- Northcutt, C., Chuang, I. & Athalye, A., Cleanlab, Github repository, https://github.com/cgnorthcutt/cleanlab
- Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L., (2021, May 6). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, Article 39, 1–15. https://doi.org/10.1145/3411764.3445518
- Press, Gil, (2021, June 16). Andrew Ng Launches A Campaign For Data-Centric AI. Forbes.com. https://www.forbes.com/sites/gilpress/2021/06/16/andrew-ng-launches-a-campaign-for-data-centric-ai/