Enhancing a model's performance can be challenging at times. I'm sure many of you have found yourselves stuck in this situation: you try all the strategies and algorithms that you've learned, yet performance does not increase significantly. As a result, we almost always take a model-centric approach, focusing on improving models and not data.
At the same time, high-quality labels for training data are critical for model-building success, as detrimental items in the dataset can adversely impact model performance.
Many reasons could explain why we take a model-centric approach. First, machine learning competitions and benchmarks are model-centric. Second, in academia, researchers focus on improving models without much concern for the data itself. This bias is then reproduced in companies.
Models are sexy; data is not. This should be rectified.
However, it turns out that, once the best-performing model has been chosen, cleaning the data improves performance more than trying to improve the model, as Andrew Ng showed recently in a presentation.
Source: A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. Andrew Ng. YouTube. https://youtu.be/06-AZXmwHjo
In Machine Learning projects, once we have labeled data and a trained machine learning model, to improve performance, we need to improve the model or improve the data, or both.
Since improving the data can give better results than improving the model, as shown by Andrew Ng, this article explores a data-centric approach to improving model performance.
From this perspective, we have two ways to improve data: cleaning it by modifying labels $y$, or increasing it by adding more pairs $(x, y)$.
We will start by showing how to clean the dataset and then how to increase the data. The first section, about cleaning a dataset, is divided into two parts: error analysis using some tools recently developed in academia, followed by relabeling of mislabeled annotations.
Finally, we will present the most common methods to increase data.
In our task of cleaning data, we have to keep in mind the dataset size, the nature of our dataset (images, voice, text, etc.) with labels, and, last but not least, the task we want to solve with machine learning.
Dataset size is important because we will prefer some methods over others depending on the time and resources available for the project. For example, when we deal with millions of assets and do not have much time to deliver our product, data cleaning algorithms can help. By contrast, if our dataset is on the order of thousands of assets or fewer, it can be handled by a group of humans.
The nature of the dataset and labels is important because the complexity of verifying and annotating could be very different; this will be reflected in price and time.
Pricing Amazon Mechanical Turk - https://aws.amazon.com/sagemaker/groundtruth/pricing/
And finally, the machine learning task matters when choosing an error detection algorithm, as described in more detail in the following section.
We will focus on detecting errors by identifying patterns. For example, these could be categories that are frequently mislabeled, a pattern that does not have enough assets to be learned by the machine learning model, or simply assets mislabeled unintentionally by annotators.
For example, in the image below, there is a Roman numeral one. This image is unusual because of the lines outside the numeral. If this pattern is found in only a few images of the training dataset, our ML model will not easily predict the correct Roman numeral.
Now let's separate two cases for error analysis. In the first one, we will have a dataset of no more than a few thousand assets, and in the second one, datasets of more than several thousand assets.
For small datasets, we would rather have humans as the only labelers, as they are more reliable than algorithms at detecting errors in the labels.
Labeling assets has to follow a particular workflow to avoid annotating them incorrectly once again. Basically, we need a trained workforce to annotate our data, following well-defined annotation instructions.
Even when using a trained workforce, it is highly recommended to use more than one annotator for a single asset, so that human agreement can serve as a proxy for label quality. Throughout the whole process, human reviewers should also be available to handle any questions and issues raised by labelers. For a more detailed description of this workflow, you can read this article.
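As a rough illustration of using human agreement as a quality proxy, here is a minimal sketch for a classification task. It assumes each asset was labeled by several annotators; the asset ids, the labels, the majority-vote rule, and the `REVIEW_THRESHOLD` value are all hypothetical choices made for this example, not part of any specific labeling tool.

```python
from collections import Counter


def consensus(annotations):
    """Return (majority_label, agreement) for one asset.

    `annotations` is the list of labels given by different annotators.
    Agreement is the fraction of annotators who chose the majority label.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


# Hypothetical annotations: asset id -> labels from three annotators.
annotations_by_asset = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "cat", "bird"],
}

# Arbitrary threshold: assets below it are sent to a human reviewer.
REVIEW_THRESHOLD = 0.6

to_review = [
    asset_id
    for asset_id, labels in annotations_by_asset.items()
    if consensus(labels)[1] < REVIEW_THRESHOLD
]
print(to_review)
```

Here only the asset on which all three annotators disagree falls below the threshold. For tasks other than classification, the agreement measure would have to be adapted to the label type.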
Moreover, we can also take advantage of the predictions of a trained ML model by using a human-model agreement. In human-model agreement, we compare the predictions of our model and the annotations of our labelers with some metric depending on the Machine Learning task.
Following the above process, we can identify the assets on which we are making mistakes and then take action to increase the human-model agreement. The actions could be:
Improve our tagging instructions.
Increase data in the categories where our model makes wrong predictions.
Re-label the incorrect assets.
This same process can be done several times to improve our labels continuously.
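For a classification task, the human-model agreement described above could be sketched as follows. This is only an illustration under simple assumptions: the metric is exact label match, and the `human` and `model` lists are made-up data. Other tasks would need other metrics.

```python
from collections import defaultdict


def human_model_agreement(human_labels, model_predictions):
    """Compare labeler annotations with model predictions.

    Returns the overall agreement, the agreement per human-given category,
    and the indices of the assets on which human and model disagree.
    """
    total = defaultdict(int)
    agreed = defaultdict(int)
    disagreements = []
    for index, (human, model) in enumerate(zip(human_labels, model_predictions)):
        total[human] += 1
        if human == model:
            agreed[human] += 1
        else:
            disagreements.append(index)
    per_category = {category: agreed[category] / total[category] for category in total}
    overall = sum(agreed.values()) / len(human_labels)
    return overall, per_category, disagreements


# Made-up labels and predictions for six assets.
human = ["i", "ii", "ii", "iv", "iv", "iv"]
model = ["i", "ii", "iii", "iv", "vi", "iv"]

overall, per_category, disagreements = human_model_agreement(human, model)
print(overall, per_category, disagreements)
```

The per-category scores point to the categories where instructions or data may need work, and the list of disagreeing indices gives the assets to inspect or re-label first.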
The big challenge of error analysis on big datasets is time. Verification of labels by people can be time-consuming and consume significant financial resources. Algorithms can help with this issue: a single computer can process much more data than a human, and computing power can easily be scaled further using cloud services.
Different algorithms can be used to detect mislabeled assets depending on the machine learning task. For example, the task could be classification, object detection, NER, relations, transcription, etc.
At a high level, Cleanlab works as follows. First, it trains a classification model on our noisily labeled data. Once the model is trained, it gives us a predicted probability matrix of size $N \times M$, where $N$ is the number of assets and $M$ is the number of categories. For example, for the MNIST training set, which has 60,000 images and 10 categories, we obtain a predicted probability matrix $P$ of size $60000 \times 10$.
Cleanlab then uses this matrix $P$ together with the labels $Y$ and returns the assets that it determines to have label issues.
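To give an intuition of how $P$ and $Y$ can be combined, here is a simplified sketch inspired by the confident learning paper cited in the references. It is not Cleanlab's actual implementation: the function name `find_label_issues_sketch`, the per-class threshold rule used here, and the toy data are assumptions made for illustration.

```python
def find_label_issues_sketch(pred_probs, labels):
    """Flag assets whose confidently predicted class differs from the given label.

    pred_probs: N x M nested list of predicted probabilities (the matrix P).
    labels: list of N integer labels in range(M) (the labels Y).
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class j
    # over the assets whose given label is j.
    thresholds = []
    for j in range(n_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, labels) if y == j]
        # A class with no labeled asset can never be "confident".
        thresholds.append(sum(probs_j) / len(probs_j) if probs_j else float("inf"))

    issues = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose probability reaches their own threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue
        likely = max(confident, key=lambda j: p[j])
        if likely != y:
            issues.append(i)
    return issues


# Toy data: the last asset is labeled 1 but looks strongly like class 0.
P = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.20, 0.80],
    [0.10, 0.90],
    [0.95, 0.05],
]
Y = [0, 0, 1, 1, 1]

issues = find_label_issues_sketch(P, Y)
print(issues)
```

In practice, $P$ should come from out-of-sample predictions (for example via cross-validation) so that the model has not simply memorized the noisy labels.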
Here is an example of the results that Cleanlab yields on the MNIST dataset with a simple CNN model. MNIST is an image dataset that contains handwritten digits from zero to nine.
The results are interesting because, in most cases, images could belong to two classes. As an example, the last image in the first row could be a number 4 but also a number 9.
The algorithm detects that 0.065% of the MNIST dataset has incorrect labels. Here is another example, using Cleanlab on the dataset of Andrew Ng's data-centric challenge with a simple CNN model. This time, according to Cleanlab, 18.67% of the dataset has label errors.
However, we can see false positives, such as the sixth image in the third row. That image is clearly a six in Roman numerals.
Cleanlab is easy to use; the main difficulty comes from choosing a good model to train on our data. In the context of this article, however, we have already trained a model that allows us to iterate fast. The next part therefore deals with tasks that are not classification tasks.
For object detection, tracking, or named-entity recognition (NER) tasks, it is possible to convert them to a classification task by cropping the selection inside our assets. This mechanism generates another data point, which is a subset of the original asset with a class label.
For example, in image object detection, $x$ can be an RGB image of size 1920x1080x3 and $y$ a label dog with a list of coordinates indicating the region of the image where a dog is located. We can use these coordinates to crop another image, denoted $\hat x$, which is equal to or smaller than the original image, and $\hat y$ will be the label dog. So now we have a dataset in classification form for $X$ and $Y$. However, this approach cannot tell whether a bounding box is tight or not.
The same procedure can be repeated for NER and tracking tasks.
After converting to a classification task, we can use Cleanlab to detect incorrectly labeled assets.
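The cropping step described above could be sketched as follows. The annotation format (a dict with `label` and `box` keys, and boxes given as `(x_min, y_min, x_max, y_max)` in pixels with exclusive upper bounds) is an assumption made for this example; real annotation exports will differ. Images are plain nested lists here to keep the sketch dependency-free.

```python
def crop(image, box):
    """Crop an image stored as a list of pixel rows.

    box = (x_min, y_min, x_max, y_max) in pixels; x_max and y_max are exclusive.
    """
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def to_classification_dataset(images, annotations):
    """Turn object detection data into (crop, label) classification pairs.

    annotations[i] is the list of labeled boxes for images[i].
    """
    crops, labels = [], []
    for image, objects in zip(images, annotations):
        for obj in objects:
            crops.append(crop(image, obj["box"]))
            labels.append(obj["label"])
    return crops, labels


# Toy single-channel image with 4 rows and 6 columns; pixel value = 10 * row + column.
image = [[10 * r + c for c in range(6)] for r in range(4)]
annotations = [[
    {"label": "dog", "box": (1, 0, 4, 2)},
    {"label": "cat", "box": (3, 2, 6, 4)},
]]

x_hat, y_hat = to_classification_dataset([image], annotations)
print(y_hat, x_hat[0])
```

Each crop can then be passed to a classifier to obtain the predicted probabilities needed for label-error detection.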
For other tasks that cannot be converted to classification tasks, it is more complicated to filter mislabeled assets because there is no standard tool to automate the task. However, this article proposes anomaly detection as an alternative tool for detecting incorrectly labeled assets.
Anomaly detection, also called outlier detection, finds assets that do not conform to an expected pattern. As there is no labeled data on anomalies, we are facing a case of unsupervised learning (UL). The best-known techniques in UL are:
Clustering: Nearest neighbors
Statistics: Anomalies are points that are less likely to be generated by the model
Information theory: Entropy measure, Kolmogorov complexity
In our case, we apply a simple trick for anomaly detection: instead of using only the assets (images, video, sound, etc.) as input, we use both the asset and the label, that is, the pair $(X,Y)$.
For example, for translation tasks, where we have text as input and output, we can embed both phrases to get two vectors. We can then concatenate them and apply an anomaly detection algorithm to our labeled dataset.
For transcription, this is a little more difficult because we have images as $X$ and text as $Y$. However, we can adapt the embedding of the label $Y$ so that it can be appended to $X$. Once our pair $(X,Y)$ is adapted, we can apply anomaly detection algorithms.
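The trick of concatenating the embeddings of $X$ and $Y$ could be sketched as follows. The embeddings here are made-up two-dimensional vectors, and the anomaly score (mean distance to the `k` nearest neighbors) is just one simple nearest-neighbor choice among many; both are assumptions made for this example.

```python
import math


def knn_anomaly_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    Higher scores mean the point is more isolated, hence more suspicious.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical embeddings of the assets (X) and of their labels (Y).
x_embeddings = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [0.1, 0.1], [0.05, 0.05]]
y_embeddings = [[1.0, 1.0], [1.0, 0.9], [0.9, 1.0], [1.0, 1.0], [-1.0, -1.0]]

# Concatenate each asset embedding with its label embedding.
pairs = [x + y for x, y in zip(x_embeddings, y_embeddings)]

scores = knn_anomaly_scores(pairs)
most_suspicious = max(range(len(scores)), key=scores.__getitem__)
print(most_suspicious, scores)
```

The last pair has an asset embedding similar to the others but a very different label embedding, so it gets the highest score and would be the first candidate for review.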
The next phase is to re-annotate mislabeled assets. Once we have assets flagged with errors, we can send them back to several labelers for annotation, whatever the type of asset.
As already mentioned when we talked about detecting errors in small datasets, the task of labeling assets has to follow a carefully thought-out workflow; as an example, you can follow the steps of this article.
Check discrepancies between the training dataset and test dataset
We have to be aware that supervised machine learning learns from the training dataset. If there is some asset with a particular pattern in the test dataset that the trained model did not learn because of a lack of examples, it is not surprising that it yields incorrect predictions.
To deal with this problem, we have to check for discrepancies between the training dataset and the test dataset. We can use dataset visualization techniques such as Uniform Manifold Approximation and Projection (UMAP) or t-distributed Stochastic Neighbor Embedding (t-SNE), which are embedding methods.
As an example of UMAP for object detection, consider a dataset of cats and dogs. In this dataset, there is an image of a dog with a mesh in front of the pet, and this image falls inside the cluster of cats, which gives us a data point to pay attention to. The training dataset has to include this image pattern to train the model, and so does the test dataset, to check whether our machine learning model has learned it correctly.
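Alongside the visual inspection, a simple numeric check can point to test assets that have no similar example in the training set. The sketch below is not UMAP or t-SNE: it is a plain nearest-neighbor distance check, swapped in here because it needs no external library. The embeddings and the `max_distance` threshold are hypothetical.

```python
import math


def uncovered_test_points(train_embeddings, test_embeddings, max_distance):
    """Return indices of test points whose nearest training point is too far away."""
    flagged = []
    for i, test_point in enumerate(test_embeddings):
        nearest = min(math.dist(test_point, t) for t in train_embeddings)
        if nearest > max_distance:
            flagged.append(i)
    return flagged


# Hypothetical 2-D embeddings of training and test assets.
train = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
test = [[0.05, 0.1], [1.0, 0.9], [3.0, -2.0]]

flagged = uncovered_test_points(train, test, max_distance=0.5)
print(flagged)
```

The flagged test assets are good candidates to look up in the UMAP or t-SNE plot, and their pattern is one for which more training data may be needed.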
We have seen different techniques to clean data. Now that we know that our dataset is cleaner, we can increase our data.
Increasing data is a way to increase the accuracy of our machine learning model but also to make it more robust. First, we will focus on patterns where our ML model does not perform well, since we have already detected mislabeled assets. For these kinds of assets, it is recommended to increase data so that the model performs better.
We propose three strategies to increase data:
Data augmentation: Increase data from our dataset.
We could apply different methods to increase data from the dataset depending on the type of assets.
For images, the main methods used typically involve:
Applying non-linear transformation: elastic transform, grid distortion
For text, some methods are:
Back translation: Translate original text to another language, then translate to the original language.
For sound assets, the main methods are:
Data generation: Synthesize data
For images and video, we can use style transfer methods with generative adversarial networks.
Data collection: Create data taken from the real world.
We can create real-world assets, taking photos, recording sounds or videos, etc.
In this work, we proposed a workflow to fix quality issues in our dataset in order to improve the accuracy of our deployed machine learning services. We compiled practices into a workflow and showed how we can systematically improve accuracy with a data-centric approach to get the most value from our time.
A considerable amount of work is under way on cleaning datasets and increasing data. Efforts are currently taking place in academia, led by Andrew Ng and top universities, but also in big companies.
Ng, A. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. YouTube. https://www.youtube.com/watch?v=06-AZXmwHjo
Northcutt, C., Jiang, L., Chuang, I., (2021 April 14). Confident Learning: Estimating Uncertainty in Dataset Labels. Journal of Artificial Intelligence Research. https://doi.org/10.1613/jair.1.12125
Northcutt, C., Chuang, I. & Athalye, A. Cleanlab, Github repository, https://github.com/cgnorthcutt/cleanlab
van der Maaten, L., Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research. https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
Chandola, V., Banerjee, A., Kumar, V. Anomaly detection: A survey. ACM Computing Surveys, Volume 41, Issue 3. https://dl.acm.org/doi/abs/10.1145/1541880.1541882
Amazon SageMaker Ground Truth pricing: https://aws.amazon.com/sagemaker/groundtruth/pricing/
Data-centric AI Competition https://worksheets.codalab.org/worksheets/0x7a8721f11e61436e93ac8f76da83f0e6
Facundo, F. My State-Of-The-Art Machine Learning Model does not reach its accuracy promise: What can I do?. Kili Technology Blog. https://kili-technology.com/blog/my-state-of-the-art-machine-learning-model-does-not-reach-its-accuracy-promise-what-can-i-do/