Data scientists and machine learning (ML) teams often struggle to move their models into production or to reach a level of performance that provides real value. Gartner has reported that about 80% of ML projects never reach deployment, and those that do are profitable only about 60% of the time.
The challenge for most of these teams is not so much the quality of their model algorithms, which are increasingly standardized for most use cases. Their real challenge lies in building a high-quality dataset at the model training stage.
Training a model and getting good results is not easy, and even if you do, it does not mean the model will succeed in the real world. Indeed, ML performance depends heavily on the quality of the training dataset. If the dataset is filled with corrupted or poor-quality data, if it is unbalanced or biased, or if labels are wrong or inconsistent, then what your ML model has learned will set it up for failure. The challenges ML engineers face in the training phase are varied. They include:
collecting the right data to building a comprehensive dataset;
properly defining the goals and governance of the project;
preparing the workforce to be as productive and efficient as possible;
enforcing strict quality assurance control.
In this article, we will review what we consider to be the best practices for managing data annotation projects, so you can successfully ship your ML models into production and start creating real value for your organization.
The basis of a data annotation project
A data annotation project has 7 key phases:
define the annotation project: your first step should be to get a clear understanding of what your goal is and how you want to achieve it. This will set the guidelines for what type of data you need to collect, how best to collect it, what type of annotations are required, how the annotations will be used, and of course the budget, resources, and time required. Also, favor multiple small projects over one big annotation project: the simpler an annotation task is for the annotators, the better the quality. Finally, don't forget to clearly identify your key stakeholders and communication processes to make sure your project is managed and tracked in the most efficient way.
prepare your dataset: prepare a dataset that is as diverse as possible. More important than quantity is the diversity of your data, so your model can avoid bias and cover all edge cases. For example, say you want to build an ML model that identifies people crossing a street. Make sure your dataset includes all types of people, with all types of dress codes, crossing the street in all kinds of weather and lighting conditions.
select the workforce: all data annotation projects require a human workforce, but not everyone can label every type of data effectively. You can't just ask anyone to label medical images from brain scans. Identify the workforce you need based on your project's data: if it is simple, the workforce can be crowdsourced; if it is more specialized, you might need to hire subject matter experts (SMEs). Sometimes these SMEs are available within your organization; other times, you will need the services of a workforce provider.
pick and leverage a data annotation tool: there are many annotation platforms available. When you start looking for one, the first thing to verify is that it ensures data privacy and controls workforce access. Then, make sure it can be easily integrated into your infrastructure. Finally, check that it has the UX/UI and tools that fit your project management needs and annotation use cases.
define comprehensive guidelines and provide regular feedback: quality will set your project up for success or failure. We estimate that a 10% decrease in label accuracy reduces model accuracy by 2 to 5%. To reach the highest level of quality, define and communicate clear, consistent guidelines to the workforce. These guidelines will probably need to be updated and shared again as the project progresses. The workforce should also be encouraged to ask questions and provide feedback through dedicated channels.
set a strict quality assurance process: at the beginning of the project, set a quality metric such as consensus (agreement between several annotators labeling the same asset) or honeypots (assets with known labels used to check annotator accuracy). You can then filter the assets (images, texts) where labelers disagree, choose the right label, and enrich the guidelines with those edge or difficult cases. We have noticed that it is usually better to annotate less data with consensus than more data without it.
State-of-the-art platforms should also let you map annotations in dedicated visualization tools so you can easily see whether your annotations are consistent. These tools help identify anomalies, unbalanced datasets, unclear annotation guidelines, data drift, and so on, allowing you to make strategic adjustments promptly.
iterate on your data quality, again and again: as we mentioned before, your ML model's performance will only be as good as the quality of your training dataset. Once your annotation phase is done and you move into testing and then production, never stop monitoring and analyzing your model's outputs.
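The consensus check from the quality assurance phase can be sketched in a few lines. This is a minimal illustration, not any specific platform's API: the asset ids, labels, and the 0.7 threshold are hypothetical, and real platforms typically use richer agreement metrics.

```python
from itertools import combinations

def consensus_score(labels):
    """Fraction of annotator pairs that agree on an asset's label."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single annotation cannot disagree with itself
    return sum(a == b for a, b in pairs) / len(pairs)

def flag_for_review(annotations, threshold=0.7):
    """Return the assets whose consensus falls below the threshold."""
    return [asset for asset, labels in annotations.items()
            if consensus_score(labels) < threshold]

# Three annotators labeled each asset (hypothetical data).
annotations = {
    "img_001": ["pedestrian", "pedestrian", "pedestrian"],
    "img_002": ["pedestrian", "cyclist", "pedestrian"],
    "img_003": ["cyclist", "pedestrian", "car"],
}
print(flag_for_review(annotations))  # → ['img_002', 'img_003']
```

The flagged assets are exactly the edge cases worth arbitrating and feeding back into the guidelines.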
We call this a data-centric approach to AI: we focus on fine-tuning the data rather than the algorithm.
Identify where your models are underperforming and understand why, so you can make strategic adjustments to your training dataset and iterate. Best practice is to iterate at every level of the project, as in lean and agile management. For example, you can iterate on:
the number of assets to annotate;
the percentage of consensus;
the composition of the dataset;
guidelines given to your workforce;
or enriching your dataset based on new real-world parameters (COVID-19 is a good example of an event that drastically impacted many ML models).
Not only will this increase your model's performance, it will also reduce your annotation cost, as you will better target the data your model really needs (we estimate that 40% of training assets are redundant). Finally, the more you fine-tune your training dataset, the better your model will perform in the real world and stay relevant.
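One common way to target the data the model really needs is uncertainty sampling: prioritize for annotation the assets where the model is least confident. A minimal sketch, assuming you have a top-class confidence score per asset (the asset names and scores below are illustrative):

```python
def least_confident(predictions, k=2):
    """Return the k assets with the lowest top-class confidence,
    i.e. the ones most worth adding to the next annotation batch."""
    ranked = sorted(predictions.items(), key=lambda kv: kv[1])
    return [asset for asset, _ in ranked[:k]]

# Hypothetical model confidences on unlabeled assets.
predictions = {"img_a": 0.97, "img_b": 0.51, "img_c": 0.88, "img_d": 0.43}
print(least_confident(predictions))  # → ['img_d', 'img_b']
```

Annotating these uncertain assets first tends to improve the model faster than annotating redundant, high-confidence ones.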
Training data, and therefore data annotation, is the root of any ML model. Implementing these 7 best practices in your data annotation projects and adopting a data-centric AI approach will save considerable time and cost as you develop effective ML solutions and create real value for your organization.
How to choose the right Data Annotation Platform
Most companies are only starting their journey toward data-centric AI and have little knowledge of what the market has to offer to support them. Here is a short review of the types of data annotation solutions:
entry-level: entry-level data annotation platforms provide a way to start your AI journey for simple use cases and quick model training, but they are not collaborative, have poor quality-management tools, and do not scale.
more advanced: more advanced solutions, described as training data platforms, enable organizations to address more complex use cases and provide better scaling capabilities. They are a good fit when your use case requires training your model once, but they have limitations when it comes to focusing on data quality to improve model performance and managing the data life cycle.
for data-driven companies: the best data training and management solutions are data-centric. They help organizations that have started their data-driven and AI maturity journeys. At Kili, our data-centric AI approach drives how we design the data annotation process. We enable our customers to understand their dataset better, fine-tune their dataset quality, better foster collaboration between ML engineers, subject matter experts, and the workforce, and iterate quickly and continuously to improve their model performance over its entire lifecycle.
To get into more detail on managing your data annotation projects efficiently, catch up on the replay of the webinar given by our co-founder and CTO, Edouard d'Archimbaud!