Date: 2023-01-27 13:56
Read: 7 min

Ultimate Guide to Data Labeling in Machine Learning [2023 edition]

This ultimate guide covers all the important aspects of data labeling. Find out what data labeling is all about and how it can improve your enterprise.

Data labeling is the task of systematically recognizing and identifying specific objects within raw digital data, such as video stills or computerized images (in the context of computer vision), thereby “tagging” them with digital labels that enable machine learning (ML) models to create accurate forecasts and assessments.

Here is some further reading material on other forms of data labeling, such as audio transcription and text annotation in a document. These pertain more to the context of natural language processing.

Data labeling and data annotation are characteristic of the initial stage of developing a machine learning model. This stage requires identifying raw data and then adding one or more labels to that data to specify its context for modeling. This preprocessing step allows the machine learning model to make accurate predictions.

For example, digital data annotation can enable autonomous automobiles to stop when approaching pedestrian crossings, digital assistants to recognize certain human voices, and digital security cameras to detect strange or suspicious conduct.

Machine Learning has made significant advances within the last twenty years or so. This has been attributed to concurrent and contemporary improvements in computing processing power and the major breakthroughs achieved in deep learning research. Another key catalyst has been the quantity of digital data that has been collected and stockpiled. These advanced algorithms have progressed rapidly, and their need for digital training data has kept pace too.

A key concept required for successful machine learning outcomes is the technique called supervised learning. Supervised learning is a machine learning approach defined by its use of labeled datasets. These datasets are designed to train or “supervise” algorithms to correctly classify data, thereby predicting accurate outcomes. Because the correct answers are known, the machine learning model can measure its accuracy against the labeled data and adjust itself over time.
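As a rough illustration of supervised learning, the sketch below uses a tiny, made-up labeled dataset and a 1-nearest-neighbour rule. The data points, the `predict` and `accuracy` helpers, and the "cat"/"dog" labels are all assumptions chosen for illustration, not part of any particular library or workflow.

```python
import math

# Hypothetical labeled dataset: (feature vector, human-assigned label) pairs.
train = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.8), "cat"),
    ((4.0, 4.2), "dog"),
    ((3.8, 4.0), "dog"),
]
# Held-out labeled examples used only to measure accuracy.
held_out = [((1.1, 0.9), "cat"), ((4.1, 3.9), "dog"), ((3.9, 4.4), "dog")]


def predict(features, labeled_examples):
    """Return the label of the closest labeled example (1-nearest neighbour)."""
    nearest = min(labeled_examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]


def accuracy(examples, labeled_examples):
    """Fraction of examples whose predicted label matches the human label."""
    hits = sum(predict(x, labeled_examples) == y for x, y in examples)
    return hits / len(examples)


print(accuracy(held_out, train))
```

The point of the sketch is that accuracy can only be computed because every held-out example carries a human label; if those labels are wrong, both training and evaluation are affected.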

How Precisely Does Data Labeling Work?

1. Data Collection Process

The collection of a significant amount of digital data (images, video stills, videos, audio files, transcribed texts, etc.) is typically the first step in the process. A large and varied quantity of digital data will generally yield more accurate results in the long run than a small amount of data can provide.

2. Data Labeling

Digital data tagging is carried out by human data labelers who identify entities in the unlabeled data while working on a data labeling platform. For example, human data labelers might be tasked with marking whether a digital image contains people, or with tracking a tennis ball through a video of a tennis match.

3. Quality Assurance Methods

The labeled or annotated data must be informative and accurate to nurture high-performing ML business models. Therefore, an internal and trusted quality assurance process is needed to review the accuracy of the labeled data. Otherwise, there is a high risk that the ML business model will fail to perform satisfactorily.

What Factors Affect the Quality of Data Labeling?

1. Information and Context

Data labeling requires basic domain knowledge and contextual understanding within the workforce. These skills are mandatory for constructing high-quality, structured digital datasets for ML business models. Data labelers will label their data with a higher level of quality when they understand the reasoning and context of the data they are annotating.

You can learn more about datasets through these useful links: How to create a dataset for Machine Learning; How to clean a dataset in Python.

2. Agile and Flexible

ML business modeling is an iterative process. Data labeling will naturally evolve as testing and quality validation of the ML business models progress and key learnings are adopted from their outcomes.

Digital data labeling teams must be flexible enough to absorb changes that reflect the end users' changing needs or products. A robust, yet flexible, digital data labeling team should be able to react positively to changes in data volume, increases in task complexity, and shifts in task duration. The more adaptive the labeling team can be, the more successful the ML business projects will be.

3. Communication is Key

Positive and direct communication within the digital data labeling team is vital. A practical, closed feedback loop is an excellent means of establishing dedicated communication and cooperation between the ML project team and the digital data labelers. In addition, data labelers are always advised to share their key learnings as they label the data, so these valuable insights can be used to modify the approach if required.

Some Common (and Important) Missteps to Avoid During The Data Labeling Process

Once a data labeling team is established, performs its duties, and commences delivering valuable results, it can easily get carried away when deep diving into data labeling for any ML business project. Instead, one needs to be mindful of the missteps and pitfalls that may derail any ML project – and certain principles or rules must be adhered to when growing the digital data labeling enterprise.

1. Punctuation Issues, Excess Whitespace, and Case Sensitivity Rules

When performing data labeling, it is vital to ensure that the final tags follow the format that was created, and agreed upon, beforehand. Any new tags that need to be added must be communicated to the entire team before they are used. When annotating text, for example, stray whitespace, punctuation, and inconsistent capitalization can be problematic. Such differences can generally be rectified using an assortment of scripting tools that are available on the market.
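A small script is often enough to catch whitespace, punctuation, and case differences. The following is a minimal sketch using only Python's standard library; the `MASTER_LABELS` set and the helper names are hypothetical and would need to match your own agreed label list.

```python
import re

# Hypothetical master list agreed on before labeling starts.
MASTER_LABELS = {"person", "tennis ball", "pedestrian crossing"}


def normalize_label(raw):
    """Lowercase, trim, collapse inner whitespace, drop trailing punctuation."""
    text = raw.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".,;:!?").strip()


def check_labels(raw_labels):
    """Split labels into normalized known labels and unknown ones to discuss."""
    known, unknown = [], []
    for raw in raw_labels:
        label = normalize_label(raw)
        (known if label in MASTER_LABELS else unknown).append(label)
    return known, unknown


known, unknown = check_labels(["Person", " tennis  ball.", "Racket!"])
print(known, unknown)
```

Labels that end up in `unknown` are not silently accepted; they are candidates to raise with the whole team, in line with the rule that new tags must be communicated before use.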

Any digital labeling exercise will encounter various challenges along the data annotation journey. However, some of these challenges are a symptom of a more intrinsic issue that needs to be dealt with. For example, within the medical imaging industry, the data labeling workforce needs a base level of medical knowledge and practical experience when viewing intricate images resulting from X-rays, MRI, and CT scans.

2. Nested and Inverted Data Tags

Further issues may arise with inverted or nested data tags. Let’s use a text document as an example: an initial tagged sentence or phrase may be correct because the tags were applied in the order in which they appeared within the sentence. But what about the scenario where the exact same words or phrases occur in a different order, yet still with the same meaning? We could end up with a totally different order of tags, which would attach a totally different meaning to what are essentially the same words. When using nested labels, everyone labeling the ML dataset must follow the same format. This issue extends to other data types, such as images and videos.

3. Adding Any New Labels to Your Project Mid-Term

Halfway through a digital labeling undertaking, your team may realize that it needs a new label that is not present in the agreed-upon master labeling list. As a first reaction, the team might simply add the new label to the master list and inform the whole team to use it from now on when required.

But this process should be highly discouraged, for the simple reason that the data the team has already labeled would have to be re-examined to see whether the new label applies to it. It is best practice to always compile and agree upon the list of labels before commencing any labeling endeavor. If a new data label turns out to be required mid-project, it is ideal to note this occurrence and then, once the data labeling exercise is completed, review as a team how the label was missed and how this can be prevented from recurring.

Top 5 Methods for Label Error Detection in Unstructured Data

Here we will cover the top 5 methods for label error detection in unstructured data. You might be surprised to find that there is a range of methods, and they don’t all have to be model-based! In fact, most are workflow-based!

Why is error detection important?

Label error detection is part of the quality control process for model development. In structured data, there are generally more or less objective measures for model performance (e.g., precision and recall, regression measures, separateness measures), but for unstructured data, there is an additional element of human interpretation of both the model outputs and the data inputs and what is subjectively or even semantically “correct.”

Why do error detection?

  • To better understand the data: unstructured data is typically much harder to summarize than structured data. Serving up examples of where labels are uncertain or “bad” can help inform the MLOps Engineer where to focus data generation efforts, though this process, as you will read, is model-based and therefore can have a higher bar to clear in implementation.

  • To understand if there are underlying issues in labeling instructions: it can quickly become apparent that labels aren’t as you intended them to be if there are certain consistencies in the errors. This is oftentimes due to a lack of clear instructions, especially if there is ambiguity in how to label a certain class.

  • To understand if there is labeling bias: unconscious bias can creep into datasets, especially when dealing with qualitative, emotive, and cultural label classes. While labels may be accurate for the subjects involved, they may not be representative of all subjects belonging to that class. For example, if querying “successful business person” returns only images of middle-aged white men dressed in suits, that could be construed as a severe bias issue, because the model or labels omit (or have no examples of) successful business people of other ages, races, genders, and fashion senses.

Prevent errors from the start - Interface setup

Preventing errors can be fairly simple, especially when using a labeler-centric platform like Kili Technology. Within the Kili Technology platform, you have the ability to structure labeling jobs that can force compliance with the desired outcome. Tools such as required jobs, specified relations, and handy pop-up tooltips to quickly clarify annotation intent to the labeler are all standard within Kili.

Object Detection Job is Required, specifying level of occlusion is not

Object Relation can either be between specific objects or relate any object to one another

Tools like these can help prevent simple errors, smooth out the workflow, and help guide labelers to the correct output.

Catching errors early - Quality Control Workflows

If you’re curating your own dataset from scratch or relabeling a standard dataset, then you have the opportunity to both define what the ground truth is and catch errors at the earliest stage. This is also the best and most cost-effective way of detecting errors—especially for novel/purpose-built AI/ML applications.

A quality control workflow can be as simple as reviewing a randomly chosen subset (say, 100 assets) of the labeled data to validate that the labels are what you expect them to be. If there is higher than a given threshold of errors (say, 10) in your subset, it may warrant further investigation.
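The spot check described above can be sketched in a few lines of standard-library Python. The function names, the fixed random seed, and the shape of `review_results` are assumptions made for this illustration; the sample size of 100 and the threshold of 10 simply mirror the numbers in the text.

```python
import random


def sample_for_review(asset_ids, sample_size=100, seed=0):
    """Pick a reproducible random subset of labeled assets to review."""
    assets = list(asset_ids)
    rng = random.Random(seed)
    return rng.sample(assets, min(sample_size, len(assets)))


def needs_investigation(review_results, max_errors=10):
    """review_results maps asset id -> True if the reviewer found the label wrong.

    Returns True when the number of errors is higher than max_errors.
    """
    errors = sum(1 for is_wrong in review_results.values() if is_wrong)
    return errors > max_errors


to_review = sample_for_review(range(5000))
print(len(to_review))
```

A reviewer would fill in `review_results` for the sampled assets, and `needs_investigation` then tells you whether the error count crossed the chosen threshold.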

A more complex review procedure. Keep reading to learn more!

Use analytics to spot inconsistencies in Human labels or Model Predictions

While it is technically more total work, labeling the same asset more than once gives you a greater degree of certainty that it was labeled correctly, and the opportunity to quickly sift through data based on the level of consensus, which can greatly focus the review process covered above.

Consensus works by having more than one labeler annotate the same asset. When the asset is labeled, a Consensus score is calculated to measure the agreement level between the different annotations for a given asset. There are various methods for calculating consensus for different data types. In Kili Technology you can also specify custom consensus calculations to suit your specific needs.

Once you have a calculated consensus score, it is easy to sort by and query your dataset to empower a focused review of data below certain thresholds for agreement. In this way, you are leveraging a bit more up-front work, for more ease of dataset use further down the pipeline.
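For a single-label classification task, one simple way to express consensus is the share of annotators who chose the most common label. The sketch below assumes that definition; it is not the formula used by Kili Technology or any other platform, and the asset ids and labels are invented for the example.

```python
from collections import Counter


def consensus_score(labels):
    """Share of annotators who chose the most common label for one asset."""
    if not labels:
        raise ValueError("at least one annotation is required")
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def assets_below(annotations, threshold):
    """Return (asset_id, score) pairs below the threshold, lowest first."""
    scored = [(asset, consensus_score(labels))
              for asset, labels in annotations.items()]
    return sorted((p for p in scored if p[1] < threshold), key=lambda p: p[1])


# Hypothetical annotations: three labelers per asset.
annotations = {
    "img_1": ["cat", "cat", "cat"],
    "img_2": ["cat", "dog", "cat"],
    "img_3": ["cat", "dog", "bird"],
}
for asset, score in assets_below(annotations, threshold=0.7):
    print(asset, round(score, 2))
```

Other data types need other agreement measures (for example, overlap between boxes or spans), but the review pattern is the same: score every asset, then look at the lowest scores first.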

Much like a human-human consensus, you can also evaluate your model based on data labeled by humans (test and/or validation subset). The calculations for this agreement metric are the same over the data types, and in the Kili Technology Platform, you can easily bubble up the least-agreeing assets for further investigation.

This can help you identify where the model is struggling, and find commonalities in the disagreement between the machine and the annotator. This type of metric can also be used in a reverse way—having a validated “good” model spit out inferences and test new annotators against the machine, especially useful when using the same testing dataset that is known to be labeled well by your model.
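For object detection, agreement between a human annotation and a model prediction is commonly measured with intersection over union (IoU). The sketch below assumes exactly one box per asset in `(x_min, y_min, x_max, y_max)` form, which is a simplification; the function names and data layout are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def least_agreeing(human_boxes, model_boxes, top_n=10):
    """Rank asset ids present in both dicts by human/model IoU, lowest first."""
    shared = human_boxes.keys() & model_boxes.keys()
    scores = {a: iou(human_boxes[a], model_boxes[a]) for a in shared}
    return sorted(scores.items(), key=lambda item: item[1])[:top_n]


human = {"img_1": (0, 0, 2, 2), "img_2": (0, 0, 2, 2)}
model = {"img_1": (0, 0, 2, 2), "img_2": (1, 1, 3, 3)}
print(least_agreeing(human, model, top_n=1))
```

The same ranking works in the reverse direction described above: swap in a validated model's predictions as the reference and a new annotator's boxes as the candidate.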

Program a custom plugin to check certain rules

If you want to catch some simple errors quickly and automatically, it is easy to program a custom plugin in Kili (Beta Feature) to perform checks (regex, number of objects, name and signature, etc.) upon certain triggers (e.g., submitting an asset or reviewing an asset). You can also compute a custom consensus if you’re not finding the option you need among Kili Technology's standard options!

Automating the review of tedious rules in the Kili Technology training data platform

This is a powerful way to automate the review of tedious rules that could trip up even the most seasoned reviewer.
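To give a feel for the kind of rules such a check might enforce, here is a plain-Python sketch. It deliberately does not use the Kili plugin API: the annotation dict shape, the `INV-` pattern, the object limit, and the state names are all hypothetical and would be replaced by whatever your platform and guidelines define.

```python
import re

# Hypothetical rules for one project; adjust to your own labeling guidelines.
INVOICE_ID = re.compile(r"INV-\d{6}")
MAX_OBJECTS = 20


def check_annotation(annotation):
    """Return a list of human-readable rule violations for one submitted label.

    `annotation` is an assumed dict shape: {"invoice_id": str, "objects": list}.
    """
    problems = []
    if not INVOICE_ID.fullmatch(annotation.get("invoice_id", "")):
        problems.append("invoice_id does not match INV-######")
    count = len(annotation.get("objects", []))
    if count == 0:
        problems.append("no objects were annotated")
    elif count > MAX_OBJECTS:
        problems.append(f"{count} objects exceeds the limit of {MAX_OBJECTS}")
    return problems


def next_state(annotation):
    """Route an asset: back to the labeler on violations, else on to review."""
    return "send_back" if check_annotation(annotation) else "to_review"


print(next_state({"invoice_id": "INV-123456", "objects": ["box_1"]}))
```

Returning the list of problems, rather than a bare pass/fail, makes it straightforward to turn each violation into feedback for the annotator.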

Further, plugins like these can automate workflows as well, as the triggering event can cascade to any number of next states for the asset: it can be returned to the annotation queue, flagged for review (1st, 2nd, nth-order reviews are possible!), used to provide automatic feedback to annotators, or even sent to another labeler for an additional set of labels for more robust consensus (including best 2-of-3 consensus, etc.).

The two-stage review process for a DCAI model in production

Have your model decide for you

If you’re reading this, you’re likely an ML Engineer or Data Scientist, or you know one. So I’ve saved the most technical for last. You can of course take the approach of having your model give you its “confidence” in its prediction. While it is tempting to eliminate as much human labor as possible through the power of coding and cheap computing, it is best to try to implement the first four methods before diving into the code (you’ll catch things with less effort, and with far less frustration with your model).

Ok, now that we have that disclaimer out of the way, let’s talk about model confidence and the many ways that it can be implemented.

To implement machine learning model confidence in a data-centric AI workflow, there are a few key steps to follow:

  1. First, you will need to collect and prepare the data that will be used to train the machine-learning model. This may involve cleaning and preprocessing the data, as well as annotating it to make it intelligible to machine learning models.

  2. Next, you will need to select and train a machine-learning model that is capable of calculating model confidence metrics. Some models use a probabilistic approach, where the confidence is calculated based on the probability of a given prediction being correct. Other models use more complex methods, such as comparing the output of multiple models or using uncertainty estimates derived from the training data.

  3. Once the model is trained, you must evaluate its performance and determine its model confidence metrics. This may involve using tools and algorithms from the Cleanlab GitHub repository, or implementing your own methods for calculating model confidence.

  4. After the model confidence metrics have been calculated, you can use them to improve the model's performance. For example, you may choose to use the confidence metrics to weight the predictions in your DCAI workflow and perform active learning by focusing on the least-confident predictions first, correcting them to enrich the dataset, and finally re-training the model and iterating.
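The review step in this loop can be sketched as follows: rank assets by how little probability the model gives to the human-assigned label, and review the lowest first. This is a simplified heuristic written for illustration, not Cleanlab's algorithm; the record layout, the 0.5 threshold, and the function names are assumptions.

```python
def self_confidence(pred_probs, given_label):
    """Probability the model assigns to the label a human gave."""
    return pred_probs[given_label]


def review_queue(records, threshold=0.5):
    """Return ids of likely label errors, least confident first.

    Each record is an assumed dict: {"id", "label", "probs": {class: prob}}.
    A record is flagged when the model prefers another class and gives the
    human label a probability below `threshold`.
    """
    flagged = []
    for rec in records:
        predicted = max(rec["probs"], key=rec["probs"].get)
        conf = self_confidence(rec["probs"], rec["label"])
        if predicted != rec["label"] and conf < threshold:
            flagged.append((conf, rec["id"]))
    return [rec_id for _, rec_id in sorted(flagged)]


# Hypothetical model outputs for three labeled assets.
records = [
    {"id": "a", "label": "cat", "probs": {"cat": 0.90, "dog": 0.10}},
    {"id": "b", "label": "cat", "probs": {"cat": 0.20, "dog": 0.80}},
    {"id": "c", "label": "dog", "probs": {"cat": 0.95, "dog": 0.05}},
]
print(review_queue(records))
```

For the probabilities to be meaningful, they should come from a model that did not see the asset during training (for example, via cross-validation); otherwise the model tends to agree with whatever label it memorized.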

By following these steps, you can implement machine learning model confidence in a DCAI workflow, improving the performance and reliability of your AI systems. At Kili, we have several notebooks that implement AutoML seamlessly integrated into the Kili workflow, and some even have model error detection modules. Check them out for yourself!

AutoML in the Kili Technology Platform

Conclusion

In conclusion, label error detection methods are crucial for ensuring the quality and accuracy of labeling workflows in machine learning. These methods involve setting up interfaces that allow for easy and efficient labeling, implementing quality control workflows to identify and correct errors, using analytics such as consensus to determine the reliability of labels, using custom plugins to check for adherence to specific rules, and leveraging model confidence metrics to assess the accuracy of the labeling process. By utilizing these label error detection methods, organizations can improve the reliability and effectiveness of their machine-learning models.

Data Labeling Best Practices

Acquiring high-quality labeled data is a barrier to scaling that becomes more significant when handling complex ML models. Yet, there are many strategies and approaches to improve the efficacy and precision of the data labeling process:

  • Clear Labeling Instructions: Communicating with the digital data labelers and providing clear labeling requirements helps ensure the desired accuracy of the delivered results.

  • Label Verification: It is necessary to audit the digital labels to confirm their exactness and make any adjustments if required.

  • Transfer Learning: Another method to improve the efficiency of data labeling is to reuse previous data labeling assignments to create hierarchical data labels. Essentially, ML operators can utilize the output of previously trained ML project models as input for another new project.

Conclusion

AI is revolutionizing how we perform certain repetitive functions, and enterprises that have adopted such business models are reaping the benefits. The technological opportunities AI can deliver are potentially inexhaustible and will assist diverse industries in becoming smarter, from medicine to agriculture to recreation and sports. Digital data annotation is the first stage of such innovation.


An article by Michael Van Meurer

Solution Engineer @ Kili Technology
