The Power of Data Labeling & Quality Assurance: a Case Study with Kili Technology & Quality Match

While the digital world is moving from a code-centric to a data-centric paradigm, building datasets remains a complex task. Many companies around the world are currently trying to make dataset building as easy and efficient as possible. Among them, two contestants stand out: Kili Technology and Quality Match. Let's dive into how to use them to your advantage!

Kili Technology

Dec 4, 2023

In our digital era, characterized by the prevalence of data, the need for well-annotated datasets is omnipresent. Whether it is for the purpose of training machine learning models or enhancing the accuracy of corporate decision-making, the importance of high-quality labeled datasets cannot be emphasized enough. Nevertheless, creating datasets that are extensive in size, coverage, and data diversity while ensuring accurate annotations is a complex task.

The process of developing robust training datasets is further complicated by the absence of an established Quality Assurance (QA) framework for dataset construction. Data labeling for ML model training is a new industry that is evolving, relying on trial and error, best practices, and new research. Engineers are actively seeking guidance in this realm.

This article will delve into how data labeling plays a pivotal role in the success of your AI project, emphasizing the critical need for implementing a robust QA process. We will also provide insights into efficient setup strategies for both aspects, using Kili Technology and Quality Match. By the time you finish reading this article, you will have a clear understanding of how to create the most effective data labeling solution. Let's get started!

Why is Data Quality Important?

Label accuracy is paramount in machine learning and artificial intelligence. Research has consistently demonstrated that a 10% decrease in labeling accuracy can have a substantial ripple effect, resulting in a 2 to 5% decrease in model performance. A 2 to 5% drop in model performance is a massive gap: it is like a self-driving car running over up to 5 pedestrians out of 100. To mitigate this drop in performance, ML teams usually add more data, which leads to larger datasets and, in turn, higher labeling costs. This does not solve the issue: more data does not automatically mean better performance when the data is poorly labeled.

Consequently, prioritizing data quality right from the outset is a prudent approach to effectively oversee both performance and resource allocation. Enhanced data quality translates to a reduced requirement for data labeling. Smaller datasets lead to decreased infrastructure expenses while simultaneously yielding superior model performance.

Beyond its financial and performance implications, making label quality the number one focus also has a massive impact on limiting biases within datasets.

Biases such as racism, sexism, and xenophobia can propagate through machine learning models and impact the performance and adoption of the final product. Therefore, ensuring high-quality labels is not merely a matter of technical proficiency; it is a fundamental step toward building ethical and unbiased AI systems.

Why is Achieving Data Quality Difficult?

Crafting accurate and effective AI models goes beyond just coding. In the contemporary landscape, data has taken center stage, marking a departure from the early 2000s when code dominated digital interfaces. This shift underscores the critical role of dataset management. While the significance of writing well-structured and precise code remains undiminished, it's increasingly apparent that enhancing datasets yields more substantial improvements in the performance of AI models than refining algorithms. Consequently, the core focus of machine learning teams is transitioning from model development to the cultivation of high-performance datasets.

However, achieving data quality is a difficult task. Today, it is estimated that 3.4% of the data in public datasets has erroneous labels. QuickDraw (a dataset by Google) records an error rate of 10%, while ImageNet is at 6%. The LAION dataset, on which Stable Diffusion was trained, is full of errors such as wrongly detected languages or unsafe content.

Why is that? Well, even seemingly straightforward labeling tasks present complexities. Let’s take the example of detecting a kite surfer in an image. Should the bounding box include the surfer, the sail, or both? Establishing clear and precise labeling guidelines at the project's outset and ensuring consistent adherence throughout are essential. Classification tasks, even binary ones like 'yes' or 'no,' introduce complexity: should an Easter chocolate bunny qualify as a 'bunny'? Does buttered toast qualify as bread, or is it transformed bread? Should the melted butter on the surface be labeled as butter?

Moreover, a well-balanced distribution of data within the dataset is crucial, as imbalances often prevail. For instance, in the automotive industry, imbalances may manifest as a lack of data for scenarios like vehicles with blinkers on, yellow traffic lights, or unique vehicles such as sidecars and quads. Similarly, addressing rare edge cases, such as zigzag road lines or clusters of ten traffic lights, becomes crucial, especially when training models for self-driving cars.
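
As a rough illustration of how such imbalances can be spotted early, here is a minimal sketch that counts labels per class and reports the classes falling below a chosen share of the dataset. The function name `find_rare_classes`, the 5% threshold, and the sample labels are all assumptions made for this example, not part of any tool discussed in this article.

```python
from collections import Counter

def find_rare_classes(labels, min_share=0.05):
    """Return classes whose share of all labels is below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        cls: count / total
        for cls, count in counts.items()
        if count / total < min_share
    }

# Hypothetical label list for a driving dataset.
labels = ["car"] * 90 + ["truck"] * 8 + ["sidecar"] * 1 + ["quad"] * 1
rare = find_rare_classes(labels)
print(rare)
```

Classes reported by a check like this are candidates for targeted data collection rather than for simply labeling more of what is already well represented.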

In essence, saying that data quality is paramount to ML model performance is fine, but achieving data quality excellence is a complex task that the industry has not yet perfected.

The Strengths of Kili Technology, the High-Quality Training Data Platform

So, when choosing a tool to label your data from beginning to end, it’s essential to look for the right tools that will make data quality excellence possible.

Kili Technology is a versatile tool designed to streamline data labeling: it boosts productivity and maximizes label quality while integrating fully into your existing ML stack. Recognizing that labeling can be a laborious task, Kili Technology prioritises user experience, offering labelers a comfortable and efficient platform with productivity-enhancing features such as keyboard shortcuts and ML-assisted labeling functions (model-in-the-loop, GPT-automated text labeling, automated object detection based on the highly efficient SAM model, etc.). This user-centric approach aims to minimize the time and effort expended, so that labelers can work for extended periods while maintaining strong attention to detail.

Furthermore, Kili Technology places a strong emphasis on quality defense strategies. Labeling is inherently complex and repetitive, so even the most skilled labelers can make mistakes. As a remedy, Kili Technology lets users fortify quality through several strategies:

Review

Use reviewer permission levels to go over the labels created by the labeling team, add comments, raise issues, validate or send back assets that are erroneous. Keep a tight grip on quality through monitoring.

Quality Metrics

Generate quality metrics such as consensus or honeypot to get a global view of the label agreement between labelers or a labeling score based on ground truth.
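
To make the two metrics concrete, here is a simplified sketch for a single-label classification task. It is an illustration only: the function names, the labeler names, and the scoring rules (majority share for consensus, exact match for honeypot) are assumptions for this example, and the platform's own computation may differ, especially for tasks such as object detection.

```python
from collections import Counter

def consensus_score(labels_by_labeler):
    """Share of labelers who chose the most common label for one asset."""
    counts = Counter(labels_by_labeler.values())
    top_count = counts.most_common(1)[0][1]
    return top_count / len(labels_by_labeler)

def honeypot_score(submitted, ground_truth):
    """Share of honeypot assets where the labeler matched the ground truth."""
    hits = sum(
        1 for asset_id, truth in ground_truth.items()
        if submitted.get(asset_id) == truth
    )
    return hits / len(ground_truth)

# Hypothetical data: three labelers on one asset.
asset_labels = {"alice": "kite_surfer", "bob": "kite_surfer", "carol": "sail"}
consensus = consensus_score(asset_labels)

# Hypothetical data: one labeler on four assets with known ground truth.
ground_truth = {"img_1": "bunny", "img_2": "not_bunny", "img_3": "bunny", "img_4": "bunny"}
bob_labels = {"img_1": "bunny", "img_2": "bunny", "img_3": "bunny", "img_4": "bunny"}
honeypot = honeypot_score(bob_labels, ground_truth)
```

Here two of three labelers agree, so the consensus is about 0.67, and the labeler matches three of four ground-truth assets, giving a honeypot score of 0.75.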

Plugins & Programmatic QA

Automate QA steps with custom plugins, i.e. custom code modules that will automatically detect errors and flag them for your reviewers’ attention.
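
The kind of check such a module might run can be sketched in plain Python. The example below does not use the Kili plugin API; the `Box` class, the `find_issues` function, and the thresholds are hypothetical and only show the idea of flagging suspicious bounding boxes for a reviewer.

```python
from dataclasses import dataclass

@dataclass
class Box:
    label: str
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float

def find_issues(boxes, image_width, image_height, min_area=16.0):
    """Return (box index, reason) pairs for boxes a reviewer should inspect."""
    issues = []
    for index, box in enumerate(boxes):
        if box.width * box.height < min_area:
            issues.append((index, "box area below minimum"))
        if (box.x < 0 or box.y < 0
                or box.x + box.width > image_width
                or box.y + box.height > image_height):
            issues.append((index, "box extends outside the image"))
    return issues

# Hypothetical annotations on a 640x480 image.
boxes = [
    Box("weed", 10, 10, 50, 40),      # plausible box
    Box("weed", 5, 5, 2, 2),          # suspiciously small
    Box("weed", 600, 300, 80, 200),   # spills past the image border
]
issues = find_issues(boxes, image_width=640, image_height=480)
```

Rules like these are cheap to run on every submitted label, which leaves human reviewers free to focus on the cases that need judgment.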

Orchestration of All of the Above

Run all quality strategies consecutively or in parallel, and automate each step of the way as you see fit.

In addition to its commitment to label accuracy, Kili Technology recognizes that labeling is an ongoing process: as the dataset evolves and expands, teams need to label new assets, refine existing labels, and extend applications to new use cases. Efficiency and label quality hinge on establishing a robust iteration loop to address issues promptly. Kili Technology facilitates this iterative approach with integration features, including Single Sign-On (SSO) for user convenience, cloud storage integration, simplified and customizable data export capabilities, and automation tools for efficient project pipeline management, such as pre-labeling subsequent iterations with pre-trained models.

In essence, Kili Technology serves as a comprehensive solution that simplifies labeling operations, enhances productivity, and upholds stringent quality standards.

The Strengths of Quality Match, the First Dataset Quality Assurance Platform

To reach its full potential, data annotation must be synergized with data quality assurance, together forming a cohesive strategy to generate truly high-performance training datasets.

Quality assurance operates as the diligent overseer of the labeling process, swiftly detecting and rectifying errors and inconsistencies. This proactive approach not only enhances the quality of data annotation but also accelerates the creation of better datasets, preempting potential pitfalls during the machine learning model's training phase.

The role of quality assurance extends beyond simple error detection. It also provides the dataset creation process with accountability and confirmability. In a world where regulations are tightening, as evidenced by initiatives like the EU AI Act, the call for transparency and reliability in the scrutiny of AI-based systems has never been more pronounced.

Quality Match (QM) offers a platform that delivers tools to flexibly create processes in a repeatable, software-defined way to produce quality-assured datasets with statistical guarantees, marrying transparency with cost-optimized processes. In doing so, it not only fulfills the immediate need for accuracy but also ensures auditability and compliance with emerging standards, reinforcing the integrity of AI-driven applications.

An effective labeling process must be iterative, emphasizing rapid and streamlined feedback on data quality. This allows for immediate improvements, ensuring that the data is refined and ready before being used to train an AI/ML model.

In this context, quality assurance (QA) is key in detecting and correcting errors in visual datasets. Three main sources of errors often emerge:

Human Error: People can make mistakes, and it's not uncommon to find that 10-30% of errors in human-labeled datasets stem from simple human error.
Bad Sensor Data: There are times when the data itself presents challenges. Whether it's images or lidar that are dark, blurry, or obstructed, even the most skilled human labeler might struggle to interpret it correctly.
Incomplete Taxonomies: This occurs when the labeling guide lacks sufficient clarity or completeness. Even with an expert labeler and crystal-clear data, questions may arise if the guidelines fail to address every possible situation. For instance, should a person in a wheelchair be categorized as a pedestrian? What about a police officer walking across the road?

QM does more than simply correct these errors. It also:

Compares the old and new error rates following the quality assurance, particularly concerning human error.
Identifies ambiguous cases arising from poor sensor data.
Highlights edge cases that result from inadequate taxonomies.
Assesses and categorizes each error by type and severity, along with other potential attributes.
Establishes statistical confidence for each assessed annotation.

By paying attention to these aspects, QM's quality assurance process ensures that data labeling is not only accurate but also precise, well-defined, and fully aligned with the goals of the AI training model.

QM's QA processes are systematically structured to ensure efficiency and accuracy, with the following steps:

Analyzing Label Guides: Reviewing the label guides for the object being checked, specific to each object class.
Creating Taxonomies: Developing a taxonomy for each object that needs to be checked, laying the groundwork for the QA metrics.
Designing QA Processes: Designing and testing the QA process and pipelines, tailored for each task or object class.
Consolidation: Consolidating the QA processes and object taxonomies after rigorous testing.
Ramping Up: Implementing and scaling the QA process as needed.

In addition, QM technology is built upon two core principles: nano-tasks and the repeated execution of nano-tasks.

The first principle consists of breaking down complex QA tasks into individual, one-dimensional decisions, referred to as 'nano-tasks.' This method significantly eases the reviewer's burden by isolating each decision or attribute within an image.

By treating each part of an image as a separate task, the risk of missed errors or inaccuracies due to cognitive overload is minimized. This segmentation also allows for easier identification of ambiguity or edge cases.

The nano-tasks are organized into logical decision trees, structuring the decision-making in a hierarchical fashion that begins with broad questions and narrows down to specifics based on previous answers, ensuring that no superfluous questions are asked. This saves time and resources, making the process more efficient.
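
A decision tree of nano-tasks can be sketched as a small data structure. The following is an illustrative assumption, not Quality Match's implementation: the `NanoTask` class, the `run_tree` helper, and the two questions are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class NanoTask:
    question: str
    # Maps an answer to the next NanoTask, or to a final verdict string.
    branches: dict = field(default_factory=dict)

def run_tree(task, answer_fn):
    """Walk the tree, asking only the questions the earlier answers require."""
    asked = []
    node = task
    while isinstance(node, NanoTask):
        answer = answer_fn(node.question)
        asked.append(node.question)
        node = node.branches[answer]
    return node, asked

# Hypothetical two-level tree for checking one annotated box.
tree = NanoTask(
    "Is there an object inside the box?",
    {
        "no": "false positive",
        "yes": NanoTask(
            "Is the object a real pedestrian (not a reflection or illustration)?",
            {"no": "false positive", "yes": "true positive"},
        ),
    },
)

answers = {
    "Is there an object inside the box?": "yes",
    "Is the object a real pedestrian (not a reflection or illustration)?": "no",
}
verdict, asked = run_tree(tree, answers.get)
```

A reviewer who answers "no" to the first question is never shown the second one, which is how the tree avoids superfluous questions.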

The technology's second core principle involves the repeated execution of nano-tasks across different reviewers. This repetition serves both to discover ambiguities or edge cases and to provide statistical power for the QA results.

In this context, statistical power equates to the reliability with which we can trust the QA results. Gathering a variety of responses allows for an aggregated and analyzed outcome, reflecting the reliability of the combined answer. The more robust the statistical power, the higher the confidence in the result.

If a task is straightforward, early stopping of repeats can lead to cost savings. If reviewers disagree, it signifies potential ambiguity or edge cases, warranting further scrutiny through decision trees to uncover hidden complexities.

By deconstructing label guides into decision trees, each nano-task within can be repeated until statistical convergence. If convergence isn't reached, it signifies an ambiguity in the annotation, exposing edge cases. By recognizing edge cases and ambiguity early we preemptively mitigate potential error sources.
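
The interplay between repetition, early stopping, and ambiguity detection can be sketched as follows. The article does not describe the statistical method Quality Match uses, so this example substitutes a simple majority-share threshold as a stand-in; the function name `aggregate_votes` and all parameter values are assumptions.

```python
from collections import Counter

def aggregate_votes(votes, min_votes=3, max_votes=9, threshold=0.8):
    """Consume votes one by one and stop early once one answer dominates.

    Returns (answer, votes_used, converged). `answer` is None when the
    reviewers never agree strongly enough, which marks the item as ambiguous.
    """
    counts = Counter()
    used = 0
    for vote in votes:
        if used == max_votes:
            break
        counts[vote] += 1
        used += 1
        answer, top = counts.most_common(1)[0]
        if used >= min_votes and top / used >= threshold:
            return answer, used, True
    return None, used, False

# A straightforward item: reviewers agree, so only three repeats are spent.
easy = aggregate_votes(["yes", "yes", "yes", "yes", "yes"])

# A contested item: no answer ever reaches the threshold.
hard = aggregate_votes(["yes", "no", "yes", "no", "no", "yes", "yes", "no", "no"])
```

In this sketch the easy item converges after three votes, while the contested item uses all nine repeats without converging and would be routed for further scrutiny as a possible edge case.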

Combining Both

Combining both solutions into an interconnected tooling suite is easy, as both tools have robust and easily accessible APIs. To dive into Kili Technology’s API, you can read the documentation.

The ROI of Great Labeling and QA

Sharing stories is much more impactful than theory. Here, we’re going to dive into the stories of two customers that have used Kili Technology and Quality Match to label datasets to perfection.

The first customer, a technology-driven agricultural company, embarked on a mission to revolutionize sustainable farming practices through the implementation of a failproof computer vision model. The primary challenge was the accurate detection of weeds in soil, with the objective of enhancing crop production and resource management.

Before adopting Kili Technology, they struggled with the limitations of open-source tools as their project expanded. Upon integrating Kili Technology into their workflow, they enjoyed its versatility and user-friendly interface for diverse labeling tasks, coupled with the automation potential of its API. In a span of just three months, they achieved a remarkable 30% reduction in labeling costs, amplifying their labeling efficiency.

Our second use case involves Quality Match. They collaborated with a customer on a project that scrutinized the efficiency and cost-effectiveness of two quality assurance methods: the customer's existing object detection model combined with Quality Match's quality assurance system, versus the traditional adder-approver quality assurance method. Additionally, the customer sought to reuse the corrected data to bootstrap from existing machine learning models.

The project revolved around the detection and correction of both False Negatives (FNs) and False Positives (FPs) within three object classes. Given the safety-critical nature of this project, high precision and attention to detail were of utmost importance, especially in edge cases such as object reflections and illustrations.

Our QA project began with an in-depth analysis and collaboration with the customer to understand their goals and extract key questions for the QA. These questions, transformed into nano-tasks, were connected and visualized as decision trees. We call this a QA pipeline.

We created two distinct pipelines:

Detecting and Correcting FNs: We broke down the frames into "image tiles" or sub-frames and QAed them individually. This allowed reviewers to concentrate on specific areas and detect even the smallest objects.
Detecting and Correcting FPs: We crafted a series of nano-tasks for this pipeline, culminating in a decision tree to identify true positives or false positives.
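
The tiling idea behind the FN pipeline can be sketched in a few lines. This is an assumption-laden illustration rather than the pipeline actually used: the function name `tile_frame`, the 512-pixel tile size, and the frame resolution are chosen for the example.

```python
def tile_frame(width, height, tile_size):
    """Return (left, top, right, bottom) boxes that cover the whole frame."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    tiles = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            # Tiles on the right and bottom edges are clipped to the frame.
            right = min(left + tile_size, width)
            bottom = min(top + tile_size, height)
            tiles.append((left, top, right, bottom))
    return tiles

# Hypothetical full-HD frame split into 512-pixel tiles.
tiles = tile_frame(width=1920, height=1080, tile_size=512)
```

For this frame the sketch yields 12 tiles in a 4 by 3 grid. A practical variant might add some overlap between neighbouring tiles so that objects lying on a tile border are not cut in half.
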
The project yielded promising results:
- FP Detection: With statistical power higher than 98%, we identified 15.88% of annotated objects as FPs, and their taxonomies and error severity were precisely cataloged.
- FN Detection: The rate of FNs was extremely low.

We recommended that the customer focus on FP detection in the future.

Conclusion

While the digital world is moving from a code-centric to a data-centric paradigm, building datasets remains a complex task. Engineers all around the world are looking at the best ways to build their data labeling pipeline, and learn about best practices others have discovered through trial and error.

To solve that issue, many companies around the world are currently trying to make dataset building as easy and efficient as possible. Among them, two contestants stand out: Kili Technology, with its relentless focus on data labeling efficiency and quality, and Quality Match, with its innovative and unparalleled approach to structuring and processing QA-related input and drawing meaningful conclusions. When used in tandem, these two solutions provide the Holy Grail of synergy that ML engineers have been looking for.
