Introduction: How expert-driven data labeling transforms unstructured data into high-performing AI
The trajectory of artificial intelligence has reached an inflection point. As models have grown more sophisticated across computer vision, natural language processing, geospatial analysis, and document intelligence, a counterintuitive truth has emerged: the primary constraint on performance is no longer model architecture. It is data.
Modern AI systems are, as researchers describe them, “data-hungry”—requiring significant amounts of labeled data for effective training. The resulting challenges of data collection, large-scale labeling, and quality enhancement compound as applications move from research environments into production. Across domains from medical imaging to autonomous systems, organizations face a common bottleneck: vast pools of unlabeled data exist, but only a fraction carries the annotations needed for supervised learning. In clinical settings, for instance, large repositories of pathological images remain undiagnosed because expert labeling at scale is prohibitively expensive. The same pattern repeats in satellite imagery, financial documents, and industrial inspection—wherever domain knowledge is required to interpret raw data correctly.
This scarcity has driven a fundamental shift in how data labeling is approached. More data does not necessarily mean better AI. As MIT’s FutureTech research notes, AI models are increasingly trained on better-labeled and more diverse datasets rather than simply larger ones, reflecting a growing recognition that quality outweighs volume. The economics of modern AI development now prioritize data that is well-annotated and contextually rich over brute-force accumulation of examples. Andrew Ng, a prominent figure among machine learning practitioners, has framed this directly: if the majority of ML work involves data preparation, then ensuring data quality becomes the most critical task for any machine learning team, according to AIMultiple’s research on data quality.
The industry has converged on a hybrid paradigm: to train better machine learning models, AI takes on labeling tasks at scale wherever possible, while humans guide and validate the results. This division of labor is driven by the limitations of purely automated approaches.
Research on reinforcement learning from human feedback demonstrates that while AI-generated labels can approximate human judgment, they remain sensitive to prompt optimization, task complexity, model bias, and the predictive capabilities of the labeling model itself. These factors limit the ability of automated systems to fully replicate human annotation quality, particularly for samples that are ambiguous or domain-specific. As the RLTHF research paper observes, the samples that challenge an AI labeler are often the ones most critical for adapting models to specialized tasks.
The hybrid model is now well-documented in practice. According to the iMerit State of AI in the Enterprise study, 96% of companies say that human-in-the-loop (HITL) involvement is essential or nice to have for AI/ML projects, and 86% say it is strictly essential. As a reminder, HITL labeling leverages human data labelers’ judgment to create, train, fine-tune, and test machine learning models. Organizations combining AI-assisted processing with human-in-the-loop verification in document workflows often report accuracy levels approaching the upper bounds of production-grade performance—thresholds that neither humans nor machines reliably achieve in isolation. This collaboration matters most in sensitive applications such as healthcare diagnostics, financial compliance, and legal review, where errors carry high downstream costs.
Within this hybrid framework, subject-matter experts occupy a structural role that extends well beyond initial annotation. SMEs are not optional participants called in for edge cases; they are embedded across the entire data lifecycle. In customer-facing AI deployments, privacy and security constraints often mean that only experts within the customer organization have complete visibility into the data corpus. This makes their involvement essential not only for labeling accuracy but also for prompt optimization, guideline development, and validation of model outputs against domain-specific standards. The RLTHF research makes this explicit: when fine-tuning, service providers cannot access sensitive data such as the actual customer data corpus, and SMEs become the only actors capable of ensuring that hard-to-annotate samples receive correct labels.
What follows is an examination of what high-quality data means in practice, how modern labeling techniques operationalize expert involvement, and why organizations that treat SME integration as optional will find their models underperforming where it matters most.
What “High-Quality Data” Actually Means Today
The definition of high-quality training data has evolved from the early days of machine learning model training. In its simplest form, quality once meant accuracy—did the label correctly describe the data point? Today, that definition has expanded to encompass a constellation of attributes that together determine whether a dataset can support production-grade AI.
A modern high-quality dataset is built on gold standards: reference annotations created by domain experts that serve as the authoritative benchmark against which all other labels are measured. These gold standards anchor the entire annotation process, providing clear examples of correct labeling that guide both human annotators and AI-assisted labeling. Without them, teams lack a consistent target, and annotation quality becomes a matter of opinion rather than measurement.
Managed data labeling teams are still needed, especially when rapid scaling is required and AI models struggle to label complex data accurately on their own. Crowdsourcing, by contrast, distributes labeling tasks to many non-experts for speed and scale, but quality varies widely. In either case, subject matter experts must validate and guide the entire data labeling process.
With that said, expert-reviewed guidelines form the second pillar. These are not generic instructions but domain-specific protocols that capture the nuances professionals rely on when making judgments. A guideline for annotating medical imaging, for instance, must reflect the diagnostic criteria that radiologists apply—not a layperson’s approximation of what matters. Clear annotation instructions derived from these guidelines reduce ambiguity, minimize inter-annotator disagreement, and ensure consistency at scale.
Consensus processes provide a mechanism for resolving disagreements. When multiple annotators label the same data point differently, the dataset must include a structured approach to determine the correct label—whether through majority voting, expert arbitration, or weighted scoring based on annotator reliability. Without consensus, conflicting labels introduce noise that propagates through the model, degrading performance in unpredictable ways.
Traceability and multi-step validation round out the requirements. Every annotation should be traceable to its source—who labeled it, when, under what guidelines, and with what level of review. Multi-step validation ensures that labels pass through quality gates before entering the training set, catching errors that would otherwise compound during model training. Human feedback remains the golden standard in achieving alignment between model outputs and real-world expectations, a principle reinforced across recent research on preference learning such as deep reinforcement learning from human preferences.
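To make these requirements concrete, here is a minimal sketch, in Python, of an annotation record that carries the provenance traceability demands, together with a consensus step that falls back to expert arbitration when annotators disagree. The field names and the agreement threshold are illustrative assumptions, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Annotation:
    """A single label with the provenance needed for traceability."""
    asset_id: str            # which data point was labeled
    label: str               # the assigned class
    annotator_id: str        # who labeled it
    guideline_version: str   # which guidelines were in force
    is_expert: bool          # whether the annotator is a domain expert
    reviewed: bool = False   # has this label passed a review gate?

def resolve_consensus(annotations: list[Annotation], min_agreement: float = 0.66) -> str | None:
    """Majority vote across annotators; defer to an expert label when agreement is too low."""
    votes = Counter(a.label for a in annotations)
    top_label, top_count = votes.most_common(1)[0]
    if top_count / len(annotations) >= min_agreement:
        return top_label
    # Low agreement: trust an expert label if one exists, otherwise escalate.
    expert_labels = [a.label for a in annotations if a.is_expert]
    return expert_labels[0] if expert_labels else None  # None -> route to arbitration queue

# Example: two non-experts disagree, an expert breaks the tie.
labels = [
    Annotation("doc-17", "invoice", "ann-1", "v2.1", False),
    Annotation("doc-17", "receipt", "ann-2", "v2.1", False),
    Annotation("doc-17", "invoice", "sme-9", "v2.1", True),
]
print(resolve_consensus(labels))  # -> "invoice"
```

Every resolved label keeps its full audit trail, so a disputed prediction in production can be traced back to the annotators, guidelines, and review decisions that produced it.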
This composite definition yields a modern high-quality dataset that follows a predictable pattern: expert-designed, AI-assisted, expert-reviewed, and continuously monitored. The expert sets the standard for data labeling work. They design the taxonomy and guidelines; AI accelerates labeling at scale; experts review outputs and correct errors; and continuous monitoring catches drift before it degrades production performance. Each stage in the labeling process requires SME involvement—not as a luxury, but as a structural necessity.
The Cost of Poor Data Quality
Poor data fed to ML models has a multiplier effect. Training on error-containing data forces models to learn those errors as features, embedding systematic mistakes into the model’s predictions. A QCon SF 2024 presentation revealed that historical studies show machine learning project failure rates as high as 85%, with data quality issues cited as a leading cause. According to Gartner research cited by Akaike AI, poor data quality translates to average annual losses of $12.9 million per organization.
The downstream costs multiply. False predictions in production trigger operational failures. In healthcare, IBM Watson Health’s challenges with cancer diagnosis provide a sobering example—the AI system’s recommendations were often unreliable not due to the limitations of the machine learning algorithms, but because of inconsistent and incomplete patient records across different healthcare systems. In financial services, fraud detection systems built on biased or incomplete label data generate excessive false positives, eroding trust and consuming resources. In manufacturing, predictive maintenance models fail precisely when they are needed most, during the edge cases their training data failed to capture.
Retraining cycles represent another hidden cost. When production models fail, teams must diagnose the root cause, curate new training data, retrain the model, and redeploy—a process that can take weeks or months depending on the domain. These cycles are expensive in direct costs and opportunity costs alike. While teams remediate data quality failures, competitors with cleaner high quality labeled data move ahead.
Six data labeling techniques in 2026: automated labeling, hierarchical labeling, monitoring, and more
Modern data labeling is not a single technique—it is an integrated system where different approaches serve different phases of the ML lifecycle. Understanding how these techniques work together, and where expert involvement is required, is essential for building datasets that support production-grade AI. Note that this guide won't cover synthetic data or generative AI; we'll address those in a separate guide.
Automation-first techniques
Model-based pre-annotation
The first category uses ML models to annotate data or generate predictions with rule-based heuristics to accelerate initial dataset creation. In model-based pre-annotation, existing models generate initial labels that human annotators then validate and correct. For example, in natural language processing for bioscience, data scientists are already using LLMs for text classification and named entity recognition. Still, they may call on subject-matter experts, such as biologists or pharmacologists, for domain-specific verification.
This approach reduces the time required to create large datasets. Instead of labeling from scratch, experts review a small sample of the output and provide corrections, which are then fed back into the model for the next iteration. If the team is augmented with data labelers, annotators can build on the experts' initial work or follow their guidance for larger batches of data. Either way, attention is concentrated on the cases where the model is uncertain or incorrect.
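As a rough sketch of how that routing works in practice, the snippet below splits model pre-labels by confidence so that only low-confidence items consume expert time. The `model.predict` interface and the 0.85 threshold are assumptions for illustration, not a reference to any specific library.

```python
CONFIDENCE_THRESHOLD = 0.85  # assumption: tuned per project on a held-out expert-labeled sample

def pre_annotate(assets, model):
    """Split model pre-labels into auto-accepted labels and an expert review queue."""
    auto_accepted, needs_review = [], []
    for asset in assets:
        label, confidence = model.predict(asset)  # hypothetical model interface
        record = {"asset": asset, "label": label, "confidence": confidence}
        if confidence >= CONFIDENCE_THRESHOLD:
            auto_accepted.append(record)   # enters the dataset pending spot checks
        else:
            needs_review.append(record)    # routed to an SME or an expert-guided annotator
    return auto_accepted, needs_review
```

Corrections collected from the review queue become new training examples, closing the loop between expert judgment and the next model version.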
Programmatic labeling
Programmatic labeling extends this logic by encoding domain knowledge directly into labeling rules. Regex patterns, schema-based constraints, and distant supervision create labels at scale without requiring manual annotation of every example. In document intelligence applications, for instance, programmatic rules can automatically identify standard invoice fields, reserving human review for exceptions and novel formats.
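For illustration, a handful of regex-based labeling functions can pre-fill standard invoice fields and flag anything they miss for human review. The patterns below are deliberately simplified examples, not production rules.

```python
import re

# Each labeling function encodes a piece of SME knowledge as a rule.
LABELING_RULES = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\w[\w\-]*)", re.IGNORECASE),
    "total_amount":   re.compile(r"Total\s*(?:Due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
    "due_date":       re.compile(r"Due\s*Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}

def label_invoice(text: str) -> dict:
    """Apply rule-based extraction; anything unmatched is escalated to human review."""
    fields = {}
    for field, pattern in LABELING_RULES.items():
        match = pattern.search(text)
        if match:
            fields[field] = match.group(1)
    fields["needs_review"] = len(fields) < len(LABELING_RULES)  # exception -> SME queue
    return fields

sample = "Invoice No: INV-2041\nDue Date: 2026-03-31\nTotal Due: $1,280.50"
print(label_invoice(sample))
```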
The SME role in these techniques is to write the labeling rules, refine pre-labels, identify edge cases, and improve guidelines based on observed errors. Automation accelerates, but expert judgment ensures correctness. When automation fails—and it does, particularly on edge cases, ambiguous scenarios, and novel patterns—SME judgment becomes irreplaceable. A survey of computer vision pseudo-labeling methods notes that quality enhancement using foundation models and active selection strategies represents promising research directions precisely because automation alone cannot resolve the problem of noisy data.
Automated sampling: active learning
Active learning addresses a fundamental challenge in dataset creation: expert time is limited, and not all data points contribute equally to model improvement. Rather than labeling randomly, active learning systems identify the most uncertain or valuable samples for human review. The model surfaces examples where its predictions are least confident, directing expert attention to the cases that matter most for improving robustness.
Research on human-in-the-loop active learning describes this as a process in which humans interact with computers to enhance the efficiency of machine learning. The goal is to identify the most informative unlabeled samples and entrust them to expert “oracles” for labeling. The high quality of the newly labeled data improves learning efficiency, and consequently the required sample size drops sharply to an affordable level.
This creates what might be called the scalability paradox: as datasets grow, SME time becomes the constraint. Active learning ensures experts focus where their judgment has maximum impact rather than wasting effort on samples the model already handles well. The expert’s role shifts from exhaustive quality assurance labeling to strategic intervention—prioritizing which examples matter not just because they are difficult, but because they improve model robustness in production scenarios.
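A minimal sketch of uncertainty sampling, the simplest active learning strategy: score each unlabeled example by the entropy of the model's predicted class probabilities and route the top-k most uncertain ones to experts. The `predict_proba` interface is an assumption (it mirrors the scikit-learn convention) rather than a reference to a specific library.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_expert_review(model, unlabeled_pool, k=100):
    """Rank unlabeled samples by prediction uncertainty and return the k most informative."""
    scored = []
    for sample in unlabeled_pool:
        probs = model.predict_proba(sample)  # assumed interface
        scored.append((entropy(probs), sample))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most uncertain first
    return [sample for _, sample in scored[:k]]
```

Other selection criteria (margin sampling, query-by-committee, expected model change) slot into the same loop; the constant is that expert labeling effort goes where the model is least sure.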
Production-oriented workflows: continuous labeling and data quality monitoring
A modern, production-oriented approach recognizes that the flow of data labeling tasks never really stops. Whether that's natural language models or computer vision models, there will always be variables that force data scientists to introduce new data — new document formats in financial compliance, seasonal drift in agricultural imagery, evolving threat patterns in security applications, changes in regulations for content moderation, etc. Continuous labeling incorporates data drift detection, distributional monitoring, error resurfacing, and re-labeling into an ongoing workflow that prevents silent model degradation.
According to Evidently AI’s analysis of production ML, data drift—changes in the distribution of input features—can make models less accurate over time. Machine learning models are not “set it and forget it” solutions; data will shift, requiring model monitoring and retraining. Numerous studies on concept drift, such as Gama et al.’s foundational survey, show that production models degrade substantially when underlying data distributions shift—and that incorporating continuous drift detection and retraining mechanisms can restore accuracy and improve long-term predictive reliability.
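As a deliberately simple illustration of distributional monitoring, the sketch below runs a two-sample Kolmogorov–Smirnov test (via `scipy.stats.ks_2samp`) on each numeric feature to compare recent production data against the training reference. The p-value threshold is an assumption to be tuned per application; production systems typically add windowing, categorical tests, and alerting on top.

```python
from scipy.stats import ks_2samp

def detect_drift(reference: dict, production: dict, p_threshold: float = 0.01) -> list[str]:
    """Flag features whose production distribution differs significantly from the training reference.

    `reference` and `production` map feature names to 1-D arrays of values.
    """
    drifted = []
    for feature, ref_values in reference.items():
        statistic, p_value = ks_2samp(ref_values, production[feature])
        if p_value < p_threshold:  # small p-value -> distributions likely differ
            drifted.append(feature)
    return drifted  # non-empty result -> surface samples to SMEs and consider retraining
```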
SME involvement in continuous labeling includes validating flagged data, updating gold standards as domain realities shift, overseeing evolving taxonomies, and deciding when retraining is triggered. Without expert oversight, models can degrade for months before teams notice—a failure mode that continuous monitoring is specifically designed to prevent.
Hierarchical and multi-stage labeling: for complex computer vision projects
Some domains require labeling that cannot be done in a single pass. Hierarchical labeling addresses this by breaking complex annotation tasks into layers, each of which builds on the previous. In geospatial imagery, this might mean progressing from region identification to field boundaries to crop type to health indicators. In video labeling, the hierarchy moves from shot and scene classification to frame-level object detection. In document intelligence, the progression runs from document type to page to section to entity to relationship.
Research on semi-supervised semantic segmentation and large-scale scene understanding highlights how expensive detailed annotation is: for a complex dataset like Cityscapes, fine pixel-level annotation and quality control took more than 1.5 hours per image on average. Hierarchical workflows manage this complexity by allowing different expertise levels at different stages—a junior annotator might handle region identification, while a domain expert validates crop health assessments.
SME involvement includes designing hierarchical taxonomies, validating complex layered annotations, and reviewing a sample of the output at each stage. This ensures consistency across annotators and AI systems. The failure mode of skipping hierarchical design is annotation inconsistency and model confusion at inference time—the model receives conflicting signals about how granular concepts relate to their parent categories, producing unreliable predictions in production.
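One way to make a hierarchical taxonomy enforceable in tooling is to encode parent-child relationships explicitly and reject fine-grained labels that contradict the coarse label assigned at an earlier stage. The crop-monitoring taxonomy below is an illustrative simplification, not a real annotation schema.

```python
# Illustrative hierarchy: each label maps to its parent (None marks a root).
TAXONOMY = {
    "agricultural_region": None,
    "field": "agricultural_region",
    "wheat": "field",
    "maize": "field",
    "healthy": "wheat",          # health states hang off specific crop types in this toy example
    "rust_infection": "wheat",
}

def ancestors(label: str) -> list[str]:
    """Walk up the taxonomy from a label to its root."""
    chain = []
    parent = TAXONOMY.get(label)
    while parent is not None:
        chain.append(parent)
        parent = TAXONOMY.get(parent)
    return chain

def is_consistent(fine_label: str, coarse_label: str) -> bool:
    """A fine-grained label must sit underneath the coarse label assigned at an earlier stage."""
    return coarse_label in ancestors(fine_label)

print(is_consistent("rust_infection", "wheat"))  # True
print(is_consistent("rust_infection", "maize"))  # False -> conflicting signals, send back for review
```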
Human feedback loops: RLHF, RLAIF, and evaluation-based learning
For large language models and generative AI, traditional labeling approaches give way to evaluation-based methods. Reinforcement learning from human feedback (RLHF) has emerged as a widely adopted technique for aligning model behavior with human preferences, with applications ranging from instruction-following models to preference-based reinforcement learning. In these systems, SMEs act as evaluators of quality, safety, compliance, and reasoning—ranking outputs, providing preference judgments, and rewriting responses to demonstrate correct behavior.
Reinforcement learning from AI feedback (RLAIF) offers an alternative that collects feedback from stronger LLMs instead of humans. However, this method is limited by the capability of the AI annotators, especially for customized tasks, and suffers from intrinsic biases. The RLTHF research demonstrates that while AI-labeled data can provide coarse initial alignment, achieving fine-grained alignment with human preferences requires strategic human annotation. With just 6% human annotation, the RLTHF approach achieved a 15.9× higher return on investment compared to random human annotation.
SME involvement in these feedback loops often requires deep domain knowledge—legal experts evaluating compliance reasoning, medical professionals assessing diagnostic suggestions, financial analysts reviewing risk assessments. The goal is not just plausible output but output that matches domain-specific logic and ethical constraints.
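In practice, SME judgments are often captured as preference pairs: for a given prompt, which of two candidate responses better satisfies the domain's standards. The sketch below flattens one expert's best-to-worst ranking into the pairwise records a reward model could train on; the field names and example are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the SME judged better
    rejected: str  # response the SME judged worse

def ranking_to_pairs(prompt: str, ranked_responses: list[str]) -> list[PreferencePair]:
    """Turn an SME's best-to-worst ranking into pairwise preference records for reward modeling."""
    pairs = []
    for better_idx, worse_idx in combinations(range(len(ranked_responses)), 2):
        pairs.append(PreferencePair(prompt, ranked_responses[better_idx], ranked_responses[worse_idx]))
    return pairs

# Example: a legal SME ranks three draft answers from most to least compliant.
pairs = ranking_to_pairs(
    "Can we share this client data with a third party?",
    ["Answer citing the relevant clause", "Hedged but vague answer", "Confident but non-compliant answer"],
)
print(len(pairs))  # 3 pairwise training records from one ranked review
```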
Semantic consistency techniques: weak supervision and knowledge-graph integration
The final category addresses scenarios where perfect labels are unavailable or prohibitively expensive. Weak supervision combines multiple noisy heuristics to create probabilistic labels—no single heuristic is reliable, but their aggregated signal approaches ground truth. Knowledge graphs enforce semantic consistency across entities and relationships, ensuring that labels conform to domain ontologies and real-world constraints. A classic survey on relational learning for knowledge graphs illustrates how structured representations can enforce global consistency in predictions.
A review of pseudo-labeling methods notes that weak supervision treats labels in the labeled dataset as potentially inaccurate, updating them along with pseudo-labels during training. This approach is particularly valuable when acquiring perfect labels is impractical—either because expert time is limited or because the labeling task requires inference rather than direct observation.
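A minimal sketch of the core weak supervision idea: several noisy labeling functions vote, each weighted by an SME-estimated reliability, and the output is a probabilistic (soft) label rather than a hard one. The heuristics and weights below are illustrative; dedicated weak supervision frameworks typically learn such weights from agreement patterns rather than hard-coding them.

```python
from collections import defaultdict

# Each labeling function returns a class guess or None (abstain).
def lf_contains_amount(doc):  return "invoice" if "$" in doc else None
def lf_mentions_po(doc):      return "purchase_order" if "PO number" in doc else None
def lf_has_total_line(doc):   return "invoice" if "total due" in doc.lower() else None

# Weights reflect SME-estimated reliability; illustrative, not learned values.
LABELING_FUNCTIONS = [(lf_contains_amount, 0.6), (lf_mentions_po, 0.8), (lf_has_total_line, 0.7)]

def probabilistic_label(doc):
    """Aggregate weighted votes from noisy heuristics into a label distribution."""
    scores = defaultdict(float)
    for lf, weight in LABELING_FUNCTIONS:
        vote = lf(doc)
        if vote is not None:
            scores[vote] += weight
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()} if total else None

print(probabilistic_label("Total due: $1,280.50 ... PO number 7731"))
# e.g. {'invoice': ~0.62, 'purchase_order': ~0.38} -> soft label for training
```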
SME involvement includes determining the labeling logic underlying each heuristic, validating the resulting distributions against domain expectations, and ensuring that the aggregated signal reflects domain realism. Without expert oversight, weak supervision can converge on plausible but incorrect labels—patterns that look reasonable to the algorithm but violate domain constraints that only an expert would recognize. For example, a weak supervision system might learn that “all MRI scans taken with contrast = tumor present” because of dataset bias — but a trained medical professional knows this is false.
The Unifying Trend: SMEs Embedded Throughout the Pipeline
Across all six technique categories, a consistent pattern emerges: subject-matter experts are the quality anchor. They define gold standards that ground the entire annotation process. They shape taxonomies and labeling rules that encode domain knowledge into machine-readable formats. They review model outputs after automated labeling, catching errors that algorithms miss. They set custom performance metrics that reflect real-world requirements rather than generic benchmarks. And they sign off on final training and evaluation datasets, providing the human judgment that no automated check can replace.
This structural role has a hidden cost when ignored. Generic crowdsourced labeling—annotation performed by workers without domain expertise—produces datasets riddled with domain errors. These errors might be invisible to accuracy metrics computed against a flawed consensus, but they become painfully visible when models trained on these datasets encounter real-world edge cases. The model fails precisely where it should succeed, because its training data taught it the wrong patterns. Classic work on non-expert annotations in NLP shows that even when aggregated labels appear accurate, subtle domain errors can persist without expert oversight.
The retraining costs that follow escalate quickly. Teams must diagnose why production performance degraded, trace the failure to specific data quality issues, curate corrective training data, retrain and validate the model, and redeploy—all while customer trust erodes and competitors advance. Organizations that skip SME involvement early in the pipeline pay for it many times over in remediation costs later.
Labeling Tooling: Infrastructure for Human–AI Collaboration
The techniques described above require specialized tooling to operate at scale. Open-source tools are free to use but may not be as scalable or sophisticated as commercial platforms. Data labeling platforms must support multi-step hierarchical workflows, allowing complex data annotation tasks to be decomposed into manageable stages with appropriate review gates. They must enable an automation layer with expert review, presenting model predictions alongside tools for correction and refinement. QA workflows must be built into the platform architecture rather than bolted on as afterthoughts.
Critically, these tools must allow SMEs to contribute without requiring ML expertise. Domain experts are not software engineers; platforms that require coding to participate effectively exclude the very people whose judgment matters most. SME-friendly interfaces that abstract technical complexity while exposing domain-relevant controls are essential for operationalizing expert involvement at scale. An article from Microsoft on guidelines for human–AI interaction underscores the importance of designing AI systems that support users before, during, and after interaction—including when the system is wrong.
Integration with programmatic and continuous labeling pipelines completes the requirements. Annotation platforms must connect seamlessly with model training infrastructure, data quality monitoring systems, and production deployment pipelines. Without this integration, data quality improvements remain siloed—improving the training dataset without propagating benefits to production models.
The distinction between functional and effective tooling matters here. A functional platform has review features; an effective platform optimizes for SME time with intelligent routing, contextual guidance, and feedback loops that reduce back-and-forth. Functional platforms let experts contribute; effective platforms multiply their impact by ensuring they focus on the highest-value tasks.
The make-versus-buy calculation deserves careful consideration. Building SME-centric workflows internally requires sustained investment in UX design, workflow orchestration, and maintenance—costs that compound over time as requirements evolve and scale increases. Purpose-built platforms amortize these investments across many customers, offering capabilities that would be prohibitively expensive to develop in-house.
Why Teams Should Not Build This Infrastructure Alone
The build-versus-buy decision for data labeling infrastructure deserves strategic consideration: the costs of an internal build are easy to underestimate and often substantial.
Engineering overhead compounds over time. SME-centric workflows require constant UX iteration as expert feedback reveals friction points and missing capabilities. Every new modality or technique requires additional development—video annotation differs from image annotation differs from document annotation differs from text annotation. Internal teams rarely have the bandwidth to maintain parity with purpose-built platforms that invest full-time in these capabilities.
Maintenance burden accumulates. Annotation platforms are not static; they require ongoing updates to address bugs, security vulnerabilities, and changing requirements. Internal tools that work adequately for initial use cases often fail to scale or adapt as requirements evolve. The hidden costs of dataset creation extend beyond direct labeling expenses to include the infrastructure required to manage the process at scale.
Opportunity cost is perhaps the largest hidden expense. Engineering time spent building labeling infrastructure is time not spent on model innovation, production deployment, or customer-facing features. Organizations that build internally divert resources from their core competencies to infrastructure that exists as a solved problem in purpose-built platforms.
Brittleness poses operational risk. Custom systems lack the resilience of purpose-built platforms tested across diverse use cases. Edge cases that purpose-built platforms have encountered and handled become failure modes for internal systems encountering them for the first time. The result is unpredictable failures precisely when reliability matters most.
Purpose-built platforms ensure scalable collaboration without bottlenecks, governable quality with traceability, lower total cost of ownership, and faster time-to-production for new models. These benefits compound over time as internal alternatives fall further behind.
Integrated Techniques in Practice
The true power of modern data labeling emerges when multiple techniques work together. Four scenarios illustrate how integrated approaches solve problems that single techniques cannot address.
Document Intelligence for Financial Compliance
Financial compliance document processing combines programmatic rules, SME validation, continuous labeling, and hierarchical extraction. Programmatic rules auto-label standard invoice formats, extracting fields that follow predictable patterns. SMEs validate exceptions and new vendor formats that rules cannot handle. Continuous labeling catches drift as document templates evolve—new fields, changed layouts, updated regulatory requirements. Hierarchical labeling extracts nested information: document type to line items to entities to relationships between entities.
Without continuous monitoring, the model silently degrades as new document formats appear. Yesterday’s accuracy metrics become misleading as the distribution of production documents shifts away from the training set. SME involvement catches these shifts early, triggering retraining before compliance failures occur.
Geospatial Crop Monitoring for Agriculture
Agricultural applications illustrate hierarchical labeling at its most demanding. Tasks progress from region identification to field boundary delineation to crop type classification to health indicator assessment. Pre-annotations accelerate segmentation of common crop types. Active learning surfaces edge cases—unusual crop diseases, sensor anomalies, boundary ambiguities. Continuous monitoring flags seasonal drift and changes in sensor characteristics.
Without hierarchical design, annotations lack the granularity needed for precision agriculture decisions. A model might correctly identify that a region contains crops while failing to distinguish between crop types or health states—information that determines whether intervention is needed.
Video Safety Analytics for Surveillance
Security and safety applications demonstrate the importance of rare-event handling. Multi-stage labeling progresses from scene to activity to object interaction. Active learning surfaces rare but critical behaviors—security breaches, safety violations, anomalous patterns. SMEs arbitrate ambiguous cases, refining guidelines based on observed disagreements. Continuous labeling updates the dataset as threat patterns evolve.
Without SME-guided active learning, rare-but-critical events remain underrepresented in training data. The model performs well on common scenarios while failing on the edge cases that matter most—precisely the situations where accurate detection is critical.
LLM Training for Legal Compliance
Legal applications combine RLHF with programmatic rules, continuous labeling, and knowledge graphs. RLHF captures expert legal judgments on reasoning quality, compliance accuracy, and appropriate hedging. Programmatic rules handle simple regulatory violations that follow clear patterns. Continuous labeling updates the dataset as regulations change—new requirements, updated interpretations, evolving case law. Knowledge graphs ensure entity consistency across documents, preventing contradictions in multi-document reasoning.
Without continuous updates, the model becomes outdated as regulations evolve, creating legal liability. A model trained on yesterday’s regulations provides confidently wrong guidance on today’s requirements—a failure mode with serious consequences.
Kili Technology’s Viewpoint: Operationalizing Expert-Driven Data Creation
From Research to Production Reality
Kili Technology has partnered with organizations across industries to operationalize the principles outlined in this article. Rather than treating SME involvement as a one-time annotation phase, Kili’s approach embeds expert validation throughout the entire data lifecycle.
LCL Bank — Regulatory Compliance at Scale
France's LCL Bank faced a massive compliance challenge: processing millions of identity documents and Energy Performance Diagnostics for Know Your Customer requirements. Manual processing wasn't viable—documents varied in format across national IDs, passports, residence permits, and diverse diagnostic templates, while millions of records required extraction of structured data for regulatory reporting.
The approach combined programmatic and model-based labeling with workflows designed by subject matter experts. LCL engaged business unit personnel—including compliance and CSR management specialists—as volunteer labelers who wanted to be involved in building AI capabilities. Rallying the company around AI in this way allowed both hired and volunteer labelers to contribute their domain expertise directly to model training. Standard annotation campaigns processed approximately 5,000 documents in 2–3 weeks using the platform. LCL's trained teams created training datasets without extensive technical overhead. Once models were trained and integrated, over 13 million documents were processed in batches to verify KYC completeness.
What previously required months of manual processing was reduced to weeks. The combination of automation with subject matter expert validation enabled regulatory compliance at a scale impossible with purely manual methods.
Enabled Intelligence — Mission-Critical Geospatial AI
Enabled Intelligence provides high-quality geospatial data for U.S. defense applications. The complexity of defense-related imagery—hyperspectral satellite data, SAR images, electro-optical imagery—demanded precision rates above 95%. After evaluating 35+ platforms, Enabled Intelligence selected Kili.
Their approach used human-in-the-loop verification, integrating expert review mechanisms into the pipeline to refine AI-generated labels. The platform provided native support for various coordinate reference systems and projections, handling hyperspectral, SAR, and electro-optical imagery seamlessly.
Enabled Intelligence achieved 95%+ annotation accuracy while processing millions of geospatial labels. For high-stakes applications in defense and intelligence, the combination of AI acceleration and expert validation proved non-negotiable.
Covéa Insurance — Transforming Customer Insights
Covéa, one of France’s leading mutual insurers, needed to analyze 1.5 million unstructured customer comments from satisfaction surveys. The goal extended beyond sentiment analysis—Covéa sought to uncover deeper patterns in customer behavior to inform strategic marketing, refine service offerings, and proactively address pain points.
Initial attempts to build internal labeling infrastructure encountered obstacles: complexity increased, and the system couldn’t scale to handle diverse data types or to manage consensus among annotators. After switching to Kili, Covéa gained the ability to label text, images, videos, audio, and documents—enabling expansion beyond the initial use case. Multiple labelers and subject matter experts annotated the same assets to measure agreement levels, ensuring high-quality, reliable categorizations. Complex NLP tasks were broken into manageable stages, and the SDK enabled seamless integration with existing models.
Covéa deployed AI-powered customer satisfaction analysis across 9+ projects and 60+ users, eliminating technical debt by choosing a solution that could scale across diverse AI projects.
The Common Thread
What unites these organizations—despite operating in radically different domains—is a shared realization: high-quality AI depends on operationalizing SME involvement, not just deploying automation. Financial compliance SMEs validated document extraction at LCL Bank. Geospatial intelligence experts at Enabled Intelligence reviewed defense imagery. Customer service experts annotated satisfaction data at Covéa.
Each case demonstrates that tooling must be designed around SME workflows, not the other way around. This is where organizations that attempt internal builds typically fail—they underestimate the complexity of building scalable, SME-friendly interfaces.
The Philosophy: Expert-Authored, Continuously Validated Data
Kili’s approach aligns with the industry’s future: AI built on expert-authored, continuously validated data. The platform operationalizes the principles outlined throughout this article—multi-step hierarchical workflows, continuous labeling and monitoring, consensus and QA layers, and seamless integration of programmatic, model-based, and manual labeling.
This is not about providing tools. It is about enabling organizations to scale expert involvement without bottlenecking on manual processes. The future of AI quality depends on systems that embed SME judgment structurally—from dataset design through production monitoring—rather than treating it as an optional add-on or a problem to be automated away.
Organizations that recognize this reality and invest accordingly will build AI systems that perform where it matters most. Those that continue to treat data quality as a cost to be minimized will find their models failing precisely when stakes are highest. The choice is structural, and its consequences will compound over time.
Resources
- MIT FutureTech – AI and data-centric research: https://futuretech.mit.edu/research
- Andrew Ng on moving from model-centric to data-centric AI: https://www.linkedin.com/pulse/moving-from-model-centric-data-centric-ai-andrew-ng/
- AIMultiple on human-annotated data and data quality: https://research.aimultiple.com/human-annotated-data/
- Training language models to follow instructions with human feedback (Ouyang et al., 2022): https://arxiv.org/abs/2203.02155
- Deep reinforcement learning from human preferences (Christiano et al., 2017): https://arxiv.org/abs/1706.03741
- RLTHF: Targeted Human Feedback for LLM Alignment (Xu et al., 2025): https://arxiv.org/abs/2502.13417
- QCon SF 2024 – Why ML Projects Fail to Reach Production: https://www.infoq.com/news/2024/11/why-ml-fails/
- Gartner on the cost of poor data quality: https://www.gartner.com/en/newsroom/press-releases/2021-09-22-gartner-says-organizations-continue-to-underestimate-their-data-and-analytics-risks
- STAT on IBM Watson Health’s cancer treatment recommendations: https://www.statnews.com/2018/07/25/ibm-watson-recommended-unsafe-incorrect-treatments/
- Burr Settles – Active Learning Literature Survey (2009): https://minds.wisconsin.edu/bitstream/handle/1793/60660/TR1648.pdf
- Evidently AI – What is data drift in ML and how to detect it: https://www.evidentlyai.com/ml-in-production/data-drift
- Wong – AI-Driven Model-Retraining Architecture to Sustain Operational Accuracy in Data-Drifting Environments (2025): https://www.researchgate.net/publication/395648105
- Cityscapes Dataset for Semantic Urban Scene Understanding (Cordts et al., 2016): https://arxiv.org/abs/1604.01685
- Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks (Lee, 2013): https://www.researchgate.net/publication/280581078
- A Review of Relational Machine Learning for Knowledge Graphs (Nickel et al., 2015): https://arxiv.org/abs/1503.00759
- Cheap and Fast — But is it Good? Evaluating Non-Expert Annotations for NLP Tasks (Snow et al., 2008): https://aclanthology.org/D08-1027/
- Guidelines for Human–AI Interaction (Amershi et al., 2019) – Microsoft Research overview: https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/