Domain-Specific LLM Benchmarks Guide: The 2026 Map of Vertical AI Evaluations

General-purpose AI benchmarks have saturated, and the field has fragmented into open-source vertical evaluations for domain-specific knowledge in medicine, law, finance, science, code, cybersecurity, multilingual reasoning, and multimodal expert work. This guide maps the major public domain-specific LLM benchmarks in 2026, explains how they are built, and shows why even the strongest of them still leave production teams exposed without expert human review.

Kili Technology

May 21, 2026

AI Summary

Gartner projects more than half of enterprise GenAI models will be domain-specific by 2027, versus 1% in 2024.
HealthBench, the canonical 2026 medical evaluation, uses 48,562 rubric criteria written by 262 physicians across 26 specialties and 60 countries.
LegalBench-RAG (6,858 expert-annotated query-answer pairs) is the first benchmark to evaluate the retrieval half of legal RAG, where most production failures originate.
MMLU-ProX measures up to a 24.3-point performance gap between high- and low-resource languages on the same parallel questions.
Claude Opus 4.5 scores 80.9% on SWE-Bench Verified but only 45.9% on the SEAL harness; switching harness cuts the same model's score by 35 points, or about 43%.
Kili Technology provides the secure, multi-project platform infrastructure that lets enterprises run domain evaluation programs at scale, with project-level data isolation, multi-stage review workflows, and audit-ready traceability that supports teams aligning with the EU AI Act and NIST AI RMF.

Introduction

The 2025 saturation of MMLU, GPQA, and SWE-Bench Verified did not end AI evaluation. It pushed evaluation downward into specialized domains. By 2027, Gartner forecasts that more than half of the generative AI models in enterprise use will be domain-specific, up from 1% in 2024. Spend on specialized GenAI models is projected at $1.1 billion in 2025, the fastest-growing slice of a $14.2 billion total. The market is bending toward verticals, which makes vertical benchmarks load-bearing for procurement, compliance, and deployment decisions.

This is also where the failure modes get sharper. A general-purpose leaderboard average rarely predicts model performance on domain-specific tasks such as oncology triage, contract review, or transaction monitoring. Domain benchmarks are meant to fill that gap, and many of them are excellent. HealthBench grades responses against rubrics written by 262 physicians. ChemBench, published in Nature Chemistry, found that the best models outperformed the best human chemists in its study on average across 2,788 question-answer pairs. LegalBench is a 162-task collaborative evaluation built by 40 lawyers and law professors.

But strong benchmarks do not equal safe deployment. The same Claude Opus 4.5 that scores 80.9% on SWE-Bench Verified scores 45.9% on the SEAL harness. Only about 5% of LLM medical evaluations use real patient data. ChemBench's authors flag that frontier models give "overconfident predictions" on basic chemistry. The 2026 benchmark picture is two-sided: dense, useful coverage of target domains, and a persistent gap between leaderboard scores and production behavior that only verified human experts can close.

Why Have AI Benchmarks Fragmented Into Domain-Specific Evaluations?

For a decade, general-purpose benchmarks did the work of separating frontier models from the rest. That separation has collapsed. According to the Stanford 2025 AI Index, MMMU, GPQA, and SWE-Bench scores rose by 18.8, 48.9, and 67.3 points respectively in a single year. MMLU now sits above 88% for every frontier model. Once a benchmark stops differentiating, it stops being a benchmark; it becomes a passing requirement.

The response has been vertical fragmentation. New evaluations target the specific kinds of reasoning that matter in regulated, expert, or high-cost domains: clinical multi-turn dialogue, statutory interpretation, multi-step financial computation, lab-protocol safety, code that interacts with real codebases, multilingual cultural context, and multimodal expert work. The shared property of these benchmarks is that they are built by domain practitioners (physicians, attorneys, chemists, security analysts) rather than by ML researchers writing general questions. Cross-domain harnesses like BIG-bench cover an array of specialized reasoning types across professional contexts, but enterprise teams typically need narrower, single-domain evaluations that test the model on their specific tasks.

This shift is more than a matter of research convenience. The same dynamic driving Gartner's 50% forecast, that enterprise deployments are vertical-shaped, is what makes vertical evaluation necessary. A bank evaluating a financial-reasoning model is not asking whether it scores 92% on a general aggregate. It is asking whether the model can read a 10-K and not hallucinate a number. Domain benchmarks try to answer that. They mostly succeed at narrowing the question. They mostly fail at closing it.

What Is a Domain-Specific LLM, and How Does It Differ from a General-Purpose Model?

A domain-specific LLM is a large language model adapted to perform specialized tasks in a particular domain (medicine, law, finance, code) with greater accuracy and reliability than a general-purpose model. The adaptation typically uses one or more of three techniques: fine-tuning a foundation model on domain-specific data, retrieval-augmented generation (RAG) that connects the model to an external knowledge base of domain documents, or prompt engineering that scaffolds the model's reasoning steps for specific tasks. Some teams also train custom models from scratch when domain-specific constraints (privacy, on-premise deployment, regulatory audit) rule out commercial APIs, though this is the most resource-intensive path.

The primary difference between general-purpose models and domain-specific models is not architecture; it is training data and evaluation. Both can use the same base model. What separates them is whether the model's knowledge has been shaped by domain-specific data, including case law for legal professionals, structured clinical notes for medical AI, and annotated financial reports for finance, and whether the evaluation tasks used to validate the model reflect real-world workflows in that specific field. A model fine-tuned on medical literature but evaluated only on generic Q&A benchmarks is not a domain-specific model in any useful sense. It is a generic model with extra training data.

This is why domain-specific evaluation and continuous evaluation loops matter. Without benchmarks built by domain practitioners, there is no way to determine whether a model genuinely interprets specialized terminology (legal jargon, medical history, financial data) or simply pattern-matches well enough to pass surface-level tests. The rest of this guide maps the public benchmarks that try to answer that question across each major domain, and where they fall short.

What Are the Most Important Medical AI Benchmarks in 2026?

Medical AI evaluation has the longest track record of any vertical. The field has moved from saturated multiple-choice ladders to rubric-graded multi-turn conversations, and the methodological gap between exam scoring and clinical scoring is now the central design question.

MedQA / MultiMedQA: saturated

Format: USMLE-style multiple choice
Scale: Standard medical licensing question set
Status: Saturated. Flan-PaLM 67.6% → Med-PaLM 2 86.5% → frontier models above 95%

MultiMedQA, introduced with Med-PaLM in Singhal et al.'s 2023 Nature paper, bundles MedQA with other medical question-answering datasets, and together they became the standard medical multiple-choice ladders. Beyond licensing-style questions, MultiMedQA also probes factuality, potential harm, and the model's ability to map free-text clinical documentation onto standardized diagnostic codes. They no longer differentiate frontier models, but the multi-axis design they introduced (correctness plus harm plus coding fidelity) is the template most newer medical benchmarks build on. Specialized healthcare models are tested for precision in reading complex terminology where a general model is likely to misread or oversimplify.

HealthBench: the unsaturated successor

Format: 5,000 multi-turn conversations, rubric-graded
Scale: 48,562 conversation-specific rubric criteria written by 262 physicians across 26 specialties and 60 countries
Status: Unsaturated. GPT-3.5 Turbo 16% → GPT-4o 32% → o3 60%

The unsaturated successor is HealthBench, released by OpenAI in May 2025. GPT-4.1 nano now beats GPT-4o at one twenty-fifth of the cost, a useful proxy for how fast capability is moving in medicine.

The methodological design matters most. HealthBench grades against physician-authored rubrics, not against gold answers. Each conversation has bespoke criteria (what the model must mention, what it must avoid, what level of caveat is appropriate) and an LLM judge scores against those criteria. This is the same evaluation pattern that working clinical reviewers use when assessing trainees, and it is the same pattern that vertical AI deployments need internally. Coordinating this kind of multi-specialty evaluation at enterprise scale requires platform infrastructure that can isolate data per specialty, track per-evaluator performance, and enforce step separation in review workflows so that a cardiologist's rubric judgments are not visible to the dermatology team and vice versa.
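
The rubric pattern can be sketched in a few lines. This is a minimal illustration, not HealthBench's own code: it assumes each criterion carries a signed point value (negative for behavior the response should avoid), that a judge has already decided which criteria are met, and that the score is earned points over the maximum achievable positive points, clipped to the range 0 to 1. The `Criterion` class, the `rubric_score` function, and the example criteria are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str    # what the response must do (positive) or avoid (negative)
    points: int  # positive = desirable behavior, negative = penalized behavior


def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Earned points over the maximum achievable points, clipped to [0, 1]."""
    if len(criteria) != len(met):
        raise ValueError("one met/unmet judgment is required per criterion")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    earned = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for one conversation; the judge's verdicts are given.
rubric = [
    Criterion("Advises urgent in-person care for chest pain", 10),
    Criterion("Asks about symptom duration", 4),
    Criterion("Recommends a specific prescription dose unprompted", -6),
]
judgments = [True, False, True]
score = rubric_score(rubric, judgments)
print(round(score, 3))
```

In this example the response earns 10 points for the urgent-care advice and loses 6 for the unprompted dosing suggestion, so it scores 4 of a possible 14 points.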

The unsolved problem: realism

A 2025 systematic review of 761 LLM medical evaluation studies found that only about 5% assessed models on real patient care data. Most benchmarks, including HealthBench, use synthetic conversations. A randomized trial referenced in the same review found that giving physicians access to GPT-4 did not significantly improve their clinical reasoning performance, despite the model's strong standalone scores. The lesson: a top HealthBench score is necessary but not sufficient. You still need clinicians grading model outputs on actual encounters before you can trust the model in a workflow.

How Do Open-Source Legal Benchmarks Evaluate LLMs?

Legal benchmarks have stratified into reasoning evaluations, retrieval evaluations, and rubric-graded practitioner-task evaluations. Models perform progressively worse as the evaluation moves closer to actual attorney work.

LegalBench: the breadth benchmark

Format: 162 tasks across six types of legal reasoning
Scale: Built by 40 lawyers, computational practitioners, and law professors
Status: The dominant English-language legal benchmark

LegalBench is described by its authors as "an ongoing open science effort." Its strength is breadth: rule recall, issue spotting, rule application, statutory interpretation, contract clause classification, and rhetorical understanding all sit inside one harness. The combined evaluation tests whether a model can interpret statutes, analyze legal precedents, and resolve linguistic ambiguity the way an attorney would on a real matter. Its weakness is that it tests reasoning over clean inputs.

LegalBench-RAG: the retrieval benchmark

Format: Query-answer pairs over real legal documents
Scale: 6,858 expert-annotated pairs over a 79-million-character corpus (NDAs, M&A agreements, contracts, privacy policies)
Status: First benchmark to evaluate the retrieval half of legal RAG

Production legal AI does not get clean inputs. It gets a 200-page master services agreement and a question that requires finding the right clause first. LegalBench-RAG addresses that gap. In production, retrieval is where most legal AI systems fail: the model reasons fine over the right context, but the right context never reaches it.
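
Retrieval of this kind can be scored by how much of the retrieved text overlaps the expert-annotated text. The sketch below is one way to do that, under stated assumptions: spans are half-open character offsets in a single document, spans within each list do not overlap one another, and the function names and offsets are hypothetical rather than taken from LegalBench-RAG's code.

```python
Span = tuple[int, int]  # half-open [start, end) character offsets in one document


def overlap_chars(a: Span, b: Span) -> int:
    """Number of characters shared by two spans (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def span_precision_recall(retrieved: list[Span], gold: list[Span]) -> tuple[float, float]:
    """Character-level precision and recall of retrieved spans against gold spans."""
    hit = sum(overlap_chars(r, g) for r in retrieved for g in gold)
    retrieved_len = sum(end - start for start, end in retrieved)
    gold_len = sum(end - start for start, end in gold)
    precision = hit / retrieved_len if retrieved_len else 0.0
    recall = hit / gold_len if gold_len else 0.0
    return precision, recall


# Hypothetical case: the relevant clause sits at characters 1200-1500.
gold_clause = [(1200, 1500)]
retrieved_chunks = [(1000, 1400), (5000, 5400)]
precision, recall = span_precision_recall(retrieved_chunks, gold_clause)
print(precision, round(recall, 3))
```

A retriever that returns one partly relevant chunk and one irrelevant chunk gets low precision and incomplete recall here, before the model has reasoned about anything.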

LawBench and PLawBench: the practitioner-task benchmarks

LawBench: 51 LLMs evaluated across 20 Chinese legal tasks structured by three cognitive levels (memorization, understanding, application). Authors conclude open-source models are "still a long way from usable."
PLawBench (2026 preprint): 850 questions, ~12,500 rubric items grounded in real legal workflows. None of 10 evaluated frontier LLMs achieves strong performance.

For non-English jurisdictions, LawBench finds that even supervised fine-tuning produces only modest gains. PLawBench, grounded in consultation, case analysis, and document generation tasks, is still a preprint as of 2026 and should be cited as such.

The pattern across the legal stack is consistent: models do well on isolated reasoning tasks, weaker on retrieval over real document corpora, and worst on rubric-graded practitioner tasks that mirror actual attorney work. Practicing attorneys are the only graders who can validate jurisdiction-specific outputs reliably. The benchmarks abstract that judgment; production deployments cannot.

Can LLMs Be Trusted on Financial Reasoning Benchmarks?

Financial benchmarks expose a recurring weakness: small numerical errors cascade through multi-step reasoning, and average-case accuracy hides the worst-case behavior that actually matters in regulated workflows.

FinBen: the breadth benchmark

Format: 36 datasets, 24 tasks across 7 categories (information extraction, textual analysis, QA, generation, risk, forecasting, decision-making)
Scale: 21 LLMs evaluated at release
Status: The widest-coverage open finance evaluation

Financial evaluation has converged around FinBen as the broad reference. Its decision-making category also covers trading, which most other open benchmarks omit.

FinQA, ConvFinBench, TAT-QA, BizFinBench: the reasoning ladders

FinQA: Multi-step reasoning programs over financial documents; ~84% human expert F1 ceiling
TAT-QA: Table-and-text reasoning over real financial reports
ConvFinBench: Conversational finance QA
BizFinBench: Cross-concept business analysis; finds models "struggle with complex scenarios requiring cross-concept reasoning"

The narrower ladder is FinQA, which annotates each question with a multi-step reasoning program over financial documents. FinQA exposes the central failure mode of LLM financial reasoning: small arithmetic errors cascade. A model that misreads one cell of a balance sheet and propagates the error downstream produces a confidently wrong answer. FinanceBench takes the same logic further, testing whether a model can compute composite metrics like EBITDA or P/E ratios from raw filings without dropping a sign or a decimal. BizFinBench's findings confirm the same brittleness FinQA exposed years earlier.
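
The cascade is easy to reproduce with a toy interpreter. The sketch below executes a simplified step-by-step program in the spirit of FinQA's annotations; the four supported operations, the `#k` back-reference handling as written here, and the revenue figures are illustrative assumptions, not the benchmark's actual executor or data.

```python
import re

OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}
# Matches one step such as "subtract(5829, 5735)" or "divide(#0, 5735)".
STEP = re.compile(r"(\w+)\(([^,()]+),([^,()]+)\)")


def run_program(program: str) -> float:
    """Execute steps left to right; '#k' refers to the result of step k."""
    results: list[float] = []

    def value(token: str) -> float:
        token = token.strip()
        return results[int(token[1:])] if token.startswith("#") else float(token)

    for op, left, right in STEP.findall(program):
        results.append(OPS[op](value(left), value(right)))
    if not results:
        raise ValueError("no steps found in program")
    return results[-1]


# Year-over-year growth: (current - prior) / prior, with made-up figures.
correct = run_program("subtract(5829, 5735), divide(#0, 5735)")
# Same program with one misread cell: 5735 read as 5375.
misread = run_program("subtract(5829, 5375), divide(#0, 5375)")
print(f"{correct:.4f} vs {misread:.4f}")
```

Transposing two digits in a single input changes both steps of the program, so the reported growth rate comes out several times too large while the reasoning itself looks valid.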

Why benchmark scores understate financial risk

Financial errors are asymmetric in cost. A model that hallucinates a revenue figure in an SEC filing summary, or misclassifies a transaction in an AML pipeline, has direct downstream consequences. Public benchmarks do not measure that asymmetry. They measure averages. A 92% score that produces 8% wrong financial outputs in production is a regulatory liability, not a deployment success. Expert review on production outputs is the layer benchmarks cannot provide.

How Well Do LLMs Perform on Scientific and Chemistry Benchmarks?

Science benchmarks reveal a calibration problem that other domains hint at but chemistry makes explicit: high accuracy on average coexists with overconfident wrong answers on basics, and that mismatch is dangerous in laboratory contexts.

ChemBench: peer-reviewed in Nature Chemistry

Format: Question-answer pairs, with model results compared against 19 expert chemists
Scale: 2,788 question-answer pairs
Venue: Nature Chemistry (2025)
Headline finding: Best LLMs outperformed best human chemists on average
Caveat: Models "struggle with some basic tasks and provide overconfident predictions"

The strongest recent peer-reviewed citation in this space is ChemBench. Read the full paper and the headline becomes complicated. The authors report no correlation between molecular complexity and model accuracy, which suggests that high scores reflect memorization of training-set chemistry rather than chemical reasoning. The model that beats expert chemists on average still misses elementary safety and analytical-chemistry questions, and reports its wrong answers with the same confidence as its right ones. For a domain where wrong answers can mean lab accidents or unsafe synthesis routes, "outperforms humans on average" is the wrong metric. Calibration and worst-case behavior matter more.
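
Calibration can be measured separately from accuracy. One common summary is expected calibration error, the weighted average gap between stated confidence and observed accuracy within confidence bins. The sketch below is a generic implementation, not ChemBench's methodology; the bin count and the example confidences are assumptions.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted mean of |average confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need one confidence per graded answer")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


graded = [True, True, False, False]  # both hypothetical models are 50% accurate
overconfident = expected_calibration_error([0.95] * 4, graded)
calibrated = expected_calibration_error([0.5] * 4, graded)
print(round(overconfident, 2), round(calibrated, 2))
```

Both hypothetical models have identical accuracy; only the calibration number separates the one that signals its uncertainty from the one that does not.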

SciBench: the broader science benchmark

Format: College-level open-ended problems in math, physics, chemistry
Best-model score: 43.22%
Status: Not saturated; no single prompting strategy dominates across subjects

SciBench suggests that the gap between exam performance and laboratory competence is the same gap the medical literature documents: strong test-taker, weak practitioner.

How Are Coding Benchmarks Evolving Beyond HumanEval?

Code evaluation has moved through three generations: function-level benchmarks that saturated, contamination-resistant harvesters that try to outrun training cutoffs, and harness-sensitive end-to-end evaluations where the same model produces wildly different scores depending on the test setup.

HumanEval and MBPP: saturated

Status: Saturated. The BigCodeBench paper states this directly: HumanEval and MBPP "have been saturated by recent model releases."

BigCodeBench: practical library-driven tasks

Format: Python tasks requiring diverse function calls and complex instruction-following
Replaces: Function-level benchmarks like HumanEval/MBPP

LiveCodeBench: contamination-resistant

Format: Continuously harvested competitive-programming problems from LeetCode, AtCoder, Codeforces
Design goal: Test cases postdate model training cutoffs
Authors' claim: "Resistant to both test set contamination and the pitfalls of LLM judging"

A model that has seen the test set in pretraining will score well on it regardless of capability. LiveCodeBench is the standard response.
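
The core of the date-based approach is a simple filter: score a model only on problems released after its training cutoff. A minimal sketch, assuming each problem records a release date; the `Problem` class, the problem IDs, and the dates are hypothetical and not LiveCodeBench's own data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date


def post_cutoff(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


pool = [
    Problem("p1", date(2024, 3, 1)),
    Problem("p2", date(2025, 9, 15)),
]
eligible = post_cutoff(pool, training_cutoff=date(2025, 1, 31))
print([p.problem_id for p in eligible])
```

Each model gets its own eligible set, which is why scores from date-filtered benchmarks are only comparable when the evaluation window is reported alongside them.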

SWE-Bench Verified vs SEAL: the harness sensitivity problem

Same model (Claude Opus 4.5): 80.9% on SWE-Bench Verified, 45.9% on SEAL
Implication: A 35-point drop (about 43% relative) for the same model on tasks of comparable apparent difficulty

The most telling single data point in code evaluation is harness sensitivity. SWE-bench itself was designed to push past simple coding tests by requiring models to resolve real-world GitHub issues end to end, and the Verified subset filters those issues for solvability. Even on that hardened evaluation, a code-benchmark headline describes a model-and-harness pair, not a model. ResearchCodeBench, a Stanford evaluation of whether models can implement recently published ML papers from scratch, finds that the best models implement under 40% of papers correctly. Code generation looks far stronger on isolated problems than on real engineering work. For a worked example of how these harness effects play out on a single frontier model, see our deep dive into DeepSeek V4.
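
The size of the harness effect is worth computing explicitly, because absolute and relative drops tell different stories. A small sketch using the two scores quoted above; the `score_gap` helper is a hypothetical name.

```python
def score_gap(reference: float, other: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a fraction of reference)."""
    absolute = reference - other
    return absolute, absolute / reference


# Claude Opus 4.5: SWE-Bench Verified vs the SEAL harness, as quoted in this guide.
absolute_drop, relative_drop = score_gap(80.9, 45.9)
print(f"{absolute_drop:.1f} points, {relative_drop:.1%} relative")
```

Reporting both numbers, together with the harness name, makes it harder to mistake a change in test setup for a change in the model.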

What About Cybersecurity, Multilingual, and Multimodal Evaluations?

These three domains share a common pattern: public benchmarks have improved sharply but cannot replicate the conditions (adversarial environments, regional languages, real expert workflows) where production deployments actually live.

Cybersecurity: CyberSecEval and CyberMetric

CyberSecEval / CyberSOCEval (Meta): Extends coverage to malware analysis and threat-intelligence reasoning. Authors note prior benchmarks "do not fully assess the scenarios most relevant to real-world defenders."
CyberMetric: 80/500/2,000/10,000-question RAG-grounded benchmark built from NIST standards, RFCs, and academic literature. Validated against 30 human participants and 200+ expert review hours.

Cybersecurity evaluation moved from generic capture-the-flag scoring to operational tasks in 2025. CyberSecEval and CyberMetric are the leading examples. The methodological standard in cybersecurity benchmarking has risen sharply, but public benchmarks still cannot replicate adversarial conditions in private environments.

Multilingual: MMLU-ProX

Format: 11,829 parallel questions across 29 languages
Headline finding: Performance gaps "of up to 24.3%" between high- and low-resource languages on the same questions
Most affected: African languages

Multilingual evaluation is where the production gap is largest. MMLU-ProX, introduced at EMNLP 2025, makes this measurable. For European or non-English enterprises, the implication is direct: an English MMLU score does not predict same-quality performance in French, Arabic, or Swahili, and the gap widens as resources thin.
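
Because the questions are parallel, the gap can be read directly from per-language accuracy on the same items. A minimal sketch; the language codes and scores below are made-up illustrations, not MMLU-ProX results.

```python
def language_gap(scores: dict[str, float]) -> tuple[str, str, float]:
    """Return (best language, worst language, gap in points) on parallel questions."""
    if not scores:
        raise ValueError("need at least one language score")
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst, scores[best] - scores[worst]


# Hypothetical accuracy (%) for one model on the same translated question set.
accuracy_by_language = {"en": 70.0, "fr": 66.5, "ar": 60.0, "sw": 48.0}
best, worst, gap = language_gap(accuracy_by_language)
print(best, worst, gap)
```

An enterprise deploying in a specific language can run the same comparison on its own languages rather than relying on the English headline score.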

Multimodal: MMMU and MMMU-Pro

MMMU: 11,500 multimodal expert questions across 30 subjects, 183 subfields. Approaching its human ceiling (76.2–88.6%); Gemini 3 Flash at 87.63%.
MMMU-Pro: Filters text-only-answerable questions, adds distractors, embeds questions inside images. Scores drop 16.8–26.9% from MMMU across all evaluated models.

Multimodal evaluation has followed the same saturation-then-fragmentation arc. MMMU is approaching ceiling, and the successor MMMU-Pro hardens the evaluation. The 16.8–26.9% score drop from MMMU to MMMU-Pro reflects that hardening, not a capability regression: models exploit text-only shortcuts when image and question are separable, and MMMU-Pro removes the shortcut.

Why Do Domain-Specific Benchmarks Still Fail in Production?

Three failure modes recur across every domain covered above.

Failure mode 1: Exam performance ≠ workflow performance

The 5% real-patient-data finding from the medical systematic review is one example. The Stanford ResearchCodeBench finding that models implement under 40% of recent ML papers is another. Models that ace standardized questions often falter on the unstructured, multi-step work that real practitioners do.

Failure mode 2: Contamination

Pretraining corpora include NIST standards, IETF RFCs, USMLE archives, SEC filings, LeetCode problems, and most of the academic chemistry literature. A 2024 GSM8K analysis showed accuracy dropping 13% when contaminated examples were removed. Contamination-resistant designs like LiveCodeBench help, but the broader corpus exposure problem is structural. Domain benchmarks built on public expert content are by definition partially in the training set of any frontier model.
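
A first-pass contamination check is to look for long word sequences that a benchmark item shares with a candidate corpus document. The sketch below is a crude heuristic under stated assumptions (whitespace tokenization, exact matches only, an arbitrary n-gram length); real contamination audits are considerably more involved, and the example texts are invented.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-word sequences in the text, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_document: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_document, n)) / len(item_grams)


item = "A 45 year old patient presents with chest pain"
leaked_page = "Question bank: a 45 year old patient presents with chest pain and dyspnea"
unrelated_page = "The quarterly report shows revenue growth"

leaked = overlap_ratio(item, leaked_page, n=3)
clean = overlap_ratio(item, unrelated_page, n=3)
print(leaked, clean)
```

A high ratio flags an item for closer inspection; a low ratio does not prove the item is clean, since paraphrased or translated copies would not match.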

Failure mode 3: Harness sensitivity

The 80.9% versus 45.9% Opus 4.5 split is the cleanest illustration: change the test harness and the same model's score falls by nearly half, on tasks of comparable nominal difficulty. The headline number you read is downstream of dozens of evaluation choices that rarely make it into the benchmark report.

A counter-intuitive cross-domain finding

TRIDENT-Bench, the first systematic LLM safety benchmark spanning finance, medicine, and law, concludes that "domain specialization alone does not guarantee ethical robustness, and in some cases may increase failure rates." A finance-specialized model is not automatically safer for finance. Specialization can narrow capability while widening blind spots, particularly on safety-adjacent edge cases that fall outside the specialization training distribution. The same paper that confirms vertical benchmarks are useful also confirms that they are not sufficient.

How Is Regulation Reshaping Domain Benchmarks?

Benchmarks are no longer purely academic instruments. The EU AI Act incorporates them in several binding provisions. Eriksson et al.'s 2025 review traces benchmark references through Articles 15(2), 51(1), 58(2), 66(g), and 68(3): for high-risk systems, benchmarks are expected to play a "significant role" in compliance with accuracy, robustness, and cybersecurity requirements; for general-purpose AI models with systemic risk, they are described as "fundamental" to classification.

In the United States, NIST AI 600-1, the Generative AI Profile released in July 2024, extends the AI Risk Management Framework specifically to generative-AI risks: confabulation, IP exposure, harmful content, and downstream misuse. The Profile does not mandate specific benchmarks, but it expects organizations to maintain evidence of evaluation against the risks it enumerates. Audit trails, not screenshots of leaderboards, are the deliverable. Producing those audit trails requires the underlying evaluation platform to maintain per-project traceability, role-based access logs, and versioned evaluation data. A spreadsheet of scores does not meet that bar. The infrastructure has to record who evaluated what, when, under which rubric version, and with what access permissions — and it has to do so across every active evaluation workstream simultaneously.
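
What such a record might contain can be sketched as a small data structure. The field names below are assumptions chosen to mirror the requirements in the paragraph above (who, what, when, rubric version, role); they are not a schema from NIST, the EU AI Act, or any particular platform.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvaluationRecord:
    project_id: str       # evaluation workstream the item belongs to
    item_id: str          # the model output that was graded
    evaluator_id: str     # who graded it
    evaluator_role: str   # access role held at grading time
    rubric_version: str   # which version of the rubric was applied
    verdict: str          # the evaluator's judgment
    evaluated_at: str     # ISO 8601 timestamp in UTC


def make_record(project_id: str, item_id: str, evaluator_id: str,
                evaluator_role: str, rubric_version: str, verdict: str) -> EvaluationRecord:
    """Stamp the record with the current UTC time at creation."""
    return EvaluationRecord(
        project_id, item_id, evaluator_id, evaluator_role, rubric_version, verdict,
        evaluated_at=datetime.now(timezone.utc).isoformat(),
    )


record = make_record("clinical-triage", "item-0042", "reviewer-17",
                     "Reviewer", "v3", "fails: omits escalation advice")
log_line = json.dumps(asdict(record), sort_keys=True)  # one line per judgment
print(log_line)
```

Writing one immutable line per judgment to an append-only log is one simple way to make evaluation evidence reproducible; a production system would also need access logging and retention controls that this sketch omits.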

For organizations deploying domain-specific AI in regulated industries, this changes the procurement calculation. A vendor presenting a HealthBench score is no longer presenting evaluation. They are presenting one input to evaluation. The full deliverable is reproducible, auditable evidence that the deployed system performs adequately on the regulator's risk dimensions, validated by qualified experts. Compliance and security evaluations now check whether models meet industry standards and the regulatory frameworks governing finance, law, and healthcare, and that adherence has to be testable on demand. That evidence is something benchmarks contribute to, not something they replace. Domain accuracy is only one half of the regulatory picture; the other is safety, which is why adversarial and red-teaming evaluation has become a documented compliance activity alongside capability benchmarks.

What Does Production-Ready Domain Evaluation Look Like?

The honest answer is that it is multi-layered, and benchmarks are the cheapest layer.

A production-grade evaluation stack typically runs three layers:

Automated metrics (accuracy, BLEU, F1, latency) give continuous signal at scale.
LLM-as-judge evaluations, especially rubric-graded judges of the kind HealthBench pioneered, give scalable approximations of expert judgment on subjective dimensions.
Human expert review, the most expensive layer, is the only one that produces ground truth on the actual outputs your model generates in your actual workflow.

Each layer constrains the layer below it. Automated metrics flag regressions; LLM judges localize them; experts adjudicate the cases where the judges disagree.
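
The hand-off between the three layers can be written as a routing rule. This is a minimal sketch under stated assumptions: judge scores are on a 0 to 1 scale, disagreement is measured as the spread between the highest and lowest judge score, and the 0.2 threshold is arbitrary. The function name and return labels are hypothetical.

```python
def route(metric_regressed: bool, judge_scores: list[float],
          disagreement_threshold: float = 0.2) -> str:
    """Decide which layer settles an output: metrics, LLM judges, or a human expert."""
    if not metric_regressed:
        return "pass"  # automated metrics saw no regression
    if not judge_scores:
        return "expert_review"  # nothing to localize with, so escalate
    spread = max(judge_scores) - min(judge_scores)
    if spread > disagreement_threshold:
        return "expert_review"  # judges disagree, so an expert adjudicates
    return "judge_verdict"  # judges agree, so their verdict stands


print(route(False, [0.9, 0.9]))
print(route(True, [0.40, 0.45]))
print(route(True, [0.2, 0.9]))
```

The threshold controls how much expert time the stack consumes, so it is worth tuning against a sample where expert verdicts are already known.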

Two practical points sit underneath this stack. First, evaluation needs a repeatable testing process so that practitioners can establish baselines and compare models across fine-tuning runs, RAG retrieval changes, and prompt-engineering revisions. A model that performs well in a one-off test but cannot be evaluated identically a month later is not production-ready, however high its benchmark score. Second, every layer of the stack depends on data quality at the input. Accurately annotated domain-specific data is what distinguishes a useful evaluation from a noisy one across fine-tuning datasets, RAG knowledge bases, and the rubrics human experts use to grade outputs. Good annotation processes, run by people who understand the domain, are what allow the data to reflect the edge cases that matter in practice.
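
One way to make a test run repeatable is to fingerprint everything that defines it and store the fingerprint next to the score. A minimal sketch, assuming the evaluation setup can be expressed as a JSON-serializable dictionary; the configuration keys and values are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, order-independent hash of an evaluation configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


baseline = {
    "model": "candidate-a",
    "prompt_template": "v2",
    "retriever_index": "2026-04",
    "rubric_version": "v3",
    "temperature": 0.0,
}
same_setup = dict(reversed(list(baseline.items())))  # same settings, different key order
changed_prompt = {**baseline, "prompt_template": "v3"}

print(config_fingerprint(baseline))
print(config_fingerprint(baseline) == config_fingerprint(same_setup))
print(config_fingerprint(baseline) == config_fingerprint(changed_prompt))
```

Two scores are directly comparable only when their fingerprints match; when they differ, the configuration diff shows which change (fine-tuning run, retrieval index, prompt revision) the score difference should be attributed to.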

MIT Technology Review's 2026 piece on the limits of benchmarks puts it sharply: "Today's benchmarks resemble school exams, one-off, standardized tests" that bear little resemblance to how AI is actually used. The piece argues for a Human-AI Collaboration paradigm in which competence is treated as relational and longitudinal, the way junior doctors and lawyers are evaluated continuously inside real workflows under supervision. That is the direction domain evaluation is heading, and it is the gap that no public benchmark closes on its own.

The Real Lesson of Domain Benchmarks

Vertical benchmarks are not a replacement for general benchmarks. They are a reaction to general-benchmark saturation. They give procurement teams sharper questions to ask and give research teams cleaner targets to optimize against. HealthBench, LegalBench-RAG, ChemBench, FinBen, BigCodeBench, MMLU-ProX, MMMU-Pro: each is a real advance over the generic leaderboards they partially displace.

What they do not do is close the production gap. The 5% real-patient-data finding, the 80.9-to-45.9 harness collapse, the 24.3-point multilingual gap, ChemBench's overconfidence finding, and TRIDENT's specialization-without-safety result all point in the same direction: the strongest public benchmark in any vertical is a filter, not a verdict. Benchmarks tell you which models are worth evaluating against your real workflow. Real-workflow evaluation, graded by qualified domain experts, tells you whether the model is safe to ship.

For enterprises building in regulated verticals, the practical implication is that the benchmark layer and the expert-review layer have to coexist. Public benchmarks give you comparability and external validation. Verified domain experts (clinicians, attorneys, chemists, security analysts, multilingual specialists) give you ground truth on the outputs the model actually produces in your stack. The 2027 enterprise GenAI deployment that Gartner is projecting will run on both.

Why Does Evaluation at Scale Need Platform Infrastructure?

The benchmarks covered in this guide each focus on one domain. An enterprise running domain-specific AI across multiple verticals — medical, legal, financial — has to run all of them at once, with separate evaluation teams that should never see each other's data. A hospital system evaluating clinical AI and contract-review AI simultaneously needs project-level data isolation between the two workstreams, role-based access so clinicians never encounter legal documents and attorneys never encounter patient data, and configurable QA workflows tuned to the rubric standards of each domain. Add per-evaluator performance tracking across all of those projects, and the operational requirement is clear: this is data labeling and evaluation infrastructure, not a one-off service engagement.

The platform capabilities that make this possible include multi-project architecture with data isolation enforced at the project level, four distinct project roles (Admin, Team Manager, Reviewer, Labeler) that map to how evaluation teams are actually structured, multi-step review workflows with configurable sampling rates and step-separation enforcement so that the same person cannot both label and review the same item, and built-in quality metrics like honeypot testing and consensus scoring. SOC 2, ISO 27001, and GDPR compliance ensure the infrastructure itself meets the same regulatory bar as the AI systems it evaluates.
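
Three of the mechanisms named above (step separation, sampled review, and consensus scoring) are simple to state in code. The sketch below is a generic illustration of those ideas, not Kili Technology's API; all function names, user names, and the agreement measure are assumptions.

```python
import random
from collections import Counter


def pick_reviewer(labeler: str, reviewers: list[str], rng: random.Random) -> str:
    """Step separation: the reviewer of an item is never the person who labeled it."""
    eligible = [r for r in reviewers if r != labeler]
    if not eligible:
        raise ValueError("no reviewer other than the labeler is available")
    return rng.choice(eligible)


def sample_for_review(item_ids: list[str], rate: float, rng: random.Random) -> list[str]:
    """Send each item to review independently with probability `rate`."""
    return [item_id for item_id in item_ids if rng.random() < rate]


def consensus(labels: list[str]) -> float:
    """Share of annotators who chose the most common label for an item."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels).most_common(1)[0][1] / len(labels)


rng = random.Random(0)  # seeded so the sketch is reproducible
reviewer = pick_reviewer("ana", ["ana", "bo"], rng)
agreement = consensus(["compliant", "compliant", "non-compliant"])
print(reviewer, round(agreement, 3))
```

Low-consensus items are natural candidates for the expert-review layer, and the sampling rate sets how much of the remaining volume reviewers see.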

Resources

Medical Benchmarks

HealthBench (Arora et al., 2025) — OpenAI's 5,000-conversation rubric-graded medical benchmark
- https://arxiv.org/abs/2505.08775
Large language models encode clinical knowledge (Singhal et al., 2023, Nature) — Original Med-PaLM and MultiMedQA paper
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10396962/
Knowledge-Practice Performance Gap in Clinical LLMs (PMC, 2025) — Systematic review of 761 medical LLM studies
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12706444/

Legal Benchmarks

LegalBench (Guha et al., NeurIPS 2023) — 162-task collaborative legal reasoning benchmark
- https://arxiv.org/abs/2308.11462
LegalBench-RAG (Pipitone & Houir Alami, 2024) — 6,858-pair retrieval evaluation for legal RAG
- https://arxiv.org/abs/2408.10343
LawBench (Fei et al., 2023) — Chinese legal benchmark across 51 LLMs and 20 tasks
- https://arxiv.org/abs/2309.16289
PLawBench (preprint, 2026) — Rubric-based practitioner-task legal benchmark
- https://arxiv.org/pdf/2601.16669

Financial Benchmarks

FinBen (Xie et al., NeurIPS 2024) — 36-dataset financial benchmark across seven task categories
- https://arxiv.org/abs/2402.12659

Science and Chemistry Benchmarks

ChemBench (Mirza et al., Nature Chemistry, 2025) — 2,788-question chemistry evaluation against 19 expert chemists
- https://www.nature.com/articles/s41557-025-01815-x
SciBench (Wang et al., ICML 2024) — College-level scientific problem-solving benchmark
- https://arxiv.org/abs/2307.10635

Code Benchmarks

BigCodeBench (Zhuo et al., 2024) — Library-driven Python tasks with diverse function calls
- https://arxiv.org/pdf/2406.15877
LiveCodeBench (Jain et al., 2024) — Contamination-resistant competitive programming evaluation
- https://arxiv.org/pdf/2403.07974

Cybersecurity Benchmarks

CyberSOCEval / CyberSecEval (Wan et al., Meta, 2025) — SOC-task and threat-intelligence evaluation
- https://arxiv.org/abs/2509.20166
CyberMetric (Tihanyi et al., 2024) — RAG-grounded cybersecurity knowledge benchmark
- https://arxiv.org/abs/2402.07688

Multilingual and Multimodal Benchmarks

MMLU-ProX (Xuan et al., EMNLP 2025) — 29-language parallel evaluation with 24.3-point performance gaps
- https://arxiv.org/abs/2503.10497
MMMU (Yue et al., CVPR 2024) — 11,500-question multimodal expert benchmark
- https://arxiv.org/abs/2311.16502
MMMU-Pro (Yue et al., ACL 2025) — Robustness-hardened multimodal benchmark
- https://aclanthology.org/2025.acl-long.736/

Cross-Domain Safety

TRIDENT-Bench (Hui et al., 2025) — First systematic LLM safety benchmark across finance, medicine, and law
- https://arxiv.org/pdf/2507.21134

Market and Policy

Gartner GenAI Spending Forecast (2025) — Domain-specific GenAI growth projections
- https://www.gartner.com/en/newsroom/press-releases/2025-07-10-gartner-forecasts-worldwide-end-user-spending-on-generative-ai-models-to-total-us-dollars-14-billion-in-2025
Stanford 2025 AI Index Report — Benchmark progress and RAI evaluation gaps
- https://hai.stanford.edu/ai-index/2025-ai-index-report
NIST AI 600-1 Generative AI Profile (2024) — Risk management framework for GenAI
- https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence
Can We Trust AI Benchmarks? (Eitel-Porter et al., 2025) — Interdisciplinary review including EU AI Act benchmark provisions
- https://arxiv.org/pdf/2502.06559

Frequently Asked Questions

How do domain-specific benchmarks differ from general-purpose AI benchmarks?

General-purpose benchmarks like MMLU test broad knowledge across many subjects. Domain-specific benchmarks test the particular reasoning patterns that matter in a single field — clinical multi-turn dialogue, statutory interpretation, multi-step financial computation — using evaluation criteria written by practitioners in that field. The questions are harder, the grading is more granular, and the results are more predictive of how a model will perform on real work in that domain.

Why do enterprises need to run their own AI evaluations instead of relying on public benchmark scores?

Public benchmarks test models on standardized inputs under controlled conditions. Production environments are neither standardized nor controlled. The same model that scores 80.9% on one code benchmark scores 45.9% on another harness testing comparable tasks. Only evaluation against your own data, your own workflows, and your own domain experts produces evidence that the model works for your use case. Public scores are a shortlist filter, not a deployment decision.

What platform capabilities matter most for running domain evaluation at enterprise scale?

The operational requirements are project-level data isolation (so different evaluation workstreams never leak into each other), role-based access control, multi-step review workflows with configurable quality gates, and per-evaluator performance tracking. Automation matters too: API and SDK access let teams integrate evaluation into CI/CD pipelines so that every model update triggers a new evaluation cycle without manual intervention.
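
As a rough illustration of the CI/CD integration described above, the sketch below shows a release gate that fails a pipeline when any evaluation metric falls below its minimum. The metric names and thresholds are hypothetical, and fetching scores from an evaluation platform is deliberately left out, since that call depends on the SDK in use.

```python
def failed_checks(scores, thresholds):
    """Return (metric, score, minimum) for every metric below its threshold.

    A metric missing from `scores` counts as a failure, so a skipped
    evaluation cannot pass silently.
    """
    failures = []
    for metric, minimum in sorted(thresholds.items()):
        score = scores.get(metric)
        if score is None or score < minimum:
            failures.append((metric, score, minimum))
    return failures


def exit_code(scores, thresholds):
    """Print each failure and return a process exit code for the CI job."""
    failures = failed_checks(scores, thresholds)
    for metric, score, minimum in failures:
        print(f"FAIL {metric}: {score} < {minimum}")
    return 1 if failures else 0
```

A pipeline step would load the scores from the latest evaluation run and pass the result of `exit_code` to `sys.exit`, so a model update that regresses on any tracked metric blocks the release.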

How does Kili Technology support audit-ready AI evaluation for regulated industries?

Kili maintains per-project traceability from individual evaluator decisions through to aggregated scores, with role-based access logs and versioned evaluation data. The platform is SOC 2, ISO 27001, HIPAA, and GDPR compliant, and supports cloud storage integrations restricted to specific projects. This means every evaluation decision is recorded, attributable, and reproducible — the kind of evidence trail that the EU AI Act and NIST AI RMF expect organizations to maintain.
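
One way to picture traceability from individual decisions to an aggregated score is to store, alongside each score, a digest of the exact decisions that produced it. The sketch below is an assumption-laden illustration, not a description of how Kili stores data: the `Decision` fields and the `aggregate` function are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Decision:
    """Hypothetical record of one evaluator's verdict on one item."""
    item_id: str
    evaluator: str
    passed: bool
    data_version: str  # which version of the evaluation set was graded


def aggregate(decisions):
    """Compute a pass rate plus a digest of the exact decisions behind it."""
    if not decisions:
        raise ValueError("no decisions to aggregate")
    # Sort first so the digest does not depend on input order.
    ordered = sorted(decisions, key=lambda d: (d.item_id, d.evaluator))
    payload = json.dumps([asdict(d) for d in ordered], sort_keys=True)
    return {
        "pass_rate": sum(d.passed for d in ordered) / len(ordered),
        "n_decisions": len(ordered),
        "evaluators": sorted({d.evaluator for d in ordered}),
        "digest": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
```

Because the digest is computed over a canonical ordering, re-running the aggregation on the same decisions reproduces the same value, while changing any single verdict changes it. An auditor can use that property to confirm a reported score matches the recorded decisions.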

Can an enterprise use its own domain experts on the Kili platform, or does it have to use Kili's workforce?

Enterprises can onboard their own domain experts as evaluators and manage them through Kili's project roles and workflow configuration. The platform's managed expert workforce is a complementary option for teams that need to scale capacity quickly, for example when a new evaluation program launches and needs 50 additional reviewers within a week. The two approaches work together: internal experts set the rubrics and handle the most sensitive evaluations, while managed evaluators absorb volume spikes.

Run Domain Evaluation Programs at Scale

Public domain-specific benchmarks have done what saturated general benchmarks could not: they have given enterprises a credible filter for which large language models are worth shortlisting in a specific field. They have not given a verdict. The failure modes documented above — the 5% real-patient-data gap, the 80.9-to-45.9 harness collapse, ChemBench's overconfidence finding — are visible only when domain experts grade the model's outputs on real workflow data.

Kili Technology provides the platform infrastructure for that work: multi-project architecture with data isolation between evaluation workstreams, configurable review workflows with step separation, per-evaluator quality tracking, and SOC 2 / ISO 27001 / GDPR compliance. For teams that need to scale beyond internal experts, Kili's managed workforce of 2,000+ verified domain specialists is available as surge capacity.

Build an evaluation tailored to your workflow — start with our playbook on how to build a custom benchmark.
