Domain-Specific LLM Benchmarks Guide: The 2026 Map of Vertical AI Evaluations

General-purpose AI benchmarks have saturated, and the field has fragmented into open-source vertical evaluations for domain-specific knowledge such as medicine, law, finance, science, code, cybersecurity, multilingual reasoning, and multimodal expert work. This guide maps the major public domain-specific LLM benchmarks in 2026, explains how they are built, and shows why even the strongest of them still leave production teams exposed without expert human review.

AI Summary

  • Gartner projects more than half of enterprise GenAI models will be domain-specific by 2027, versus 1% in 2024.
  • HealthBench, the canonical 2026 medical evaluation, uses 48,562 rubric criteria written by 262 physicians across 26 specialties and 60 countries.
  • LegalBench-RAG (6,858 expert-annotated query-answer pairs) is the first benchmark to evaluate the retrieval half of legal RAG, where most production failures originate.
  • MMLU-ProX measures up to a 24.3-point performance gap between high- and low-resource languages on the same parallel questions.
  • Claude Opus 4.5 scores 80.9% on SWE-Bench Verified but only 45.9% on the SEAL harness; for the same model, switching harness cuts the score by nearly half.
  • Kili Technology pairs the public benchmark layer with verified domain experts (clinicians, attorneys, chemists, security analysts) and audit-ready evaluation traceable enough to satisfy the EU AI Act and NIST AI RMF.

Introduction

The 2025 saturation of MMLU, GPQA, and SWE-Bench Verified did not end AI evaluation. It pushed evaluation downward into specialized domains. By 2027, Gartner forecasts that more than half of the generative AI models in enterprise use will be domain-specific, up from 1% in 2024. Spend on specialized GenAI models is projected at $1.1 billion in 2025, the fastest-growing slice of a $14.2 billion total. The market is bending toward verticals, which makes vertical benchmarks load-bearing for procurement, compliance, and deployment decisions.

This is also where the failure modes get sharper. A general-purpose leaderboard average rarely predicts model performance on domain-specific tasks such as oncology triage, contract review, or transaction monitoring. Domain benchmarks are meant to fill that gap, and many of them are excellent. HealthBench grades responses against rubrics written by 262 physicians. ChemBench, published in Nature Chemistry, finds that the best LLMs beat the expert chemists in its study on average across 2,788 expert-curated questions. LegalBench is a 162-task collaborative evaluation built by 40 lawyers and law professors.

But strong benchmarks do not equal safe deployment. The same Claude Opus 4.5 that scores 80.9% on SWE-Bench Verified scores 45.9% on the SEAL harness. Only about 5% of LLM medical evaluations use real patient data. ChemBench's authors flag that frontier models give "overconfident predictions" on basic chemistry. The story of 2026 benchmarks is two-sided: dense, useful coverage of target domains, and a persistent gap between leaderboard scores and production behavior that only verified human experts can close.

Why Have AI Benchmarks Fragmented Into Domain-Specific Evaluations?

For a decade, general-purpose benchmarks did the work of separating frontier models from the rest. That separation has collapsed. According to the Stanford 2025 AI Index, MMMU, GPQA, and SWE-Bench scores rose by 18.8, 48.9, and 67.3 points respectively in a single year. MMLU now sits above 88% for every frontier model. Once a benchmark stops differentiating, it stops being a benchmark; it becomes a passing requirement.

The response has been vertical fragmentation. New evaluations target the specific kinds of reasoning that matter in regulated, expert, or high-cost domains: clinical multi-turn dialogue, statutory interpretation, multi-step financial computation, lab-protocol safety, code that interacts with real codebases, multilingual cultural context, and multimodal expert work. The shared property of these benchmarks is that they are built by domain practitioners (physicians, attorneys, chemists, security analysts) rather than by ML researchers writing general questions. Cross-domain harnesses like BIG-bench cover an array of specialized reasoning types across professional contexts, but enterprise teams typically need narrower, single-domain evaluations that test the model's capabilities on their specific tasks.

This shift reflects something more than research convenience. The same dynamic driving Gartner's 50% forecast, that enterprise deployments are vertical-shaped, is what makes vertical evaluation necessary. A bank evaluating a financial-reasoning model is not asking whether it scores 92% on a general aggregate. It is asking whether the model can read a 10-K and not hallucinate a number. Domain benchmarks try to answer that. They mostly succeed at narrowing the question. They mostly fail at closing it.

What Is a Domain-Specific LLM, and How Does It Differ from a General-Purpose Model?

A domain-specific LLM is a large language model adapted to perform specialized tasks in a particular domain (medicine, law, finance, code) with greater accuracy and reliability than a general-purpose model that lacks depth in specialized knowledge. The adaptation typically uses one or more of three techniques: fine-tuning a foundation model on domain-specific data, retrieval-augmented generation (RAG) that connects the model to an external knowledge base of domain documents, or prompt engineering that scaffolds the model's reasoning steps for specific tasks. Some teams also train custom models from scratch when domain-specific constraints (privacy, on-premise deployment, regulatory audit) rule out commercial APIs, though this is the most resource-intensive path.

The primary difference between general-purpose models and domain-specific models is not architecture. It is training data and evaluation. Both can use the same base model. What separates them is whether the model's knowledge has been shaped by domain-specific data, including case law for legal professionals, structured clinical notes for medical AI, and annotated financial reports for finance, and whether the evaluation tasks used to validate the model reflect real-world workflows in that specific field. A model fine-tuned on medical literature but evaluated only on generic Q&A benchmarks is not a domain-specific model in any useful sense. It is a generic model with extra training data.

This is why domain-specific evaluation and continuous evaluation loops matter. Without benchmarks built by domain practitioners, there is no way to determine whether a model genuinely interprets specialized terminology (legal jargon, medical history, financial data) or simply pattern-matches well enough to pass surface-level tests. The rest of this guide maps the public benchmarks that try to answer that question across each major domain, and where they fall short.

What Are the Most Important Medical AI Benchmarks in 2026?

Medical AI evaluation has the longest pedigree of any vertical. The field has moved from saturated multiple-choice ladders to rubric-graded multi-turn conversations, and the methodological gap between exam scoring and clinical scoring is now the central design question.

MedQA / MultiMedQA: saturated

  • Format: USMLE-style multiple choice
  • Scale: Standard medical licensing question set
  • Status: Saturated. PaLM 67.6% → Med-PaLM 2 86.5% → frontier models above 95%

MedQA, and the MultiMedQA suite that bundles it with other medical question sets (introduced with Med-PaLM in Singhal et al.'s 2023 Nature paper), became the standard medical multiple-choice ladders. Beyond licensing-style questions, MultiMedQA also probes factuality, potential harm, and the model's ability to map free-text clinical documentation onto standardized diagnostic codes. They no longer differentiate frontier models, but the multi-axis design they introduced (correctness plus harm plus coding fidelity) is the template most newer medical benchmarks build on. Specialized models in healthcare are tested for precision in reading complex terminology where a general model is likely to misread or oversimplify.

HealthBench: the unsaturated successor

  • Format: 5,000 multi-turn conversations, rubric-graded
  • Scale: 48,562 conversation-specific rubric criteria written by 262 physicians across 26 specialties and 60 countries
  • Status: Unsaturated. GPT-3.5 Turbo 16% → GPT-4o 32% → o3 60%

The unsaturated successor is HealthBench, released by OpenAI in May 2025. GPT-4.1 nano now beats GPT-4o at one twenty-fifth of the cost, a useful proxy for how fast capability is moving in medicine.

The methodological design is what matters most. HealthBench grades against physician-authored rubrics, not against gold answers. Each conversation has bespoke criteria (what the model must mention, what it must avoid, what level of caveat is appropriate) and an LLM judge scores against those criteria. This is the same evaluation pattern that working clinical reviewers use when assessing trainees, and it is the same pattern that vertical AI deployments need internally.
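The rubric-grading pattern described above can be sketched in a few lines. This is a minimal illustration, not HealthBench's actual implementation: the `Criterion` class, the point values, and the normalization (points earned over the maximum achievable positive points, clipped to the range 0 to 1) are assumptions made for the example, and the hard-coded `met` flags stand in for the verdicts an LLM judge would return.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    points: int  # positive = desired behavior, negative = behavior to avoid
    met: bool    # judge verdict (hypothetical; hard-coded for this sketch)


def rubric_score(criteria: list[Criterion]) -> float:
    """Points earned over max achievable positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for one conversation.
example = [
    Criterion("Advises urgent care for chest pain", 10, True),
    Criterion("Asks about symptom duration", 5, False),
    Criterion("Recommends a specific prescription dose", -8, True),
]
score = rubric_score(example)  # (10 - 8) / 15
```

Note how the negative criterion pulls the score down even though the most important positive criterion was met. That is the property that separates rubric grading from gold-answer matching.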

The unsolved problem: realism

A 2025 systematic review of 761 LLM medical evaluation studies found that only about 5% assessed models on real patient care data. Most benchmarks, including HealthBench, use synthetic conversations. A randomized trial referenced in the same review found that giving physicians access to GPT-4 did not significantly improve their clinical reasoning performance, despite the model's strong standalone scores. The lesson: a top HealthBench score is necessary but not sufficient. You still need clinicians grading model outputs on actual encounters before you can trust the model in a workflow.

How Do Open-Source Legal Benchmarks Evaluate LLMs?

Legal benchmarks have stratified into reasoning evaluations, retrieval evaluations, and rubric-graded practitioner-task evaluations. Models perform progressively worse as the evaluation moves closer to actual attorney work.

LegalBench: the breadth benchmark

  • Format: 162 tasks across six types of legal reasoning
  • Scale: Built by 40 lawyers, computational practitioners, and law professors
  • Status: The dominant English-language legal benchmark

LegalBench is described by its authors as "an ongoing open science effort." Its strength is breadth: rule recall, issue spotting, rule application, statutory interpretation, contract clause classification, and rhetorical understanding all sit inside one harness. The combined evaluation tests whether a model can interpret statutes, analyze legal precedents, and resolve linguistic ambiguity the way an attorney would on a real matter. Its weakness is that it tests reasoning over clean inputs.

LegalBench-RAG: the retrieval benchmark

  • Format: Query-answer pairs over real legal documents
  • Scale: 6,858 expert-annotated pairs over a 79-million-character corpus (NDAs, M&A agreements, contracts, privacy policies)
  • Status: First benchmark to evaluate the retrieval half of legal RAG

Production legal AI does not get clean inputs. It gets a 200-page master services agreement and a question that requires finding the right clause first. LegalBench-RAG addresses that gap. In production, retrieval is where most legal AI systems fail: the model reasons fine over the right context, but the right context never reaches it.
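One way to make "the right context never reaches it" measurable is to score retrieval separately from generation. The sketch below computes precision and recall over character spans. It is an assumption-laden illustration: the half-open `(start, end)` span representation, the function names, and the example offsets are hypothetical, and LegalBench-RAG's own scoring code may differ.

```python
Span = tuple[int, int]  # half-open character range [start, end) in the corpus


def covered_chars(spans: list[Span]) -> set[int]:
    """Expand spans into the set of character offsets they cover."""
    chars: set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_precision_recall(retrieved: list[Span], gold: list[Span]) -> tuple[float, float]:
    """Precision: share of retrieved text that is relevant.
    Recall: share of relevant text that was retrieved."""
    r, g = covered_chars(retrieved), covered_chars(gold)
    overlap = len(r & g)
    precision = overlap / len(r) if r else 0.0
    recall = overlap / len(g) if g else 0.0
    return precision, recall


# Hypothetical example: the relevant clause sits at offsets 1000-1200.
gold = [(1000, 1200)]
retrieved = [(900, 1100), (5000, 5100)]  # one partial hit, one miss
precision, recall = span_precision_recall(retrieved, gold)
```

Here a third of the retrieved text is relevant and half of the relevant clause is recovered, so a downstream model would be reasoning over an incomplete clause regardless of how well it reasons.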

LawBench and PLawBench: the practitioner-task benchmarks

  • LawBench: 51 LLMs evaluated across 20 Chinese legal tasks structured by three cognitive levels (memorization, understanding, application). Authors conclude open-source models are "still a long way from usable."
  • PLawBench (2026 preprint): 850 questions, ~12,500 rubric items grounded in real legal workflows. None of 10 evaluated frontier LLMs achieves strong performance.

For non-English jurisdictions, LawBench finds that even supervised fine-tuning produces only modest gains. PLawBench, grounded in consultation, case analysis, and document generation tasks, is currently a 2026 preprint and should be cited as such.

The pattern across the legal stack is consistent: models do well on isolated reasoning tasks, weaker on retrieval over real document corpora, and worst on rubric-graded practitioner tasks that mirror actual attorney work. Practicing attorneys are the only graders who can validate jurisdiction-specific outputs reliably. The benchmarks abstract that judgment; production deployments cannot.

Can LLMs Be Trusted on Financial Reasoning Benchmarks?

Financial benchmarks expose a recurring weakness: small numerical errors cascade through multi-step reasoning, and average-case accuracy hides the worst-case behavior that actually matters in regulated workflows.

FinBen: the breadth benchmark

  • Format: 36 datasets, 24 tasks across 7 categories (information extraction, textual analysis, QA, generation, risk, forecasting, decision-making)
  • Scale: 21 LLMs evaluated at release
  • Status: The widest-coverage open finance evaluation

Financial evaluation has converged around FinBen as the broad reference. It also includes a trading task category, which most other open benchmarks omit.

FinQA, ConvFinBench, TAT-QA, BizFinBench: the reasoning ladders

  • FinQA: Multi-step reasoning programs over financial documents; ~84% human expert F1 ceiling
  • TAT-QA: Table-and-text reasoning over real financial reports
  • ConvFinBench: Conversational finance QA
  • BizFinBench: Cross-concept business analysis; finds models "struggle with complex scenarios requiring cross-concept reasoning"

The narrower ladder is FinQA, which annotates each question with a multi-step reasoning program over financial documents. FinQA exposes the central failure mode of LLM financial reasoning: small arithmetic errors cascade. A model that misreads one cell of a balance sheet and propagates the error downstream produces a confidently wrong answer. FinanceBench takes the same logic further, testing whether a model can compute composite metrics like EBITDA or P/E ratios from raw filings without dropping a sign or a decimal. BizFinBench's findings confirm the same brittleness FinQA exposed years earlier.
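The cascade is easy to reproduce with a toy interpreter for multi-step reasoning programs. The step format used here (an operation plus two arguments, where `#k` refers to the result of step `k`) is loosely modeled on the kind of program FinQA annotates, but the exact syntax, the table keys, and the figures are assumptions for illustration.

```python
import operator

OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def run_program(steps, table):
    """Execute steps in order; '#k' reads the result of step k,
    any other string is looked up in the table, numbers pass through."""
    results = []

    def resolve(arg):
        if isinstance(arg, str) and arg.startswith("#"):
            return results[int(arg[1:])]
        if isinstance(arg, str):
            return table[arg]
        return arg

    for op, a, b in steps:
        results.append(OPS[op](resolve(a), resolve(b)))
    return results[-1]


# Year-over-year revenue growth: (rev_2024 - rev_2023) / rev_2023
growth = [("subtract", "rev_2024", "rev_2023"), ("divide", "#0", "rev_2023")]

correct = run_program(growth, {"rev_2024": 1250.0, "rev_2023": 1000.0})
# Same program, but one cell misread by a factor of ten (hypothetical error).
misread = run_program(growth, {"rev_2024": 1250.0, "rev_2023": 100.0})
```

The correct inputs give 25% growth. The single misread cell is used in both steps, so the program returns 1,150% growth with no internal signal that anything went wrong.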

Why benchmark scores understate financial risk

Financial errors are asymmetric in cost. A model that hallucinates a revenue figure in an SEC filing summary, or misclassifies a transaction in an AML pipeline, has direct downstream consequences. Public benchmarks do not measure that asymmetry. They measure averages. A 92% score that produces 8% wrong financial outputs in production is a regulatory liability, not a deployment success. Expert review on production outputs is the layer benchmarks cannot provide.

How Well Do LLMs Perform on Scientific and Chemistry Benchmarks?

Science benchmarks reveal a calibration problem that other domains hint at but chemistry makes explicit: high accuracy on average coexists with overconfident wrong answers on basics, and that mismatch is dangerous in laboratory contexts.

ChemBench: peer-reviewed in Nature Chemistry

  • Format: Question-answer pairs, with a human baseline from 19 expert chemists
  • Scale: 2,788 question-answer pairs
  • Venue: Published in Nature Chemistry (2025)
  • Headline finding: Best LLMs outperformed best human chemists on average
  • Caveat: Models "struggle with some basic tasks and provide overconfident predictions"

The strongest recent peer-reviewed citation in this space is ChemBench. Read the full paper and the headline becomes complicated. The authors report no correlation between molecular complexity and model accuracy, which suggests that high scores reflect memorization of training-set chemistry rather than chemical reasoning. The model that beats expert chemists on average still misses elementary safety and analytical-chemistry questions, and reports its wrong answers with the same confidence as its right ones. For a domain where wrong answers can mean lab accidents or unsafe synthesis routes, "outperforms humans on average" is the wrong metric. Calibration and worst-case behavior matter more.
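Calibration can be measured directly. The sketch below computes expected calibration error (ECE), a common calibration metric that is swapped in here for illustration; it is not necessarily the analysis the ChemBench authors ran. The equal-width binning, the bin count, and the example confidences are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy,
    computed over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical model that reports 95% confidence on every answer
# but is right only half the time.
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [True, True, False, False]
ece = expected_calibration_error(confidences, correct)
```

A perfectly calibrated model scores 0. The hypothetical model above scores 0.45, which is the overconfidence pattern in numeric form: identical confidence on right and wrong answers.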

SciBench: the broader science benchmark

  • Format: College-level open-ended problems in math, physics, chemistry
  • Best-model score: 43.22%
  • Status: Not saturated; no single prompting strategy dominates across subjects

SciBench shows that the gap between exam performance and laboratory competence mirrors the one the medical literature documents: strong test-taker, weak practitioner.

How Are Coding Benchmarks Evolving Beyond HumanEval?

Code evaluation has moved through three generations: function-level benchmarks that saturated, contamination-resistant harvesters that try to outrun training cutoffs, and harness-sensitive end-to-end evaluations where the same model produces wildly different scores depending on the test setup.

HumanEval and MBPP: saturated

  • Status: Saturated. The BigCodeBench paper states this directly: HumanEval and MBPP "have been saturated by recent model releases."

BigCodeBench: practical library-driven tasks

  • Format: Python tasks requiring diverse function calls and complex instruction-following
  • Replaces: Function-level benchmarks like HumanEval/MBPP

LiveCodeBench: contamination-resistant

  • Format: Continuously harvested competitive-programming problems from LeetCode, AtCoder, Codeforces
  • Design goal: Test cases postdate model training cutoffs
  • Authors' claim: "Resistant to both test set contamination and the pitfalls of LLM judging"

A model that has seen the test set in pretraining will score well on it regardless of capability. LiveCodeBench is the standard response.
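The core of the contamination-resistant design is a date filter: only problems released after a model's training cutoff count toward its score. A minimal sketch, assuming each problem carries a release date; the record layout, problem IDs, and cutoff date are hypothetical and do not reflect LiveCodeBench's actual data format.

```python
from datetime import date


def post_cutoff(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical problem records.
problems = [
    {"id": "p1", "released": date(2024, 3, 1)},
    {"id": "p2", "released": date(2025, 2, 15)},
    {"id": "p3", "released": date(2025, 9, 30)},
]

eligible = post_cutoff(problems, training_cutoff=date(2025, 1, 1))
eligible_ids = [p["id"] for p in eligible]
```

Because cutoffs differ per model, the eligible set differs per model too, which is why scores from this kind of benchmark are only comparable over a shared evaluation window.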

SWE-Bench Verified vs SEAL: the harness sensitivity problem

  • Same model (Claude Opus 4.5): 80.9% on SWE-Bench Verified, 45.9% on SEAL
  • Implication: Roughly half the score from the same model on tasks of comparable apparent difficulty

The most striking single data point in code evaluation is harness sensitivity. SWE-bench itself was designed to push past simple coding tests by requiring models to resolve real-world GitHub issues end to end, and the Verified subset filters those issues for solvability. Even on that hardened evaluation, when you read a code-benchmark headline, you are reading a model-and-harness pair, not a model. ResearchCodeBench, a Stanford evaluation of whether models can implement recently published ML papers from scratch, finds that the best models implement under 40% of papers correctly. Code generation looks far stronger on isolated problems than on real engineering work.
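It helps to state the harness gap as both an absolute and a relative drop, since the two are easy to conflate. The arithmetic below uses the two scores quoted above; the function name is illustrative.

```python
def harness_gap(score_a: float, score_b: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a fraction of score_a)."""
    absolute = score_a - score_b
    return absolute, absolute / score_a


absolute, relative = harness_gap(80.9, 45.9)
summary = f"{absolute:.1f} points, {relative * 100:.1f}% relative drop"
# summary == "35.0 points, 43.3% relative drop"
```

So "half the score" is shorthand: the model loses 35 points, or about 43% of its SWE-Bench Verified score, when measured on the other harness.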

What About Cybersecurity, Multilingual, and Multimodal Evaluations?

These three domains share a common dynamic: public benchmarks have improved sharply but cannot replicate the conditions (adversarial environments, regional languages, real expert workflows) where production deployments actually live.

Cybersecurity: CyberSecEval and CyberMetric

  • CyberSecEval / CyberSOCEval (Meta): Extends coverage to malware analysis and threat-intelligence reasoning. Authors note prior benchmarks "do not fully assess the scenarios most relevant to real-world defenders."
  • CyberMetric: 80/500/2,000/10,000-question RAG-grounded benchmark built from NIST standards, RFCs, and academic literature. Validated against 30 human participants and 200+ expert review hours.

Cybersecurity evaluation moved from generic capture-the-flag scoring to operational tasks in 2025. CyberSecEval and CyberMetric are the leading examples. The methodological standard in cybersecurity benchmarking has risen sharply, but public benchmarks still cannot replicate adversarial conditions in private environments.

Multilingual: MMLU-ProX

  • Format: 11,829 parallel questions across 29 languages
  • Headline finding: Performance gaps "of up to 24.3%" between high- and low-resource languages on the same questions
  • Most affected: African languages

Multilingual evaluation is where the production gap is largest. MMLU-ProX, introduced at EMNLP 2025, makes this measurable. For enterprises operating outside English, in Europe or elsewhere, the implication is direct: an English MMLU score does not predict same-quality performance in French, Arabic, or Swahili, and the gap widens as resources thin.
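Because the questions are parallel, the language gap reduces to a per-language accuracy comparison over the same items. A minimal sketch, with made-up results for three languages; the data layout and function names are assumptions, not MMLU-ProX's evaluation code.

```python
def accuracy_by_language(results: dict[str, list[bool]]) -> dict[str, float]:
    """results maps a language code to per-question correctness on the SAME questions."""
    lengths = {len(r) for r in results.values()}
    if len(lengths) != 1:
        raise ValueError("parallel evaluation needs the same question count per language")
    return {lang: sum(r) / len(r) for lang, r in results.items()}


def max_gap_points(results: dict[str, list[bool]]) -> float:
    """Largest accuracy gap between any two languages, in percentage points."""
    acc = accuracy_by_language(results)
    return (max(acc.values()) - min(acc.values())) * 100


# Hypothetical results on four parallel questions.
results = {
    "en": [True, True, True, False],
    "fr": [True, True, False, False],
    "sw": [True, False, False, False],
}
gap = max_gap_points(results)
```

The length check is the important design choice: without strictly parallel questions, a gap could reflect question difficulty rather than language.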

Multimodal: MMMU and MMMU-Pro

  • MMMU: 11,500 multimodal expert questions across 30 subjects, 183 subfields. Approaching its human ceiling (76.2–88.6%); Gemini 3 Flash at 87.63%.
  • MMMU-Pro: Filters text-only-answerable questions, adds distractors, embeds questions inside images. Scores drop 16.8–26.9% from MMMU across all evaluated models.

Multimodal evaluation has followed the same saturation-then-fragmentation arc. MMMU is approaching ceiling, and the successor MMMU-Pro hardens the evaluation. The 16.8–26.9% score drop from MMMU to MMMU-Pro reflects that hardening, not a capability regression: models exploit text-only shortcuts when image and question are separable, and MMMU-Pro removes the shortcut.

Why Do Domain-Specific Benchmarks Still Fail in Production?

Three failure modes recur across every domain covered above.

Failure mode 1: Exam performance ≠ workflow performance

The 5% real-patient-data finding from the medical systematic review is one example. The Stanford ResearchCodeBench finding that models implement under 40% of recent ML papers is another. Models that ace standardized questions often falter on the unstructured, multi-step work that real practitioners do.

Failure mode 2: Contamination

Pretraining corpora include NIST standards, RFCs, USMLE archives, SEC filings, LeetCode problems, and most of the academic chemistry literature. A 2024 GSM8K analysis showed accuracy dropping 13% when contaminated examples were removed. Contamination-resistant designs like LiveCodeBench help, but the broader corpus exposure problem is structural. Domain benchmarks built on public expert content are by definition partially in the training set of any frontier model.
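A first-pass contamination check is to look for long word sequences that a benchmark item shares with a candidate training document. The n-gram overlap heuristic below is a common technique swapped in for illustration; it is not the method used by any benchmark named here, and the whitespace tokenization, the window size, and the example strings are assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-word windows in the text, lowercased, split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the item's n-grams that also appear in the document."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item & ngrams(corpus_doc, n)) / len(item)


# Hypothetical strings; n=3 only because the examples are short.
item = "A patient presents with acute chest pain"
leaked = "Exam archive: a patient presents with acute chest pain radiating to the arm"
clean = "Quarterly revenue rose on strong cloud demand"

leaked_ratio = overlap_ratio(item, leaked, n=3)
clean_ratio = overlap_ratio(item, clean, n=3)
```

Exact-match overlap catches verbatim leakage only. Paraphrased or translated copies of a benchmark item pass this check, which is part of why the exposure problem is structural.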

Failure mode 3: Harness sensitivity

The 80.9% versus 45.9% Opus 4.5 split is the cleanest illustration: change the test harness and the score falls by nearly half, on tasks of comparable nominal difficulty. The headline number you read is downstream of dozens of evaluation choices that rarely make it into the benchmark report.

A counter-intuitive cross-domain finding

TRIDENT-Bench, the first systematic LLM safety benchmark spanning finance, medicine, and law, concludes that "domain specialization alone does not guarantee ethical robustness, and in some cases may increase failure rates." A finance-specialized model is not automatically safer for finance. Specialization can narrow capability while widening blind spots, particularly on safety-adjacent edge cases that fall outside the specialization training distribution. The same paper that confirms vertical benchmarks are useful also confirms that they are not sufficient.

How Is Regulation Reshaping Domain Benchmarks?

Benchmarks are no longer purely academic instruments. The EU AI Act incorporates them in several binding provisions. Eitel-Porter et al.'s 2025 review traces benchmark references through Articles 15(2), 51(1), 58(2), 66(g), and 68(3): for high-risk systems, benchmarks are expected to play a "significant role" in compliance with accuracy, robustness, and cybersecurity requirements; for general-purpose AI models with systemic risk, they are described as "fundamental" to classification.

In the United States, NIST AI 600-1, the Generative AI Profile released in July 2024, extends the AI Risk Management Framework specifically to generative-AI risks: confabulation, IP exposure, harmful content, and downstream misuse. The Profile does not mandate specific benchmarks, but it expects organizations to maintain evidence of evaluation against the risks it enumerates. Audit trails, not screenshots of leaderboards, are the deliverable.

For organizations deploying domain-specific AI in regulated industries, this changes the procurement calculation. A vendor presenting a HealthBench score is no longer presenting evaluation. They are presenting one input to evaluation. The full deliverable is reproducible, auditable evidence that the deployed system performs adequately on the regulator's risk dimensions, validated by qualified experts. Compliance and security evaluations now check whether models adhere to industry standards and the regulatory frameworks governing finance, law, and healthcare, and that adherence has to be testable on demand. That evidence is something benchmarks contribute to, not something they replace.

What Does Production-Ready Domain Evaluation Look Like?

The honest answer is that it is multi-layered, and benchmarks are the cheapest layer.

A production-grade evaluation stack typically runs three layers:

  • Automated metrics (accuracy, BLEU, F1, latency) give continuous signal at scale.
  • LLM-as-judge evaluations, especially rubric-graded judges of the kind HealthBench pioneered, give scalable approximations of expert judgment on subjective dimensions.
  • Human expert review, the most expensive layer, is the only one that produces ground truth on the actual outputs your model generates in your actual workflow.

Each layer narrows what the next one has to examine. Automated metrics flag regressions; LLM judges localize them; experts adjudicate the cases where the judges disagree.
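The escalation logic between the three layers can be sketched as a routing function. Everything specific here is an assumption for illustration: the thresholds, the field names, the use of at least two judges, and the rule that a case passing the automated check skips the later layers (a real stack would likely also sample passing cases for expert spot checks).

```python
def route(case: dict, metric_floor: float = 0.8, judge_pass: float = 0.5) -> str:
    """Decide which layer settles a case.

    case["metric"]: automated score in [0, 1]
    case["judge_scores"]: scores in [0, 1] from two or more LLM judges
    """
    # Layer 1: automated metrics flag regressions.
    if case["metric"] >= metric_floor:
        return "pass_automated"

    # Layer 2: LLM judges examine the flagged case.
    scores = case["judge_scores"]
    if len(scores) < 2:
        raise ValueError("need at least two judge scores to detect disagreement")
    verdicts = {s >= judge_pass for s in scores}
    if verdicts == {True}:
        return "pass_judges"
    if verdicts == {False}:
        return "fail_judges"

    # Layer 3: judges disagree, so a domain expert adjudicates.
    return "expert_review"


# Hypothetical cases.
decisions = [
    route({"metric": 0.90, "judge_scores": []}),
    route({"metric": 0.60, "judge_scores": [0.7, 0.8]}),
    route({"metric": 0.60, "judge_scores": [0.2, 0.1]}),
    route({"metric": 0.60, "judge_scores": [0.9, 0.2]}),
]
```

The point of the structure is cost control: the expensive expert layer only sees the cases the cheaper layers could not settle.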

Two practical points sit underneath this stack. First, evaluation needs a rigorous and repeatable testing process so that practitioners can establish baselines and compare models across fine-tuning runs, RAG retrieval changes, and prompt-engineering revisions. A model that performs well in a one-off test but cannot be evaluated identically a month later is not production-ready, however high its benchmark score. Second, every layer of the stack depends on data quality at the input. High-quality, accurately annotated domain-specific data is what distinguishes a useful evaluation from a noisy one across fine-tuning datasets, RAG knowledge bases, and the rubrics human experts use to grade outputs. Effective annotation processes, run by people who understand the domain, are what allow the data to reflect the nuances and edge cases that matter in the specific field.

MIT Technology Review's 2026 piece on the limits of benchmarks puts it sharply: "Today's benchmarks resemble school exams, one-off, standardized tests" that bear little resemblance to how AI is actually used. The piece argues for a Human-AI Collaboration paradigm in which competence is treated as relational and longitudinal, the way junior doctors and lawyers are evaluated continuously inside real workflows under supervision. That is the direction domain evaluation is heading, and it is the gap that no public benchmark closes on its own.

The Real Lesson of Domain Benchmarks

Vertical benchmarks are not a replacement for general benchmarks. They are a reaction to general-benchmark saturation. They give procurement teams sharper questions to ask and give research teams cleaner targets to optimize against. HealthBench, LegalBench-RAG, ChemBench, FinBen, BigCodeBench, MMLU-ProX, MMMU-Pro: each is a real advance over the generic leaderboards they partially displace.

What they do not do is close the production gap. The 5% real-patient-data finding, the 80.9-to-45.9 harness collapse, the 24.3-point multilingual gap, ChemBench's overconfidence finding, and TRIDENT's specialization-without-safety result all point in the same direction: the strongest public benchmark in any vertical is a filter, not a verdict. Benchmarks tell you which models are worth evaluating against your real workflow. Real-workflow evaluation, graded by qualified domain experts, is what tells you whether the model is safe to ship.

For enterprises building in regulated verticals, the practical implication is that the benchmark layer and the expert-review layer have to coexist. Public benchmarks give you comparability and external validation. Verified domain experts (clinicians, attorneys, chemists, security analysts, multilingual specialists) give you ground truth on the outputs the model actually produces in your stack. The 2027 enterprise GenAI deployment that Gartner is projecting will run on both.

What Reliable Domain-Specific Evaluation Actually Requires

Public domain-specific benchmarks have done what saturated general benchmarks could not: they have given enterprises a credible filter for which large language models are worth shortlisting in a specific field. They have not given a verdict. The failure modes documented above (the 5% real-patient-data gap, the 80.9-to-45.9 harness collapse, ChemBench's overconfidence finding) are visible only when domain experts grade the model's outputs on real workflow data.

That work sits below the benchmark layer. It needs verified domain experts, evaluation traceable enough to satisfy the EU AI Act and NIST AI RMF, and infrastructure to run evaluations as rigorously as the benchmarks they extend. Kili Technology provides this layer as a managed service: 2,000+ verified domain specialists, evaluation as a first-class service rather than a tooling add-on, and full traceability from annotator decision to model score, with European data sovereignty and on-premise deployment for regulated industries.
