AI Summary
- A 2025 review of 445 LLM benchmarks found pervasive construct-validity gaps across widely cited AI benchmarks.
- HELM's 7-metric design raised cross-model evaluation coverage from 17.9% to 96%.
- GPQA Diamond's expert-authored, expert-validated, non-expert-tested pipeline is the cleanest reusable protocol for sourcing tasks from domain experts.
- SWE-bench scores rose from 1.96% to 71.7% in one year — and a later human-filtered subset removed roughly a third of the original tasks as ambiguous or infeasible.
- LLM judge reliability hinges on rubric clarity, not chain-of-thought; the rubric is the operationalisation of the construct you claim to measure.
- Kili Technology supports the upstream work behind reliable custom AI benchmarks — expert annotator workflows, rubric design, multi-annotator validation, and ongoing re-annotation as models and production traffic evolve.
Introduction
Three years ago, "build a benchmark" meant pick a task, write a few hundred examples, publish a leaderboard. The methodology was loose because the time horizons were long: MMLU, released in 2020, took until 2024 to saturate.
That assumption is gone. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points in SWE-bench scores in a single year. Humanity's Last Exam, released in early 2025 with the explicit goal of resisting frontier reasoning models, went from 8.8% (top score, January 2025) to over 50% by April 2026. The exam was designed as the hardest standardised exam ever assembled for an LLM, and even an exam of that calibre hit the saturation curve within fifteen months. Stanford HAI's Vanessa Parli framed the resulting question bluntly: are we measuring the right thing, are the benchmarks compromised, and how should the research community evaluate models?
The pressure is sharper for teams running production systems. Public leaderboards optimise for capability ceilings; production systems break on consistency floors. An MIT NANDA analysis of roughly 300 enterprise AI deployments found only 5% reach measurable P&L impact, and the binding constraints are workflow integration and evaluation gaps, not model quality. Gartner expects more than 40% of agentic AI projects to be cancelled by 2027.
A custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public AI benchmarks are now publicly working through.
Why Are Public AI Benchmarks No Longer Enough?
Four forces have converged.
The first is saturation speed. When SWE-bench was introduced in late 2023, the best model (Claude 2) solved 1.96% of its real GitHub issues. By 2024, leading research systems and agents were solving 71.7%. The benchmark community responded with harder variants — a verified human-filtered subset and a Pro tier — but the underlying pattern is structural: any static benchmark hard for today's frontier reasoning models will be solved by next year's. A useful heuristic for benchmark builders is that top models should land below roughly 35% accuracy at launch — anything easier is already a regression eval in disguise, and the resulting scores will not separate frontier systems from each other.
The second is construct validity erosion. The 2025 Reuel et al. systematic review of 445 LLM benchmarks identified prevalent gaps in construct validity, the property that the test actually measures the capability it claims to measure. Naming a benchmark "general reasoning" or "general knowledge" doesn't establish that the score generalises to the construct. Raji et al. flagged this in 2021, calling general-purpose benchmark framing "ultimately dangerous and deceptive." The 2025 review confirms the problem is endemic, not isolated.
The third is benchmark exploitability. Recent research has shown that headline scores often measure how well a model gamed the test harness rather than how well it solved the underlying tasks. Automated scanning agents have been demonstrated that exploit structural flaws in popular AI benchmarks — for instance, the lack of strict isolation between the agent under test and the evaluator process — to achieve near-perfect scores without solving any of the tasks. The attack pattern unfolds in stages: the scanning agent probes the harness, identifies leaked grader signals, and produces outputs that satisfy the grader without solving the problem. The takeaway is unambiguous: high scores on static benchmarks can be deeply misleading, and inflated leaderboard scores can be uncorrelated with the underlying capabilities.
The fourth is the research-versus-production divergence. Public benchmarks measure peak capability on single attempts. Production systems need consistency: an Anthropic engineering analysis of τ-bench found agents hitting 60% pass@1 dropped to 25% pass^k (consistency across k trials). That gap is invisible in single-run leaderboards and catastrophic for users. The same analysis pushes teams toward pass@k and pass^k, partial-credit graders, and balanced positive/negative cases, none of which are standard on public leaderboards.
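To make that gap concrete, here is a minimal sketch of both statistics in Python. The trial data is hypothetical; pass@k uses the standard unbiased estimator from the Codex evaluation work, and pass^k is estimated naively from each task's empirical success rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k sampled
    attempts succeeds, given c successes observed across n trials."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that ALL k independent attempts succeed,
    estimated from the empirical per-trial success rate c/n."""
    return (c / n) ** k

# Hypothetical per-task trial results: task_id -> one boolean per trial
trials = {
    "refund-001": [True, True, False, True, False],
    "refund-002": [True, True, True, True, True],
}

k = 4
for task_id, outcomes in trials.items():
    n, c = len(outcomes), sum(outcomes)
    print(task_id, round(pass_at_k(n, c, k), 3), round(pass_pow_k(n, c, k), 3))
```

Running every task several times and reporting both numbers per release turns the consistency gap into something you can track, rather than something you discover in production.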
If your system processes legal contracts, diagnoses medical images, or routes financial trades, the relevant comparison isn't whether your model beats GPT-4 on MMLU. It's whether it does the specific job you need it to do, reliably, on the kind of inputs your production traffic actually contains.
What Makes a Custom AI Benchmark Actually Measure What It Claims?
Construct validity is the first principle. Before you draft a single task, write down (in a paragraph, not a vibe) what capability you're measuring, why it matters for your system, and what it would mean for the score to go up or down. If you can't articulate the construct, you can't measure it. In practice this also means aligning the benchmark to a concrete business outcome or domain task — what "right" looks like — rather than chasing an abstract aggregate score.
The Reuel et al. research proposes four validity types worth carrying through:
- construct (does it measure what it names),
- criterion (does it correlate with downstream outcomes),
- consequential (does optimising for it produce the behaviours you want), and
- external (does it generalise outside the eval set).
A benchmark that ignores any of the four is gameable.
The second principle is multi-metric coverage with explicit gaps acknowledged. The HELM framework from Stanford CRFM rejected single-number leaderboards and instead reported 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across 16 core scenarios. The point wasn't completeness; it was making trade-offs visible to anyone reading the data. Before HELM, models were evaluated on just 17.9% of its core scenarios on average; HELM raised this to 96%, putting 30 models on equal footing for direct comparison. For a custom benchmark, the lesson is to pick three or four metrics that capture the trade-offs you actually face (latency vs. accuracy, recall vs. precision, helpfulness vs. harm) and report them all, every time. A workable pattern is to lead with one primary metric for at-a-glance understanding and relegate detailed secondary metrics — and the parameters used to compute them — to an appendix that lets another team replicate the run end-to-end.
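As a rough illustration of "report them all, every time" (the metric names, values, and report fields below are hypothetical, not HELM's), the discipline can be as simple as refusing to publish a headline number without the full metric set and the parameters needed to replicate the run:

```python
from dataclasses import dataclass, field

@dataclass
class EvalReport:
    """One benchmark run, reported in full: a single primary metric for
    at-a-glance reading, all secondary metrics alongside it, and the
    parameters another team would need to reproduce the numbers."""
    run_id: str
    primary_metric: str
    metrics: dict = field(default_factory=dict)
    parameters: dict = field(default_factory=dict)

report = EvalReport(
    run_id="contracts-eval-2025-06-v3",
    primary_metric="clause_extraction_f1",
    metrics={
        "clause_extraction_f1": 0.81,
        "citation_precision": 0.93,
        "median_latency_s": 4.2,
        "harmful_refusal_rate": 0.02,
    },
    parameters={"model": "model-x", "prompt_version": "p7", "rubric_version": "r2"},
)

# The headline number must always be one of the reported metrics, never a
# separate figure computed off to the side.
assert report.primary_metric in report.metrics
```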
The third principle is balanced problem sets. Anthropic's engineering guidance phrases it directly: include cases where the behaviour should occur and cases where it shouldn't. A benchmark of "did the agent successfully cancel the order" is incomplete without "did the agent refuse to cancel an order it shouldn't have." Balanced sets are how you catch over-eager agents and miss-rate trade-offs that single-axis evaluation hides. Edge cases and adversarial items belong in the same set — silent failures hide between average and worst-case behaviour.
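A minimal sketch of what a balanced set looks like in data, with hypothetical case fields, agent behaviour, and labels:

```python
# Each case records whether the behaviour SHOULD occur, so scoring can
# separate "acted when it should" from "refused when it should".
cases = [
    {"id": "c1", "prompt": "Cancel order 1182 (within return window)", "should_act": True},
    {"id": "c2", "prompt": "Cancel order 0034 (already shipped)",      "should_act": False},
]

def score(case: dict, agent_acted: bool) -> str:
    if case["should_act"] and agent_acted:
        return "correct_action"
    if not case["should_act"] and not agent_acted:
        return "correct_refusal"
    return "over_eager" if agent_acted else "missed_action"

# Hypothetical agent behaviour, keyed by case id, purely for illustration
agent_actions = {"c1": True, "c2": True}
labels = [score(c, agent_actions[c["id"]]) for c in cases]
print(labels)  # ['correct_action', 'over_eager']
```

Reporting the over-eager and missed-action rates separately is what surfaces the trade-off that a single accuracy number hides.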
Construct validity is also where rubric design enters early. Park et al. (2025) showed empirically that evaluation criteria are the dominant factor in LLM judge reliability; chain-of-thought offers minimal gains when the rubric is clear. The rubric is the operationalisation of the construct. If the rubric is vague, the construct is vague, and no amount of grader sophistication will save the score. A clear rubric also accelerates downstream understanding: when scores diverge across model versions, the team can read the rubric and trace the divergence to a specific criterion rather than re-arguing the underlying construct.
This is also where Kili-style annotation infrastructure starts to matter, not for the eval itself, but for the upstream work of writing the rubric, validating it against expert disagreement, and revising the data collection process before any tasks are graded.
Where Should the Tasks Come From, and Who Should Write Them?
Two patterns from the public AI benchmarks dominate:
Real artefacts beat synthetic prompts.
SWE-bench drew its 2,294 tasks from real GitHub issues across 12 popular Python repositories and grades each candidate patch by running the project's actual test suite. A patch passes only if it makes the failing tests pass without breaking the passing ones (FAIL_TO_PASS / PASS_TO_PASS). The realistic construction setting, in the authors' phrasing, gave the dataset properties no synthetic prompt collection could replicate: tasks are continually updatable from new pull requests, hard to game with surface heuristics, and grounded in code that other humans actually had to review. For a custom benchmark, the implication is to mine your own systems first. Bug trackers, support tickets, rejected agent outputs, escalations to human review: these are the highest-signal task sources you have.
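A simplified sketch of the FAIL_TO_PASS / PASS_TO_PASS idea, not the official SWE-bench harness: it assumes the candidate patch has already been applied inside repo_dir and that every test is addressable by its pytest node id.

```python
import subprocess

def run_test(test_id: str, repo_dir: str) -> bool:
    """Run a single test with pytest; True if it passes."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", test_id],
        cwd=repo_dir, capture_output=True, timeout=600,
    )
    return result.returncode == 0

def grade_patch(repo_dir: str, fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """SWE-bench-style verdict: the patch must make the previously failing
    tests pass without breaking the tests that already passed."""
    fixed = all(run_test(t, repo_dir) for t in fail_to_pass)
    unbroken = all(run_test(t, repo_dir) for t in pass_to_pass)
    return fixed and unbroken
```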
Expert authoring with adversarial validation.
GPQA Diamond is the cleanest published example. Its four-stage pipeline (expert authoring, expert validation, revision, non-expert validation) produced 448 multiple-choice questions where domain PhDs reach 65% accuracy, 74% if you discount clear mistakes, but skilled non-experts (humans with 30+ minutes of unrestricted web access) reach only 34%. The Diamond subset is the high-confidence slice — questions where two domain experts agreed on the answer and a third validated it independently. The non-expert validation stage is the underappreciated part: it gives you a quantitative answer to "is this question actually hard, or just obscure?" GPQA Diamond also embeds a canary string for contamination tracing, a small touch with disproportionate downstream value.
LegalBench is the canonical example of the expert-led collaborative model: 162 tasks across 6 reasoning types (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical understanding) authored by 40+ contributors, including lawyers, law professors, and legal practitioners. The reasoning typology is itself a contribution: it forces task authors to articulate what kind of legal cognition each task is testing, which makes downstream interpretation of scores tractable.
Four operational rules follow:
- Start with real failures, then scale. Anthropic's guidance recommends 20–50 tasks drawn from production failures as a starting point; the working bar for a defensible domain benchmark sits higher, at a hand-picked, expert-labelled set of 200–1,000 examples that reflect real user journeys and edge cases for the use cases your agents actually serve in production. Treat dataset construction as a series of stages, not a one-shot exercise: a small seed set, then expansion, then iteration as failures surface.
- Write reference solutions — if your domain experts can't solve the task, the model definitely can't, and the failure tells you nothing.
- Use multiple raters per item to evaluate annotation quality. The SWE-bench Verified protocol used three annotators per item and ensembled their severity ratings to filter out infeasible or under-specified items, producing what is now the de facto standard for executable-test benchmarks. Single-rater work inherits single-rater blind spots, and too few raters strip out the human disagreement that is itself a signal about task ambiguity (see the sketch after this list).
- Keep the test set strictly separated from anything used to develop the system — train/test contamination quietly inflates scores on held-out evaluations, and the inflation is invisible until production performance disappoints.
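The sketch referenced above: a toy version of multi-rater filtering, where the severity scale, the worst-case ensembling rule, and the disagreement threshold are all illustrative choices, not the published SWE-bench Verified rubric.

```python
from statistics import pstdev

# Illustrative severity scale: 0 = fine, 1 = minor issue,
# 2 = under-specified, 3 = infeasible. Three annotators per item.
ratings = {
    "task-014": [0, 0, 1],
    "task-027": [0, 3, 2],   # the disagreement itself is a signal
    "task-031": [2, 2, 2],
}

kept, dropped, needs_review = [], [], []
for task_id, scores in ratings.items():
    worst = max(scores)                 # ensemble by worst-case severity
    if worst >= 2:
        dropped.append(task_id)         # filter under-specified / infeasible items
    else:
        kept.append(task_id)
    if pstdev(scores) >= 1.0:           # high disagreement -> flag for adjudication
        needs_review.append(task_id)

print(kept, dropped, needs_review)
```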
How Do You Grade Outputs That Don't Have a Single Right Answer?
The Anthropic taxonomy is the cleanest framing: code-based graders, model-based graders (LLM judge), and human graders. Pick the cheapest grader that works for the task — and let the cost of being wrong, not the convenience of measuring, drive the choice.
Code-based graders are the gold standard when applicable: exact match, regex, executable tests, structured output validation. The FAIL_TO_PASS / PASS_TO_PASS pattern from SWE-bench is a code-based grader; so is "does the JSON parse and contain the required fields." If the task admits a code-based grader, use it. The grader is deterministic, free to run, and impossible to game without solving the task — and for agents that produce structured outputs, this is the cheapest reliable signal you will ever have.
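A minimal example of that cheapest reliable signal, checking that a hypothetical structured legal output parses and carries its required, non-empty fields:

```python
import json

REQUIRED_FIELDS = {"case_id", "verdict", "citations"}   # hypothetical schema

def grade_structured_output(raw: str) -> bool:
    """Deterministic code-based grader: the output must parse as a JSON object
    and every required field must be present and non-empty."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return False
    return all(data[f] not in (None, "", []) for f in REQUIRED_FIELDS)

print(grade_structured_output('{"case_id": "A-12", "verdict": "allow", "citations": ["§4.2"]}'))  # True
print(grade_structured_output("verdict: allow"))  # False
```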
Model-based graders are the workhorse for open-ended outputs. Park et al. showed evaluation criteria dominate reliability; rubric-guided judges (Prometheus, G-Eval) reach Pearson correlations around 0.897 with humans rating the same outputs when the rubric is unambiguous. The LeMAJ legal evaluation framework found inter-rater agreement among humans increased 11% when reviewers used a shared rubric, and rubric-guided LLM judge configurations hit Cohen's κ of 0.75 with human consensus. The pattern across these results: rubric clarity is the constraint, not judge model size.
Three operational rules. Validate against a human-labelled golden set — the working bar in practice is 75–90% agreement with human consensus; below that, the judge is amplifying noise. Decompose into structured criteria — binary checks ("does the output cite a real case?") plus ordinal scores ("rate clarity 1–5 with anchor descriptions") outperform single overall scores in nearly every published comparison. Watch the known biases. Model-based judges show position bias, length bias, and self-preference (preferring outputs from the same model family). Randomise position, normalise length when possible, and use a different model family for judging than for the system under test.
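A sketch of what those rules look like in code, assuming a judge_fn callable that wraps whichever judge model you use; the rubric wording and criteria are illustrative, not a published standard.

```python
import json
import random

RUBRIC = """Score the answer on these criteria and reply with JSON only:
- cites_real_case: true/false (every cited case appears in the provided source list)
- follows_instructions: true/false
- clarity: integer 1-5 (1 = unreadable, 3 = understandable with effort, 5 = clear and well-structured)
"""

def judge_single(judge_fn, question: str, answer: str) -> dict:
    """judge_fn is any callable that sends a prompt to the judge model and
    returns its text reply (an assumption, not a specific vendor API)."""
    reply = judge_fn(f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}")
    return json.loads(reply)

def judge_pairwise(judge_fn, question: str, answer_a: str, answer_b: str) -> str:
    """Randomise presentation order to blunt position bias, then map the
    judge's verdict back to the original labels."""
    flipped = random.random() < 0.5
    first, second = (answer_b, answer_a) if flipped else (answer_a, answer_b)
    reply = judge_fn(
        f"Which answer better satisfies the rubric below? Reply '1' or '2'.\n"
        f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer 1:\n{first}\n\nAnswer 2:\n{second}"
    )
    picked_first = reply.strip().startswith("1")
    return ("B" if picked_first else "A") if flipped else ("A" if picked_first else "B")
```

Using a different model family behind judge_fn than the one under test is the cheapest defence against self-preference bias.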
Human graders are the calibration layer and the high-stakes layer. Reserve humans for golden-set construction, judge calibration, and tasks where the cost of a wrong answer makes a 90%-agreement model judge unacceptable: clinical safety, legal compliance, financial advice. The economics rarely support running humans across a full benchmark, but they almost always support running humans across a calibration subset — and the resulting scores anchor every other grader downstream.
How Do You Keep a Custom AI Benchmark Useful Over Time?
A custom benchmark is a versioned artefact with a maintenance schedule. Treat it that way and it stays useful; treat it as a one-time project and it expires inside a year.
Contamination defence. The Singh et al. (2024) ConTAM analysis of 13 benchmarks across 7 models found contamination has been underestimated in many prominent LLM releases, even when developers attempted decontamination. Three practical defences: hold out a private split that never goes public; date-stamp every item so you can filter to post-cutoff data per model (the LiveCodeBench approach, which date-stamps problems from competitive programming contests since May 2023); and embed a canary string, GPQA Diamond–style, so you can later test whether a model has memorised your dataset.
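A minimal sketch of the last two defences, with a hypothetical canary GUID and item schema; the date filter follows the LiveCodeBench idea of scoring a model only on items authored after its training cutoff.

```python
from datetime import date

# Canary string embedded in every published item, GPQA-style: a unique marker
# you can later probe for to detect whether the dataset leaked into training.
CANARY = "BENCHMARK CANARY GUID 6b1f0b3e-hypothetical-do-not-train"

item = {
    "id": "fin-0412",
    "question": "...",
    "created": date(2025, 3, 14),   # date-stamp every item at authoring time
    "canary": CANARY,
    "split": "private",             # the private split never leaves your infrastructure
}

def post_cutoff_items(items: list[dict], model_cutoff: date) -> list[dict]:
    """Keep only items authored after the model's training cutoff, so
    memorisation cannot inflate the score."""
    return [it for it in items if it["created"] > model_cutoff]

fresh = post_cutoff_items([item], model_cutoff=date(2025, 1, 1))
print(len(fresh))  # 1
```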
Harness isolation. A subtler failure mode is the absence of strict isolation between the system under test and the evaluator. When agents can read or write to the same filesystem as the grader, observe the grader's logs, or otherwise inspect the scoring process, automated exploits become trivial — and as recent scanning-agent research shows, headline scores in those conditions can reflect harness gaming rather than genuine capabilities. Build agent/evaluator isolation into the design, not as a follow-up.
Saturation planning. Distinguish capability evals from regression evals from day one. A capability eval and a regression eval are two different stages of the same benchmark's life: capability evals start at low pass rates (5–30%) and let you hill-climb; once they saturate above 90%, they become regression evals, where the goal flips from "can the model do this" to "did we break something that used to work." Both stages matter. A team with only capability evals goes blind once the model is good; a team with only regression evals never sees what it can't do yet, and the scores stop reflecting any meaningful comparison between systems.
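One way to make the two stages operational, with thresholds taken from the rough bands above and a regression tolerance that is a judgment call rather than a standard:

```python
def eval_stage(pass_rate: float) -> str:
    """Classify where a benchmark sits in its lifecycle (bands are illustrative)."""
    if pass_rate < 0.30:
        return "capability"      # headroom to hill-climb
    if pass_rate > 0.90:
        return "regression"      # goal flips to 'did we break something'
    return "transition"

def check_run(name: str, previous: float, current: float) -> None:
    stage = eval_stage(previous)
    if stage == "regression" and current < previous - 0.02:
        print(f"ALERT {name}: regression eval dropped {previous:.0%} -> {current:.0%}")
    else:
        print(f"{name}: stage={stage}, score={current:.0%}")

check_run("contract-clauses-v2", previous=0.94, current=0.89)
```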
Continuous re-annotation. Pipe production failures back into the eval set. Every time a user flags a wrong answer, every human override, every escalation: these are pre-validated hard cases. The BetterBench framework found that of 24 evaluated SOTA benchmarks, only 3 included CI build status and only 4 provided replication scripts; the operational rigour we apply to code we routinely fail to apply to evals. Version your benchmark against your model releases. Tag eval runs with model version, prompt version, and rubric version.
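A minimal run manifest along those lines; the field names are ours, and the point is only that every score can be traced back to the exact benchmark, model, prompt, and rubric versions that produced it.

```python
import json
from datetime import datetime, timezone

# Enough metadata to replicate or diff any eval run later.
run_manifest = {
    "benchmark_version": "v0.4.1",
    "model_version": "prod-2025-06-02",
    "prompt_version": "p12",
    "rubric_version": "r3",
    "grader": "code+llm-judge",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "git_commit": "<fill from CI>",   # left unresolved on purpose
}

with open("eval_runs.jsonl", "a") as f:
    f.write(json.dumps(run_manifest) + "\n")
```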
Standards alignment. For regulated industries, the NIST AI RMF and the NIST GenAI Profile (AI 600-1) define testing, evaluation, verification, and validation (TEVV) as a core function of trustworthy AI. Mapping your benchmark to TEVV categories isn't decorative; it's the documentation trail you'll want when the auditor arrives.
What Does This Look Like Across Industries?
Four compressed examples to ground the principles. Each is anchored to a published benchmark so the design choices are inspectable.
Healthcare. MultiMedQA combines six existing medical QA datasets with a new HealthSearchQA collection of consumer questions, then layers expert physician evaluation across multiple axes: factuality, possible harm, possible bias, scientific consensus alignment. The lesson: USMLE-style multiple-choice catches only the lower levels of Miller's pyramid (knows, knows how). Clinical safety requires open-ended generation graded by clinicians on multiple safety axes, not just answer accuracy.
Legal. LegalBench's six reasoning types plus LegalBench-RAG's expert-annotated retrieval pairs together cover both answer correctness and retrieval-precision dimensions. The lesson: in regulated, citation-heavy domains, retrieval correctness matters as much as final-answer accuracy. A benchmark that grades only the synthesis ignores the failure mode that matters most: confidently wrong citations.
Software engineering. The progression from SWE-bench to SWE-bench Verified to Terminal-Bench is the cleanest example of refinement under pressure. OpenAI's Verified work filtered roughly a third of the original items as ambiguous or infeasible, meaning a third of the original scores were noise from the benchmark, not signal from the model. The lesson: for any executable benchmark, agent harness bugs and grading-spec ambiguity cause more apparent failures than model limitations. Verify before you trust.
Finance. The Finance Agent Benchmark provides 537 expert-authored questions covering retrieval through modelling, with an agentic harness that includes Google Search and SEC EDGAR access. The lesson: financial benchmarks need expert-authored questions, real document grounding, and tool-use evaluation; pure-text QA misses the workflow. The regulatory and compliance dimension is also load-bearing: a model that's right but cites a hallucinated 10-K creates legal exposure that pure accuracy metrics never surface.
The pattern across all four: the benchmark inherits the failure modes of the domain. Generic benchmarks miss these because they're generic. Custom benchmarks earn their cost by being specific.
The Real Test of a Benchmark Is Whether You'd Trust It Tomorrow
The benchmark you build today will be consulted dozens of times before it expires. Each consultation is a decision: ship or don't, escalate or don't, retrain or don't. The cost of a bad benchmark isn't measured in eval-set creation hours; it's measured in the production decisions made in its name.
The public AI benchmarks that survived their first wave of scrutiny (HELM, GPQA Diamond, SWE-bench, LegalBench) share a small set of properties. They define the construct in writing. They use real artefacts when possible. They validate task design against domain experts before grading any model. They publish their rubrics. They plan for contamination, harness exploits, and saturation as design constraints, not afterthoughts. They version themselves like software.
These properties don't require frontier-lab budgets to replicate. They require treating the benchmark as the substrate of every downstream claim about your AI system, because that's what it is. The benchmarks that hold up are the ones built by people who understood the construct before they wrote the first task.
Ready to Build a Benchmark That Actually Measures Your AI System?
Kili Technology's data labeling and evaluation infrastructure supports the upstream work that makes custom benchmarks reliable: expert annotator workflows, rubric design and calibration, multi-annotator validation, and the continuous re-annotation cycle that keeps benchmarks useful as models and production traffic evolve. Talk to our team about benchmark design for your domain.
Resources
Benchmark Methodology Papers
- Holistic Evaluation of Language Models (HELM) – Stanford CRFM's multi-metric, multi-scenario framework
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark – Expert-authored, expert-validated, non-expert-tested protocol
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? – Real-artefact tasks with executable graders
- LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code – Date-stamped, contamination-aware design
- LegalBench: A Collaboratively Built Benchmark for Legal Reasoning – Expert-led collaborative construction with reasoning typology
- MultiMedQA / Large Language Models Encode Clinical Knowledge – Multi-axis human evaluation for medical LLMs
- Finance Agent Benchmark – Expert-authored finance benchmark with agentic harness
Benchmark Quality and Validity Research
- Measuring What Matters: Construct Validity in LLM Benchmarks – 445-benchmark systematic review
- BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices – Operational checklist for benchmark quality
- Evaluation Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter? – ConTAM analysis across 13 benchmarks × 7 models
- An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability – Rubric clarity dominates reliability
- LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation – Rubric-guided judging in legal domain
Industry Reports and Practitioner Guidance
- Demystifying Evals for AI Agents (Anthropic Engineering) – Three-grader taxonomy and capability vs. regression evals
- 2025 AI Index Report — Technical Performance (Stanford HAI) – Benchmark saturation data
- Introducing SWE-bench Verified (OpenAI) – Human-filtered SWE-bench subset methodology
- Bridging the Operational AI Gap (MIT Technology Review Insights) – Enterprise AI deployment gap analysis
Standards and Policy
- NIST AI Risk Management Framework (AI RMF 1.0) – Foundational TEVV definitions
- NIST AI Risk Management Framework: Generative AI Profile (AI 600-1) – Pre-deployment testing guidance for GAI