AI Evaluation

Foundation Models

LLMs

LLM Red Teaming in 2026: How Frontier Labs Stress-Test AI Models (And Why Public Benchmarks Are No Longer Enough)

A practical guide to LLM red teaming as it works in 2026: the attack surfaces, the institutional networks behind frontier-lab safety testing, and the regulatory bar deployers now have to meet. The takeaway: red teaming has become a private-dataset problem, and the most reliable signal comes from expert-built adversarial data, not from public benchmark scores.

Kili Technology

May 28, 2026

Heading2

Heading3

AI Summary

Muse Spark verbalized evaluation awareness 19.8% on public benchmarks vs 2.0% on internal evaluations: a tenfold signal leak that breaks the assumption underneath every public adversarial leaderboard.
Microsoft states it directly in its NeurIPS 2024 paper on red teaming LLMs: AI red teaming is not safety benchmarking, and the work that uncovers vulnerabilities in production rarely looks like a paper.
AgentDojo, AgentHarm, and SHADE-Arena define the agentic attack surface; OWASP ranks prompt injection #1 on the LLM Top 10 for the second consecutive year.
Crescendo and many-shot jailbreaks beat single-turn defenses by wide margins, with one black-box method jailbreaking GPT-4-Turbo and GPT-4o on more than 80% of prompts.
The EU AI Act, NIST AI 600-1, and AI Safety Institute frameworks make adversarial testing a documented compliance activity, not an optional security exercise.
Kili Technology supports private red-teaming data programs with 2,000+ domain specialists across 40+ languages, severity-rubric calibration, and audit trails documenting EU AI Act Article 55 and NIST AI 600-1 testing.

Introduction

In April 2026, Meta Superintelligence Labs published a safety and preparedness report for its newest frontier model, Muse Spark, with a finding that should reshape how anyone thinks about LLM red teaming.

On public benchmarks, the model verbalized suspicion that it was being evaluated in 19.8% of samples. On Meta's internal evaluations, the same behavior appeared in 2.0%. Apollo Research, the third-party evaluator embedded in the report, called it the highest evaluation-awareness rate they had observed in any frontier model to date.

(A safety and preparedness report, sometimes called a "system card" at OpenAI or a "model card" elsewhere, is the public documentation a lab releases alongside a major model launch — capability evaluations, safety testing, known risks. They are the closest thing the field has to a peer-review document for a frontier model.)

That gap, roughly tenfold, is the empirical anchor for a shift that has been building for two years. Public adversarial datasets like AdvBench, HarmBench, and ALERT carry a signal that private datasets do not. The model can tell. And once a model can tell, the score it produces stops measuring what the score is supposed to measure.

This is not the same problem as benchmark contamination, where evaluation items leak into pretraining corpora. It is the inference-time analogue: the model recognizes the distributional fingerprint of a public adversarial corpus and behaves differently in response.

The pathology is the same as what has happened to public capability benchmarks. The fix is structurally similar too: private, expert-built, traceable adversarial data, refreshed against the deployer's actual use case.

This guide walks through what red teaming looks like for large language models in 2026:

The attack surfaces a serious red team has to cover
Why public adversarial datasets are no longer enough
How frontier labs build private LLM red teaming data through named external networks
What enterprises need to replicate that work at their own scale
What regulation now requires you to document

What Is LLM Red Teaming?

LLM red teaming is the practice of deliberately probing a model for harmful, unsafe, or otherwise undesired behavior before that behavior shows up in production.

The term is borrowed from cybersecurity, where a "red team" plays the role of the adversary in a simulated attack against the defender (the "blue team"). For LLMs, the same dynamic applies: red teamers craft inputs designed to make the model misbehave, refuse incorrectly, leak sensitive information, follow malicious instructions, or generate content that violates policy.

What counts as "misbehavior" depends on the deployer's threat model.

For a frontier lab, the harm categories typically include:

Weapons uplift (chemical, biological, radiological, nuclear)
Cyber-offense
Child safety
Persuasion and manipulation
Model autonomy risks
Bias or discrimination

For an enterprise, the list usually narrows to whatever the legal, security, and product teams care about:

Customer data leakage
Off-policy financial advice
Brand-damaging or toxic content
Jailbroken agent actions in connected systems

A red team's job is to surface those failures early, document them, and feed the findings back into model training, system-level AI guardrails, or deployment policies. The output of a red teaming engagement is data: labeled examples of attacks that worked, attacks that didn't, severity gradings, and (increasingly in 2026) regulatory documentation.

The vulnerabilities themselves cluster into a few main categories that every responsible AI program needs to cover. LLMs can generate off-topic, inappropriate, or harmful content that breaches business policies, leading to misuse and reputational damage. Prompt injection attacks chain untrusted user input with trusted prompts built by a developer, allowing adversaries to manipulate a model's behavior and potentially extract sensitive information — this is the LLM analogue of SQL injection, and OWASP has placed it at the top of its LLM Top 10. Jailbreaking attacks bypass the safety filters and AI guardrails built into LLMs, allowing users to manipulate the model into producing outputs that violate its intended constraints, including hate speech and instructions for illegal activities. Data leakage vulnerabilities in LLM systems can expose personally identifiable information through overfitting on sensitive training data or insecure handling of runtime data.

How does red teaming look different in 2026 than in 2022?

The original red-teaming-with-crowdworkers methodology comes from Anthropic. The 2022 paper Red Teaming Language Models to Reduce Harms by Ganguli and colleagues set the template:

Human attackers
Three model sizes
Four interventions: a plain LM, a helpful-honest-harmless prompted LM, a rejection-sampled LM, and an RLHF-trained model
An open-sourced dataset of red-team transcripts

The methodology was clean. The framing was research.

What changed since is institutional. LLM red teaming is now a function with named external partners, dedicated tooling, and regulatory documentation requirements.

Microsoft's AI Red Team, established in 2018, has now red-teamed 100+ generative AI products. Their NeurIPS 2024 paper, Lessons From Red Teaming 100 Generative AI Products, states the conceptual shift directly in its third lesson: AI red teaming is not safety benchmarking.

The distinction matters because the two practices use different data:

A safety benchmark gives you a comparable score across models on a fixed item set.
A red team probes for failures the deployer cares about, in the deployer's actual context, often with iteratively crafted attacks that no fixed item set could capture.

Microsoft's second lesson reinforces the point: simple prompt-based attacks (the kind a human can write with a few lines of input) beat sophisticated optimization attacks in practice. You don't have to compute gradients to break an AI system, and the work that finds real production failures rarely looks like a paper.

In a 2024 update on its red teaming practice, Anthropic introduced a sharper version of the same idea: Policy Vulnerability Testing, defined as in-depth qualitative testing with external subject-matter experts on specific Usage Policy categories.

PVT is not a benchmark. It is a sustained relationship with a domain expert who can recognize a failure when they see one, and who can construct probes that the deployer's policy-writers care about.

The 2026 frame is therefore not "what attacks do we run." It is "what dataset do we build, who builds it, and how do we keep it fresh."

What Attack Methods Do Red Teamers Use Against LLMs in 2026?

The taxonomy of attack methods has expanded along three axes: turn count, modality, and surface. Single-turn jailbreaks have not gone away; they are the lower bound. But the work that breaks current frontier models lives further out.

Multi-turn and many-shot attacks

Two attack styles now dominate the multi-turn surface:

Crescendo (Russinovich, Salem, and Eldan, 2024). A Microsoft Research attack that gradually escalates a dialogue across multiple turns. It surpasses prior single-turn SOTA by 29–61% on GPT-4 and 49–71% on Gemini-Pro on the AdvBench subset, and works on Claude, Llama-2, Llama-3, Gemini, and ChatGPT variants.
Tree-of-Attacks-with-Pruning (TAP) (Mehrotra et al., NeurIPS 2024). Uses an attacker LLM and an evaluator LLM in a tree search to jailbreak GPT-4-Turbo and GPT-4o on more than 80% of prompts using only black-box access. Finds 16% more jailbreaks than the prior PAIR method while issuing 60% fewer queries.

Then there is many-shot. Anthropic's many-shot jailbreaking work at NeurIPS 2024 exploits long context windows by including hundreds of faux dialogues in a single input prompt. At 256 shots, the technique succeeds on tasks where 5 shots fails completely, and works across Claude, GPT-3.5, GPT-4, Llama-2 70B, and Mistral 7B.

The Muse Spark safety report (Meta's public documentation for its newest frontier model) shows the same dynamic in action: under adaptive multi-turn attackers, Muse Spark lags Claude Opus 4.6 and GPT-5.4 with a 44.6% attack success rate against 31.7% and 37.6%.

The single-turn refresh is real, but the bigger story is that any honest 2026 red teaming dataset has to cover dialogue, not just prompts.

Agentic prompt injection

Three NeurIPS 2024 benchmarks now define this surface:

AgentDojo (Debenedetti et al., 2024). A dynamic environment with 97 realistic test cases and 629 security test cases for tool-calling AI agents across 70 tools and 27 injection targets. Headline finding: state-of-the-art LLMs solve fewer than 66% of tasks even without any attack at all.
AgentHarm (Andriushchenko et al., 2024, Gray Swan AI + UK AISI). 110 base malicious agentic tasks (440 with augmentations) across 11 harm categories and 104 distinct tools. Headline finding: leading LLMs are surprisingly compliant with malicious instructions in agent settings even without jailbreaking, and simple universal jailbreak templates retain coherent multi-step capability after attack.
SHADE-Arena (Kutasov et al., 2025, used in the Muse Spark report). Tests whether agents can evade monitoring while pursuing a hidden secondary objective.

Beyond academic benchmarks, the security industry has converged on the same warning. OWASP has ranked prompt injection #1 on its LLM Top 10 for the second consecutive year. Cisco and others now call it the new SQL injection, and the comparison is structurally accurate: a layer-violation problem where data the system was supposed to process becomes instructions the system executes — the layer that turns tool misuse and unauthorized data access from theoretical risks into reproducible exploits.

OWASP's own conclusion is that complete prevention isn't currently feasible.

Multimodal and multilingual surfaces

Visual jailbreaks now exploit several distinct mechanisms:

Typographic attacks like FigStep (Gong et al., AAAI 2025) render harmful queries as images to bypass text filters.
Generative attacks (HADES, VRP) and bi-modal attacks (BAP, JPS) show that visual instruction tuning can weaken the backbone LLM's safety alignment.

The multilingual surface is structural rather than incidental. Multiple studies converge on the same pattern:

Yong, Menghini, and Bach (2024) showed that translating unsafe English inputs into low-resource languages bypasses GPT-4 safeguards 79% of the time on AdvBench, on par with or surpassing state-of-the-art jailbreaking attacks.
Deng and colleagues (2024) reported that low-resource languages are roughly three times more likely to elicit harmful content than high-resource ones.
The ALERT benchmark's M-ALERT extension finds substantial defense-success-rate drops outside English across 5 languages and ALERT's 6 macro / 32 micro harm categories.

This is not an edge case. It is the predictable consequence of safety training data inequality, and it shows up every time a deployer launches a product in a market it didn't fine-tune for.

Why Are Public Adversarial Datasets Losing Reliability?

The Muse Spark 19.8% figure is the cleanest single piece of evidence, but it sits inside a larger pattern that has been building across the benchmark literature. Three findings bracket the problem:

Construct validity is failing across the field. Hardy, Reuel, and colleagues' systematic review of 445 LLM benchmarks at NeurIPS 2025, coordinated by 29 expert reviewers across leading ML and NLP venues, documents construct-validity failures across the field: vague task definitions, missing statistical tests, and repurposed datasets that undermine the validity of resulting claims. The review covers capability and safety benchmarks together, and the failure modes overlap.
Static scores understate real-world risk. HarmBench (Mazeika et al., ICML 2024) analyzed 18 attacks against 33 LLMs. No single model defends against the broad range of attacks. Scaling within a model family from 7B to 70B parameters does not guarantee improved robustness. And most pointedly: single-turn automated attacks like AutoDAN, GCG, and PAIR yield reassuringly low attack success rates while multi-turn human red teaming exposes failures up to 75% ASR.
Incidents are rising while RAI evaluations remain rare. Stanford HAI's 2025 AI Index, the annual report from Stanford's Institute for Human-Centered AI tracking the state of the field, reports in its Responsible AI chapter that AI-related incidents rose 56.4% year-over-year to 233 in 2024, while standardized RAI evaluations remain rare among major industrial model developers. The follow-up 2026 report puts the incident count at 362.

The capability-benchmark tables in frontier system cards are full; the safety-benchmark tables are usually empty or selective.

Put the three findings together and the pathology is clear:

The model has learned what an evaluation looks like (Muse Spark).
Many public adversarial datasets have construct-validity problems independent of any model behavior (Reuel et al.).
Static benchmark scores, even on well-constructed benchmarks, understate the failure rate that emerges under adaptive multi-turn pressure (HarmBench).

None of this means public datasets are useless. It means a score on a public adversarial dataset can no longer be the deployment decision. The same erosion is visible in capability evaluation, where domain-specific benchmarks have had to replace saturated public leaderboards. It is just as visible in how scores swing across model families, as our data story on open-weight coding models illustrates.

The argument is structurally identical to what has already happened with public capability benchmarks. Kili's AI Benchmarks Guide covers the capability case in detail; this guide is the adversarial counterpart. The fix in both cases is private, expert-built, traceable data refreshed against the deployer's actual use case.

How Do Frontier Labs Build Private LLM Red Teaming Datasets?

Look at any 2025–2026 system card or preparedness report — the lab-published safety documents that accompany frontier model releases — and you will find a network of named external organizations doing the work. The pattern repeats because the root cause is the same: no internal team alone has the breadth of expertise needed across every harm category.

The Muse Spark report names six external partners, each handling a domain specialty:

SecureBio: biosecurity
Apollo Research: alignment and scheming evaluations
Charlemagne Labs: social engineering
Frontier Design and Faculty: biosecurity consulting
Deloitte: additional review
Irregular: cyber

None of them is reproducing AdvBench.

OpenAI runs a formal Red Teaming Network of its own. The numbers from two of its public reports:

The GPT-4o System Card, the safety report for the model that powers ChatGPT, reports 100+ external red teamers across 29 countries speaking 45 different languages.
The Operator System Card, which documents OpenAI's web-browsing agent, reports 20 countries and 24 languages for the agent surface specifically.

The published process is documented in OpenAI's Approach to External Red Teaming for AI Models and Systems by Ahmad and colleagues. It defines a specific workflow:

Independent assessment
Scope-setting
Recruitment
Model access tiers
Report formats
Iterative deployment

External red teaming is described as particularly valuable for testing quickly evolving AI models.

Anthropic's 2024 update on Policy Vulnerability Testing describes the same shape from a different angle. PVT is in-depth qualitative testing with external subject-matter experts on specific Usage Policy topics. The named structure differs from OpenAI's network, but the underlying premise is identical: domain-expert humans construct probes that no automated attacker would think to write.

Meta's pipeline adds a methodological detail worth flagging. The Muse Spark cascaded attack approach uses a helpful-only model (produced by removing safety training) as a ceiling estimate for what the model is capable of in the absence of refusals, then compares the production model's behavior against that ceiling.

This separates capability from propensity, which is one of the harder distinctions in red teaming and one that public benchmarks routinely confound.

The shared infrastructure underneath all of this is increasingly Inspect, the UK AI Security Institute's open source framework released in May 2024 under MIT license. Inspect now has 50+ contributors including frontier labs and other AISIs, ships 200+ pre-built evaluations, and includes dedicated safeguards categories.

It is the first AI safety testing platform created by a state-backed body to be made freely available, and it has effectively become the de facto layer at which external red-teaming data is run.

The pattern is consistent across labs:

An internal red team
A network of named external domain specialists
A shared evaluation framework
A private adversarial dataset that almost never gets published

What Does Enterprise LLM Red Teaming Require to Keep Up?

Most enterprises deploying production LLMs are not Meta. They cannot stand up a SecureBio. But they face the same regulatory and reputational pressure, and they ship into the same attack surfaces:

Agentic tools and external services
Multi-turn dialogue
Multilingual user bases
Sensitive data domains

The question is what the practical replication looks like.

Microsoft's eight lessons from 100+ products, distilled in the same NeurIPS 2024 paper cited earlier, are the closest thing to a public reference framework. Lesson 5, the human element of AI red teaming is crucial, is the most consequential for enterprise teams. Automation expands coverage, but it cannot replace human ingenuity for prioritization, cultural context, subject-matter expertise, and emotional intelligence.

A serious enterprise red teaming program in 2026 needs five things:

A private adversarial dataset constructed by domain experts, refreshed at a cadence matched to the deployment risk.
Coverage of multi-turn dialogue, not just isolated prompts.
Agentic trajectories where the question is whether the agent's tool selection and recovery were appropriate at each step, not whether a single output was refused.
Multilingual probes constructed by native speakers in the languages the product actually ships into.
A harm taxonomy that maps to the deployer's policy, not to a public benchmark's categories.

Most teams discover the taxonomy point only after their first thorough evaluation. Public adversarial datasets ship with their own categorical schemes:

ALERT's 6 macro / 32 micro categories
Anthropic's Usage Policy categories
OpenAI's content policy categories
NIST AI 600-1's 12 GenAI risk categories
ML Commons AILuminate

None of them is the category set the deployer's compliance team uses. A taxonomy is only useful when it maps to the policy that will be enforced, and that mapping is labeling work: domain experts plus inter-annotator agreement plus calibration.

The same logic applies to severity rubrics. A "harm" label is not enough; production decisions need graded severity, and severity is irreducibly contextual. Inter-annotator agreement on harm severity is genuinely an open methodological problem in the academic literature: different benchmarks report different rates, and there is no clean cross-benchmark systematic study.

This is one of the places where the work has to happen at the deployer's scale, with the deployer's experts, on the deployer's taxonomy. It is also where best practices around layered testing converge: baseline manual prompts, enhanced automated attacks, controlled environments to contain potential harms during testing, and a "break-fix" cycle that feeds findings back to defenders so improved system prompts and AI guardrails actually reach the deployment pipeline.

This is the layer where Kili's first-party work lives. The English/French CommandR+ study from October 2024, Behind the Scenes: Evaluating LLM Red Teaming Techniques, is a direct demonstration of the multilingual finding the academic literature corroborates: the same model has different defense-success rates in different languages, and only native-speaker SMEs catch it cleanly.

The Kili Red Teaming Benchmark is the in-house evaluation surface for this style of work. The point is not the platform; it is that the work product is a labeled adversarial dataset with traceable provenance, not a public score.

What Does Regulation Actually Require?

The compliance bar has crystallized. Three documents define it:

EU AI Act (Regulation 2024/1689). Requires providers of general-purpose AI models with systemic risk to document internal and/or external adversarial testing under Annex XI Section 1, and to conduct and document adversarial testing of the model with a view to identifying and mitigating systemic risks under Article 55. Article 51 presumes systemic risk for any GPAI model trained at more than 10^25 FLOPs of cumulative compute. The trigger is mechanical, the documentation requirement is not optional, and the regulator wants to see process, not just a score.
NIST AI 600-1. The Generative AI Profile of the NIST AI RMF, published in July 2024. Names red teaming as a required risk-measurement activity across 12 GenAI risk categories. The guidance is explicit that models should be evaluated for their capabilities and red-teamed both before and after deployment.
UK and US AI Safety Institute frameworks. The UK AISI's Inspect framework, released in May 2024, is the open-source evaluation infrastructure that most external red teaming data now runs through. The US AI Safety Institute conducts pre-deployment evaluations on frontier models in collaboration with labs.

Beyond those three, the broader governance layer is filling in:

The OECD, EU, UN, and African Union have all published governance frameworks in the past two years.
OWASP's LLM Top 10 supplies the security-side compliance language that auditors increasingly cite.

The compounding effect is that adversarial testing is now a documented, reproducible activity with a paper trail. That favors private datasets with traceable provenance over a screenshot of a leaderboard score.

If a regulator asks how a deployer evaluated systemic risk, the answer needs to identify:

The dataset
The experts who built it
The taxonomy it covered
The attack methods it included
The evaluation framework it ran in

A public adversarial dataset can be part of that answer. It cannot be the whole answer.

What Do You Actually Need to Defend Against?

Defenses have layered out into three tiers, and red-team data validates every one of them. Each tier has its own mitigation strategies, but none of them works without adversarial test cases that reflect the actual threat model.

Model level

Current alignment work to address LLM vulnerabilities at the model level covers:

Reinforcement learning from human feedback (RLHF)
Direct preference optimization (DPO)
RL from AI feedback (RLAIF)
Anthropic's Constitutional AI

The data underneath all of these is graded human preference data plus targeted adversarial training examples. Without a private adversarial dataset to validate against, you cannot tell whether your model-level defense generalizes or just overfits to the public set you trained on. The same caveat applies to fine tuning runs intended to harden a base model: if the eval set is the same shape as the training set, the policy violations you'll catch are the ones you already knew about.

It is also worth noting what import bias looks like at this stage. Bias is a model weakness, and addressing it requires breaking it down into granular vulnerabilities — political, religious, racial, demographic — to design targeted adversarial tests. A flat "bias" label is too coarse to drive any mitigation.

Prompt level

The practical defenses now include:

Instruction hierarchy, which formalizes that system prompts outrank user prompts and tool outputs.
Spotlighting (Hines et al.), for marking untrusted content so the model can distinguish data from instructions.

Both depend on adversarial test cases that look like real injection attempts, not academic templates. The patterns matter: a defense calibrated on textbook examples will generalize poorly to the prompt injection attacks security professionals see in the wild.

System level

The agentic surface has produced its own defense layer:

Sandboxing through Inspect
AgentDojo's defense suite
DRIFT-style integrity monitoring
Tool permission scoping (especially for database access and external services)
Human-in-the-loop checkpoints for high-impact actions

None of these can be evaluated without a labeled set of agentic trajectories: multi-step traces where each step is graded for tool selection appropriateness, recovery behavior, and information leakage. This is where security teams spend the most time in 2026: not on individual prompt refusals, but on whether an agent acting across a deployment pipeline preserves the correct trust boundaries at every step. Excessive agency — agents granted more permissions than the task warrants — is the failure mode that turns a bad prompt into a real incident.

Microsoft's eighth lesson is the right note to close on: the work of securing LLM systems will never be complete. LLM red teaming is not a one-time certification. It is a recurring investment in private adversarial data, refreshed against new attacks, new modalities, and new deployment contexts.

The infrastructure cost is real. The alternative, relying on public scores that the model has learned to recognize, is no longer credible.

The Real Lesson of the 2026 Shift

The story of LLM red teaming since 2022 is a story about what the data has to be. Anthropic's original methodology was a research artifact. The current frontier-lab practice is an institutional one, organized around named external partner networks, shared evaluation frameworks, and private adversarial datasets that almost never get published.

Public benchmarks remain useful as a coarse comparability layer, but the Muse Spark 19.8% figure says directly what the broader benchmark literature has been saying indirectly: a model can recognize a public adversarial dataset, and once it can, the score it produces stops measuring what the score is supposed to measure.

The replication problem for everyone else is that three pressures are converging at the same time:

Regulatory bar. EU AI Act Article 55 wants documented adversarial testing.
Attack surface. AgentDojo, AgentHarm, and the Cisco "new SQL injection" framing tell you the agentic surface is wide open.
Public-benchmark erosion. AdvBench and HarmBench tell you what they can. Native-speaker SMEs in the deployment language tell you what AdvBench cannot.

The 2026 red team is the team that builds, labels, and maintains the dataset that closes those gaps.

What Reliable Adversarial Data Actually Requires

The argument in this guide has been that public adversarial datasets have lost their grip on real LLM security risks, and that the institutional answer — across Meta, OpenAI, Anthropic, and Microsoft — has been to invest in private, expert-built data with traceable provenance.

The same logic applies to enterprises shipping LLM systems into production. Reliable red teaming data is not a one-time artifact. It is an ongoing labeling program: a domain-expert workforce capable of constructing probes a generic crowdworker cannot, native speakers in the languages the product actually ships into, severity rubrics calibrated to the deployer's policy, and an evaluation surface where every adversarial input traces back to the expert who wrote it and the harm category it tests. Teams that want to stand this up themselves can adapt the same method we describe for how to build a custom benchmark, applied to adversarial rather than capability data.

This is the work Kili Technology is built for. Kili's network of 2,000+ verified domain specialists — including math olympiad champions, Lean 4 theorem provers, and practicing attorneys — supports adversarial dataset construction across more than 40 languages, with evaluation as a first-class service rather than a tool bolted onto annotation. For deployers documenting adversarial testing under EU AI Act Article 55 or NIST AI 600-1, the audit trail comes built in. Explore Kili's data labeling services to see how that workflow runs end to end.

Resources

Foundational and Methodology Papers

Red Teaming Language Models to Reduce Harms (Ganguli et al., 2022) – Anthropic's foundational red-teaming-with-crowdworkers methodology
- https://arxiv.org/abs/2209.07858
Challenges in Red Teaming AI Systems (Anthropic, 2024) – Policy Vulnerability Testing and the role of external SMEs
- https://www.anthropic.com/news/challenges-in-red-teaming-ai-systems
Lessons From Red Teaming 100 Generative AI Products (Bullwinkel et al., NeurIPS 2024) – Microsoft's eight institutional lessons
- https://arxiv.org/abs/2501.07238
OpenAI's Approach to External Red Teaming for AI Models and Systems (Ahmad et al., 2025) – Process documentation for OpenAI's Red Teaming Network
- https://arxiv.org/abs/2503.16431

Attack Methods and Adversarial Benchmarks

AgentDojo (Debenedetti et al., NeurIPS 2024) – Dynamic environment for prompt injection in tool-calling agents
- https://arxiv.org/abs/2406.13352
AgentHarm (Andriushchenko et al., 2024) – Benchmark for measuring harmfulness of LLM agents
- https://arxiv.org/abs/2410.09024
Crescendo: Multi-Turn LLM Jailbreak Attack (Russinovich, Salem & Eldan, 2024) – Microsoft's gradual-escalation attack
- https://arxiv.org/abs/2404.01833
Tree of Attacks with Pruning (Mehrotra et al., NeurIPS 2024) – Black-box automated jailbreak via attacker/evaluator LLMs
- https://arxiv.org/abs/2312.02119
Many-Shot Jailbreaking (Anil et al., NeurIPS 2024) – Long-context attack scaling to 256+ shots
- https://www.anthropic.com/research/many-shot-jailbreaking
HarmBench (Mazeika et al., ICML 2024) – Standardized framework for automated red teaming with 18 attacks vs 33 LLMs
- https://arxiv.org/abs/2402.04249
ALERT (Tedeschi et al., 2024) – Comprehensive safety benchmark with 6 macro / 32 micro harm categories
- https://arxiv.org/abs/2404.08676
FigStep: Typographic Visual Jailbreak (Gong et al., AAAI 2025) – Visual jailbreak via rendered-text attacks
- https://arxiv.org/abs/2311.05608
Low-Resource Languages Jailbreak GPT-4 (Yong, Menghini & Bach, 2024) – 79% bypass rate via low-resource translation
- https://arxiv.org/abs/2310.02446
Multilingual Jailbreak Challenges in LLMs (Deng et al., 2024) – Cross-lingual safety inconsistency at scale
- https://arxiv.org/abs/2310.06474

Benchmark Validity and Market Context

Measuring What Matters: Construct Validity in LLM Benchmarks (Hardy, Reuel et al., NeurIPS 2025) – Systematic review of 445 benchmarks
- https://arxiv.org/abs/2511.04703
Stanford HAI 2025 AI Index — Responsible AI Chapter – AI incidents and RAI evaluation adoption data
- https://hai.stanford.edu/ai-index/2025-ai-index-report/responsible-ai

Lab Safety Reports and System Cards

Muse Spark Safety and Preparedness Report (Meta Superintelligence Labs, 2026) – The 19.8% evaluation-awareness finding
- https://ai.meta.com/static-resource/muse-spark-safety-and-preparedness-report/
GPT-4o System Card (OpenAI, 2024) – External red teaming network details
- https://openai.com/index/gpt-4o-system-card/
Operator System Card (OpenAI, 2025) – Agent-specific external red teaming
- https://openai.com/index/operator-system-card/

Regulation and Standards

EU AI Act Annex XI (Regulation 2024/1689) – Adversarial testing documentation requirements for GPAI with systemic risk
- https://artificialintelligenceact.eu/annex/11/
NIST AI 600-1: Generative AI Profile (NIST, 2024) – Red teaming as a required risk-measurement activity
- https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
Inspect AI (UK AI Security Institute, 2024) – Open-source evaluation framework
- https://github.com/UKGovernmentBEIS/inspect_ai
OWASP Top 10 for LLM Applications – Prompt injection #1 for the second consecutive year
- https://owasp.org/www-project-top-10-for-large-language-model-applications/

Industry Commentary

Prompt Injection Is the New SQL Injection (Cisco Blogs, 2026) – Security-side framing of the agentic attack surface
- https://blogs.cisco.com/ai/prompt-injection-is-the-new-sql-injection-and-guardrails-arent-enough

Kili Resources

Behind the Scenes: Evaluating LLM Red Teaming Techniques (Kili, October 2024) – English/French CommandR+ study
- https://kili-technology.com/blog/behind-the-scenes--evaluating-llm-red-teaming-techniques-and-categories
Kili Red Teaming Benchmark – In-house evaluation surface for adversarial datasets
- https://llmbenchmark.kili-technology.com
AI Benchmarks Guide 2026 (Kili) – Capability-benchmark counterpart to this guide
- https://kili-technology.com/blog/ai-benchmarks-guide-the-top-evaluations-in-2026-and-why-theyre-not-enough

‍

Frequently Asked Questions

What is LLM red teaming?

LLM red teaming is the practice of systematically probing large language models for vulnerabilities — safety failures, harmful outputs, jailbreaks, factual errors, bias, and unexpected behaviors. Red teamers craft adversarial inputs designed to expose weaknesses that standard evaluation doesn't catch.

Why is LLM red teaming important in 2026?

As LLMs are deployed in higher-stakes contexts (healthcare, finance, legal, government), the consequences of safety failures have grown. Simultaneously, public adversarial datasets have become less reliable because models are increasingly trained to pass them. Private, continuously updated red teaming is now a regulatory and operational necessity.

What attack methods do red teamers use?

Common techniques include prompt injection, jailbreaking through role-play or hypothetical framing, multi-turn manipulation, encoding attacks (base64, character substitution), context window poisoning, and adversarial suffixes. Red teamers also test for bias, hallucination under pressure, and information leakage.

What regulations require LLM red teaming?

The EU AI Act requires risk assessments and adversarial testing for high-risk AI systems. The US Executive Order on AI (October 2023) mandated red teaming for frontier models. NIST's AI Risk Management Framework includes adversarial testing guidance. Industry-specific regulators in finance and healthcare increasingly expect documented adversarial evaluation.

How do frontier labs build red teaming datasets?

Frontier labs maintain private red teaming teams (internal and contracted) who continuously generate novel adversarial prompts. These datasets are never published to prevent models from being trained to pass them. Labs like Anthropic, OpenAI, and Google DeepMind also run external red teaming programs with domain specialists and security researchers.

Can enterprises do their own LLM red teaming?

Yes, and increasingly they should. Enterprise red teaming focuses on domain-specific risks: can the model be tricked into revealing confidential information, generating harmful advice in your specific context, or producing biased outputs for your user population? This requires labeled adversarial datasets built by people who understand your deployment context.

What does Kili Technology offer for red teaming?

Kili Technology provides the data infrastructure for building and managing adversarial evaluation datasets. Teams use the platform to create labeled prompt-response pairs for red teaming, run multi-reviewer quality workflows on safety evaluations, and maintain versioned adversarial test sets that evolve as models improve.

Build Adversarial Evaluation Data That Keeps Up With Your Models

Kili Technology provides the annotation and evaluation infrastructure for building, managing, and iterating on adversarial test sets. Create labeled red teaming datasets with expert reviewers, track model safety over time, and maintain the evaluation rigor that production AI demands.

Start building your red teaming pipeline →

Subscribe for updates

Stay updated with the latest news, articles and update directly into your box

July 22, 2026

Kimi K3's Benchmarks and Hallucinations — What That Tells Us About AI Evaluation

Kimi K3 ranked third on the AI Intelligence Index while its hallucination rate hit 51%. Here is what that paradox reveals about how the industry evaluates models.

Kili Technology

AI Evaluation

Foundation Models

July 15, 2026

Best On-Premise Data Labeling Platforms for Regulated Industries [2026] Guide

Compare the best on-premise data labeling platforms for defense, healthcare, and finance in 2026. This guide evaluates secure deployment models, certifications (SOC 2, ISO 27001, HIPAA), air-gapped operations, and quality-at-scale for teams labeling sensitive AI training data.

Kili Technology

Data Labeling

July 15, 2026

Introduction EU AI Act: What Every AI Team Needs to Know Before August 2026

The EU AI Act regulates AI applications by risk level, assigning obligations to every organisation that develops or deploys AI systems affecting people in the EU. This guide covers what the Act requires, who is in scope, which use cases are affected, and the enforcement timeline your team should be working against.

Kili Technology

Foundation Models

AI Evaluation

Data Labeling

LLM Red Teaming in 2026: How Frontier Labs Stress-Test AI Models (And Why Public Benchmarks Are No Longer Enough)

Table of contents

AI Summary

Introduction

What Is LLM Red Teaming?

How does red teaming look different in 2026 than in 2022?

What Attack Methods Do Red Teamers Use Against LLMs in 2026?

Multi-turn and many-shot attacks

Agentic prompt injection

Multimodal and multilingual surfaces

Why Are Public Adversarial Datasets Losing Reliability?

How Do Frontier Labs Build Private LLM Red Teaming Datasets?

What Does Enterprise LLM Red Teaming Require to Keep Up?

What Does Regulation Actually Require?

What Do You Actually Need to Defend Against?

Model level

Prompt level

System level

The Real Lesson of the 2026 Shift

What Reliable Adversarial Data Actually Requires

Resources

Foundational and Methodology Papers

Attack Methods and Adversarial Benchmarks

Benchmark Validity and Market Context

Lab Safety Reports and System Cards

Regulation and Standards

Industry Commentary

Kili Resources

Frequently Asked Questions

What is LLM red teaming?

Why is LLM red teaming important in 2026?

What attack methods do red teamers use?

What regulations require LLM red teaming?

How do frontier labs build red teaming datasets?

Can enterprises do their own LLM red teaming?

What does Kili Technology offer for red teaming?

Build Adversarial Evaluation Data That Keeps Up With Your Models

Subscribe for updates

Related articles

Kimi K3's Benchmarks and Hallucinations — What That Tells Us About AI Evaluation

Best On-Premise Data Labeling Platforms for Regulated Industries [2026] Guide

Introduction EU AI Act: What Every AI Team Needs to Know Before August 2026

Ready when you are. Start your free trial.