AI Summary
- DeepSeek V4-Pro and V4-Flash were pre-trained on 33T and 32T tokens respectively — more than double V3's 14.8T — with deliberate emphasis on long documents and agentic execution traces
- CSA/HCA cuts KV cache to 10% of V3.2 levels at 1M context, making long-form training data economically viable to exploit at scale
- The post-training pipeline replaces reinforcement learning with On-Policy Distillation from 10+ domain specialists — quality of each specialist's training data directly determines the unified model's final capabilities
- mHC is confirmed in V4 as described in the December 2025 paper; Engram conditional memory is absent from V4
- A Codeforces rating of 3206 for V4-Pro-Max makes it the first open model to match closed-source frontier systems on competitive programming
Editor's Note (May 2026): This article has been substantially updated following the official release of the DeepSeek V4 technical report. An earlier version of this piece analysed three research papers — on Engram conditional memory, manifold-constrained hyper-connections (mHC), and DeepSeek Sparse Attention — that were widely expected to form V4's architectural basis. The confirmed architecture diverges significantly from those anticipations: Engram is absent, the attention mechanism is an original CSA/HCA hybrid rather than the leaked "MODEL1" sparse attention, and the post-training pipeline replaces reinforcement learning entirely with On-Policy Distillation. Sections that were speculative have been rewritten with confirmed specifications and benchmarks.
DeepSeek V4's technical report reveals a training data strategy built for million-token contexts: 32T+ tokens, specialist domain experts trained independently, and a distillation pipeline that merges ten teacher models into one. This analysis breaks down what the data decisions mean for practitioners building their own AI systems.
Introduction
The DeepSeek V4 technical report, released with open model checkpoints, settles months of speculation with concrete numbers. DeepSeek-V4-Pro carries 1.6 trillion parameters with 49 billion activated per token. DeepSeek-V4-Flash runs at 284 billion parameters with 13 billion activated. Both support one-million-token context windows. And they were pre-trained on 33 trillion and 32 trillion tokens of curated data, respectively.
That last number deserves attention. DeepSeek V3 trained on 14.8 trillion tokens. V4 more than doubled the training corpus while simultaneously redesigning the architecture to handle sequences 15x longer than its predecessor's practical limit. The cost of processing those tokens at million-token scales would have been prohibitive under V3's architecture — which is precisely why V4 exists.
The architectural headlines — Compressed Sparse Attention, Heavily Compressed Attention, Manifold-Constrained Hyper-Connections, the Muon optimizer — are engineering achievements that make the data usable at scale. At a million tokens of context, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache relative to DeepSeek-V3.2. But architecture enables; data determines. The more revealing story is in how DeepSeek structured its 32 trillion tokens, how it trained separate domain specialists before merging them, and what its evaluation methodology tells us about where model quality actually comes from.
Note on earlier coverage: Our March 2026 article analysed three research papers — on Engram conditional memory, mHC, and DeepSeek Sparse Attention — that were widely expected to form V4's architectural basis. The actual technical report confirms that mHC and a form of sparse attention made it into V4. Engram did not. The memory-compute separation thesis from that earlier piece remains architecturally interesting, but V4 took a different path: instead of separating knowledge from reasoning at the parameter level, DeepSeek separated them at the training pipeline level — building distinct domain experts and merging their capabilities through distillation. The data implications, as it turns out, are just as profound.
Key Takeaways
- V4 pre-trains on 32T+ tokens with deliberate emphasis on long-document data — scientific papers and technical reports that carry unique academic value at million-token scales
- DeepSeek filters web data to remove auto-generated and templated content, explicitly to prevent model collapse from low-diversity training distributions
- Post-training replaces mixed reinforcement learning with a two-stage pipeline: independent domain specialists (math, coding, agent, instruction following) trained via SFT and RL, then merged through on-policy distillation across 10+ teacher models
- A Generative Reward Model replaces scalar reward models for hard-to-verify tasks, cutting the dependence on large-scale human annotation while still requiring rubric-guided evaluation data
- Real-world evaluation on some 200 internal R&D tasks from 50+ engineers shows V4-Pro-Max approaching Claude Opus 4.5 on practical coding work — not just benchmarks
What Changed in DeepSeek V4's Pre-Training Data Compared to V3?
DeepSeek V3 built its 14.8 trillion-token corpus with an emphasis on mathematical and programming content; the related Coder-V2 variant went further, training on a mix of 60% code, 10% math, and 30% natural language. V4 starts from that foundation and pushes in four directions.
The first is scale: 32 trillion tokens for V4-Flash, 33 trillion for V4-Pro. The V4 technical report describes a corpus comprising math, code, web pages, long documents, and other high-quality categories.
The second is long-document curation. V4 places particular emphasis on scientific papers, technical reports, and materials that the report describes as reflecting "unique academic values." This isn't generic web scraping extended to longer sequences. DeepSeek specifically curated documents whose value increases with length — the kind of material where a model genuinely needs to track arguments, cross-reference findings, and synthesise conclusions across tens of thousands of tokens. Shorter, repetitive web content provides diminishing returns at million-token training lengths. Dense academic text does not.
The third is quality filtering against model collapse. V4 implements strategies to remove batched auto-generated and templated content from web-sourced data. This directly addresses a documented risk: when language models train on their own outputs (or outputs of similar systems), performance degrades across generations. DeepSeek cites the model collapse research and treats it as an engineering constraint, not a theoretical concern. The filtering pipeline is a data quality decision with architectural consequences — a model trained on recycled AI-generated text would lose exactly the diversity that makes 32 trillion tokens valuable.
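The report stops short of describing the filtering pipeline itself. The standard machinery for catching batched auto-generated and templated text is near-duplicate detection over shingled documents; below is a minimal MinHash sketch of that pattern, in which every constant, function name, and threshold is an illustrative assumption rather than DeepSeek's implementation.

```python
import hashlib

NUM_HASHES = 64       # signature length, i.e. number of seeded hash functions (illustrative)
SHINGLE_SIZE = 5      # word n-gram size (illustrative)
DUP_THRESHOLD = 0.8   # estimated Jaccard similarity that flags a near-duplicate

def shingles(text: str, n: int = SHINGLE_SIZE) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text: str) -> list[int]:
    """One minimum per seeded hash function over the document's shingles."""
    sh = shingles(text)
    if not sh:
        return [0] * NUM_HASHES
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in sh
        )
        for seed in range(NUM_HASHES)
    ]

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """The fraction of matching signature slots estimates shingle-set overlap."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def is_templated_duplicate(doc: str, seen_signatures: list[list[int]]) -> bool:
    """Flag documents whose shingle sets nearly match an already-seen one --
    the fingerprint of batched template generation."""
    sig = minhash_signature(doc)
    return any(estimated_jaccard(sig, s) >= DUP_THRESHOLD for s in seen_signatures)
```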
The fourth is task-specific data injection during training itself. DeepSeek incorporates agentic data during what it calls the "mid-training phase" — a period where the model transitions from general pre-training to more structured capability building. This means tool-use patterns, multi-step workflows, and environment-interaction sequences are baked into the base model before post-training begins, rather than being layered on exclusively through fine-tuning. V4 also expands its multilingual corpus to capture long-tail knowledge across different cultures.
One procedural change matters for data utilisation: V4 uses sample-level attention masking during pre-training. Documents from different sources are packed into sequences to minimise truncation, but the model cannot attend across document boundaries within a packed sequence. This prevents the model from learning spurious cross-document associations while maximising the number of useful tokens per training step.
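The masking itself is simple to express once document boundaries are tracked during packing. A minimal PyTorch sketch (the function name and shapes are ours, not the report's): the mask is causal within each document and blocked everywhere else.

```python
import torch

def packed_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Attention mask for a packed sequence.

    doc_ids: (seq_len,) integer tensor mapping each position to the document
             it came from, e.g. [0, 0, 0, 1, 1, 2, ...].
    Returns a (seq_len, seq_len) boolean mask where True = may attend.
    Positions attend causally within their own document and never across
    document boundaries, so no spurious cross-document associations form.
    """
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Three documents of lengths 3, 2, and 2 packed into one sequence:
mask = packed_causal_mask(torch.tensor([0, 0, 0, 1, 1, 2, 2]))
# mask[3, 2] is False: the first token of doc 1 cannot see the end of doc 0.
```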
Why Does DeepSeek Train Domain Specialists Separately Before Merging Them?
This is the most consequential data decision in the V4 pipeline, and it departs from how most frontier labs describe their post-training.
V4's post-training follows a two-stage paradigm. In the first stage, DeepSeek trains independent specialist models — one for mathematics, one for coding, one for agent tasks, one for instruction following, and others across additional domains. Each specialist starts from the same pre-trained base and goes through its own Supervised Fine-Tuning (SFT) on domain-specific data, followed by Reinforcement Learning using Group Relative Policy Optimization (GRPO) with domain-tailored reward signals. The result is a collection of models that each excel in their respective fields.
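GRPO's defining move is scoring each sampled completion against its group of siblings rather than against a learned value function. A stripped-down sketch of that advantage computation, omitting the clipping and KL-penalty terms of the full published objective:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: (group_size,) scalar rewards for G sampled completions of the
             same prompt, e.g. from a verifier or rubric-based grader.
    Each completion's advantage is its reward standardised against the
    group; no critic network is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Eight completions of one maths problem, rewarded 1.0 if the final
# answer verifies and 0.0 otherwise:
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
adv = grpo_advantages(rewards)  # positive for correct samples, negative otherwise
```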
In the second stage, these specialists become teachers. A single unified student model learns from all of them simultaneously through On-Policy Distillation (OPD). The student generates its own outputs, then optimises against the full logit distributions of whichever teacher is most relevant to the current task context. More than ten teacher models participate in this process.
The data implications are significant. Each specialist requires its own high-quality SFT dataset: verified mathematical proofs for the math expert, executable code solutions for the coding expert, multi-step tool-interaction traces for the agent expert, precisely formatted responses for the instruction-following expert. These datasets don't need to be balanced against each other — each specialist trains on whatever distribution maximises its domain performance. The balancing happens during distillation, where the student learns to route between experts implicitly.
This approach also changes the economics of data curation. Traditional mixed RL requires a single dataset that somehow represents all capabilities simultaneously, which creates competing pressures: more reasoning data improves math but may hurt style; more code data helps programming but can degrade conversation. Specialist training eliminates these tradeoffs. Each data pipeline optimises for one thing, independently, without compromise.
How Does the Generative Reward Model Change the Annotation Requirements?
For tasks with clear verification signals — math problems with deterministic answers, code that compiles and passes tests — reinforcement learning is straightforward. The reward function is a program. No human annotator needed.
For everything else, V4 introduces a Generative Reward Model (GRM). Instead of training a separate scalar reward model from human preference data, DeepSeek curates rubric-guided RL data and uses the model itself as the evaluator. The actor network functions as both the generator and the judge, scoring its own outputs against structured rubrics.
This doesn't eliminate human involvement — it restructures it. The rubric creation requires domain expertise: someone who understands what constitutes a good legal summary, a well-structured financial analysis, or a nuanced medical explanation needs to define the evaluation criteria. But once the rubric exists, the GRM can evaluate thousands of trajectories without per-sample human scoring. DeepSeek describes this as achieving "superior performance with only a minimal set of diverse human annotations."
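The report does not reproduce an actual rubric, so the sketch below only illustrates the pattern: a domain expert writes evaluation criteria once, and the model judges every trajectory against them. The rubric text, prompt layout, and scoring scheme here are all assumptions.

```python
RUBRIC = """\
Evaluate the response against each criterion. Score each 0-2.
1. Cites the controlling authority accurately.
2. States the holding without overclaiming.
3. Flags open questions rather than guessing.
Return one line per criterion: '<id>: <score> -- <one-sentence justification>'."""

def build_grm_prompt(task: str, response: str, rubric: str = RUBRIC) -> str:
    """Assemble a rubric-guided judging prompt. The generator model itself is
    run on this prompt, and its per-criterion scores become the RL reward."""
    return f"Task:\n{task}\n\nCandidate response:\n{response}\n\nRubric:\n{rubric}\n"

def parse_reward(judge_output: str, num_criteria: int = 3, max_per: int = 2) -> float:
    """Turn the judge's per-criterion scores into a normalised reward in [0, 1]."""
    scores = []
    for line in judge_output.splitlines():
        head = line.split("--")[0]          # '<id>: <score> '
        parts = head.split(":")
        if len(parts) == 2 and parts[1].strip().isdigit():
            scores.append(min(int(parts[1].strip()), max_per))
    if len(scores) != num_criteria:
        return 0.0                          # malformed judgments earn no reward
    return sum(scores) / (num_criteria * max_per)
```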
The shift matters for data-labelling practitioners. Traditional RLHF scales linearly: more preference pairs require more annotator hours. GRM front-loads the expertise into rubric design, then amortises that investment across a much larger set of evaluations. The annotation task changes from "which response is better?" repeated thousands of times to "what does 'better' mean for this domain?" answered once, precisely, by someone qualified to answer it.
The V4 report notes that the GRM's evaluative and generative capabilities are jointly optimised — the model gets better at judging as it gets better at generating. This creates a feedback loop where reasoning capabilities are "inherently fused into the evaluative process." The quality of the initial rubric data sets the ceiling for this loop.
What Does V4's On-Policy Distillation Reveal About Training Data Quality?
DeepSeek made a specific engineering choice in how it runs distillation that reveals a data quality conviction. Prior approaches to on-policy distillation typically simplify the full-vocabulary KL divergence into a token-level estimate at each position. This is computationally cheap but produces high-variance gradient estimates and frequent training instability.
V4 uses full-vocabulary logit distillation instead. At every token position, the student compares its entire output distribution against the teacher's entire output distribution — all 128,000+ vocabulary entries. This is expensive. DeepSeek built an entire infrastructure subsystem to make it tractable: teacher weights offloaded to distributed storage and loaded on demand; last-layer hidden states cached and prediction heads reconstructed on the fly; training samples ordered by teacher index so each teacher head loads only once per mini-batch.
The engineering cost is justified because the data signal is better. Full-vocabulary distillation captures the teacher's complete knowledge at each step: which tokens it considers plausible, how probability mass distributes across alternatives, where the teacher is confident versus uncertain. Token-level estimates discard this information, keeping only whether the student's top prediction matches the teacher's. For a student learning from ten specialists simultaneously, the full distribution carries exactly the nuanced information needed to learn when to reason like the math expert, when to generate like the coder, and when to follow instructions like the style specialist.
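In code, the difference is whether the loss sees the whole distribution or a single sampled token. A minimal PyTorch sketch of the full-vocabulary form, with teacher logits obtained by running the relevant teacher on the student's own samples; the reverse-KL direction shown is a common choice for on-policy distillation, not necessarily DeepSeek's exact objective.

```python
import torch
import torch.nn.functional as F

def full_vocab_distill_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor) -> torch.Tensor:
    """On-policy distillation loss over the full vocabulary.

    student_logits, teacher_logits: (batch, seq_len, vocab_size) logits
    evaluated on sequences the *student* generated. At every position the
    student's entire distribution is pulled toward the teacher's, so the
    gradient carries the teacher's confidence and its ranking of
    alternatives, not just its argmax.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL(student || teacher), summed over the vocabulary exactly,
    # then averaged over positions:
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
    return kl.mean()

# A token-level estimate, by contrast, keeps only the sampled token's
# log-prob ratio at each position: cheap, but a single-sample Monte Carlo
# estimate of the same KL -- hence the high gradient variance.
```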
The quality threshold here is unforgiving. If a specialist teacher produces unreliable outputs — a math expert that occasionally gets proofs wrong, a coding expert that generates subtly buggy solutions — the full logit distribution faithfully transmits that unreliability to the student. Bad teacher data doesn't get averaged away; it gets precisely encoded.
How Does V4 Handle the Three Reasoning Effort Modes?
V4 introduces a structured approach to reasoning effort that has direct data implications. Both V4-Pro and V4-Flash support three modes: Non-think (fast, intuitive responses), Think High (conscious logical analysis with explicit reasoning), and Think Max (extended reasoning with no shortcuts).
Each mode was trained with distinct RL configurations — different length penalties, different context windows. Think Max uses a 384K-token context window during evaluation, compared to 8K for Non-think. The three modes were trained as separate specialists and merged into a single model, distinguished at inference by response-format conventions (the <think> and </think> tokens) and, for Think Max, a specific system prompt instruction that demands exhaustive deliberation.
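Collected as a config, the mode descriptions in the report look roughly like the sketch below. The 8K and 384K context windows and the think-block convention are reported; the Think High window, the length-penalty values, and the system-prompt wording are unpublished and left as placeholders.

```python
from dataclasses import dataclass

@dataclass
class ReasoningModeConfig:
    """One entry per reasoning-effort mode. Fields set to None or marked
    'placeholder' are not published in the V4 report."""
    name: str
    eval_context_window: int | None  # tokens; None where the report gives no figure
    emits_think_block: bool          # wraps reasoning in <think>...</think>
    length_penalty: float | None     # differs per mode per the report; values unpublished
    system_suffix: str               # placeholder wording for Think Max

MODES = {
    "non_think": ReasoningModeConfig("Non-think", 8_000, False, None, ""),
    "think_high": ReasoningModeConfig("Think High", None, True, None, ""),
    "think_max": ReasoningModeConfig(
        "Think Max", 384_000, True, None,
        "Deliberate exhaustively; leave no assumption unchecked.",  # placeholder
    ),
}
```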
This means the training data for each mode has different characteristics. Non-think data rewards concise, correct answers without reasoning traces. Think High data rewards structured chain-of-thought with reliable conclusions. Think Max data rewards exhaustive exploration: documenting rejected hypotheses, stress-testing edge cases, and leaving no assumption unchecked. The V4 report shows the payoff — on the hardest benchmarks like Apex Shortlist, Think Max scores 90.2% versus Think High's 85.5%. On HLE (Humanity's Last Exam), the gap is 37.7% (Max) versus 34.5% (High).
The existence of three distinct data tracks for reasoning effort reflects an assumption about deployment: different tasks warrant different computational budgets, and a single model should be able to operate across all of them. The data curation challenge is that "think harder" isn't just "think longer" — it requires training examples that demonstrate genuinely deeper analysis, not merely more verbose analysis.
What Do V4's Real-World Evaluations Tell Us About Benchmark Limitations?
The V4 report is unusually candid about the gap between benchmarks and real usage. After presenting standard benchmark results — V4-Pro-Max scores 57.9% on SimpleQA Verified and 93.5% on LiveCodeBench, with a Codeforces rating of 3206 — the report devotes an entire section to real-world performance that standard benchmarks miss.
For Chinese writing, DeepSeek ran pairwise comparisons across 3,170 functional writing tasks and 2,837 creative writing tasks. V4-Pro beat Gemini-3.1-Pro with a 62.7% win rate on functional writing and 77.5% on writing quality for creative tasks. But on the hardest subset — complex instruction following and multi-turn writing — Claude Opus 4.5 beat V4-Pro with a 52.0% win rate. Standard benchmarks don't distinguish between easy and hard writing scenarios; real evaluation does.
For coding, DeepSeek curated approximately 200 tasks from 50+ internal engineers across PyTorch, CUDA, Rust, and C++ — feature development, bug fixing, refactoring, and diagnostics from actual R&D work. After quality filtering, 30 tasks formed the evaluation set. V4-Pro-Max scored a 67% pass rate, compared to Claude Sonnet 4.5's 47% and Claude Opus 4.5's 70%. In a survey of 85 DeepSeek developers using V4-Pro daily, 52% said it was ready to serve as their primary coding model, 39% leaned yes, and fewer than 9% said no. Respondents noted "trivial mistakes, misinterpretation of vague prompts, and occasional over-thinking."
For white-collar enterprise tasks, DeepSeek built a suite of 30 advanced Chinese professional workflows spanning 13 industries — finance, education, law, technology — requiring document generation, information analysis, and document editing. Human evaluators scored V4-Pro-Max higher than Claude Opus 4.6 Max on task completion (98.32 vs 96.68), content quality (83.32 vs 78.00), and formatting aesthetics (76.68 vs 72.68), while trailing on instruction following (87.76 vs 88.88).
These evaluations matter because they surface failure modes that MMLU and HumanEval never capture. V4's real-world evaluation infrastructure — internal engineering tasks, pairwise expert assessments, multi-dimensional scoring rubrics — is itself a data quality artifact that determines what gets optimised and what gets missed.
How Does V4's Agentic Training Data Shape Its Tool-Use Capabilities?
V4 introduces agentic data during pre-training, a shift from the typical approach of adding tool-use capabilities only during fine-tuning. The mid-training injection means the base model already understands multi-step tool interaction patterns before post-training begins.
The agent specialist then refines these capabilities through SFT and RL. V4 introduces a new XML-based tool-call schema using a special |DSML| token, replacing the JSON format used in earlier versions. The report notes that XML format reduces escaping failures and tool-call errors — a practical data formatting decision that affects error rates in production.
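The report doesn't reproduce the full DSML schema, so the comparison below is purely illustrative of why an XML-style format cuts escaping failures; the tag names are invented, not the actual schema.

```python
# Tool arguments that contain quotes, braces, or newlines must be escaped
# inside JSON strings, and models frequently get that escaping subtly wrong:

json_call = '{"tool": "run_python", "args": {"code": "print(\\"hi\\nthere\\")"}}'

# In an XML-style form (hypothetical tags -- NOT the real DSML schema), the
# payload travels as raw text between tags, so the model emits code and
# prose verbatim instead of re-escaping every quote and newline:

xml_call = """<tool_call name="run_python">
<arg name="code">
print("hi\nthere")
</arg>
</tool_call>"""
```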
V4 also introduces "Quick Instruction" tokens: special tokens appended to input sequences that trigger auxiliary tasks in parallel — generating search queries, determining whether web search is needed, classifying intent. These tokens allow the model to handle auxiliary routing tasks by reusing the already-computed KV cache, eliminating redundant prefilling that previously required a separate small model. The training data for these tokens is a specialised dataset of task-specific signals.
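The mechanism is the standard incremental-decoding pattern: prefill the long context once, then spend one decode step per auxiliary task against the same cache. A sketch in Hugging Face transformers style; the checkpoint name and the quick-instruction token IDs are placeholders.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-causal-lm"  # placeholder: any Hugging Face causal LM checkpoint
model = AutoModelForCausalLM.from_pretrained(MODEL)
tok = AutoTokenizer.from_pretrained(MODEL)

# Expensive part: one prefill pass over the long user context.
prompt = tok("...long user context...", return_tensors="pt")
with torch.no_grad():
    prefill = model(**prompt, use_cache=True)

# Hypothetical quick-instruction token IDs; the real IDs are not published.
QUICK_TOKENS = {"needs_web_search": 128101, "intent_class": 128102}

with torch.no_grad():
    for task, token_id in QUICK_TOKENS.items():
        # One extra decode step per auxiliary task, reusing the cached
        # context instead of re-prefilling it with a separate small model.
        step = model(
            input_ids=torch.tensor([[token_id]]),
            past_key_values=copy.deepcopy(prefill.past_key_values),
            use_cache=True,
        )
        routing_logits = step.logits[0, -1]  # auxiliary prediction read-out
```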
On agent benchmarks, V4-Pro-Max achieves 80.6% on SWE-Verified (matching GPT-5.4's 80.6% and Claude Opus 4.6's 80.8%), 67.9% on Terminal Bench 2.0, and strong results on MCPAtlas and Toolathlon — benchmarks specifically testing generalisation across diverse tools and MCP services. The report notes this generalisation signals that V4 "does not perform well only on internal frameworks," addressing a common concern about agent evaluation validity.
For interleaved thinking in agentic workflows, V4 makes a specific data architecture decision: in tool-calling scenarios, all reasoning content is preserved across the entire conversation, including across user message boundaries. In general conversational scenarios, reasoning from previous turns is discarded. This distinction creates two different data generation patterns for the same model.
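As a context-assembly step, the two retention rules reduce to a small function. The message format below is assumed; only the rules themselves and the <think> delimiters come from the report.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def prepare_history(turns: list[dict], agentic: bool) -> list[dict]:
    """Rebuild model context under V4's two retention rules.

    turns: [{"role": "user"|"assistant"|"tool", "content": str}, ...],
    where assistant turns may embed <think>...</think> blocks.
    """
    if agentic:
        return turns  # tool-calling: reasoning persists across all turns
    return [          # plain chat: strip reasoning from earlier turns
        {**t, "content": THINK_BLOCK.sub("", t["content"]).strip()}
        if t["role"] == "assistant" else t
        for t in turns
    ]
```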
What Does the Infrastructure Tell Us About Data at Scale?
Several infrastructure decisions in the V4 report have direct data implications.
FP4 Quantization-Aware Training applies MXFP4 quantization to MoE expert weights and the indexer QK path during post-training. The model trains against quantised weights, so the final model's behaviour during deployment matches what was optimised during RL. This matters because the data the model sees during training and the precision at which it processes that data are now explicitly aligned — a source of evaluation-deployment mismatch that many labs leave unaddressed.
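A sketch of what quantisation-aware training over MXFP4-style weights involves, following the general MX recipe (blocks of 32 values sharing one power-of-two scale, each value rounded to the FP4 E2M1 grid) rather than DeepSeek's actual kernels:

```python
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def mxfp4_fake_quant(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Straight-through MXFP4-style fake quantization for QAT.

    During QAT the forward pass sees these quantized weights, so the RL
    stage optimises the model that will actually be served. Assumes
    w.numel() is divisible by `block`.
    """
    grid = FP4_GRID.to(w.device, w.dtype)
    x = w.reshape(-1, block)
    # Shared scale: fit the block's max magnitude under FP4's top code (6.0).
    amax = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = 2.0 ** torch.ceil(torch.log2(amax / 6.0))
    scaled = x / scale
    # Round to the nearest representable FP4 magnitude, keeping the sign.
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * scaled.sign() * scale
    # Straight-through estimator: forward sees q, gradients flow as if identity.
    return (x + (q - x).detach()).reshape(w.shape)
```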
Deterministic and batch-invariant kernels ensure that training is bitwise reproducible. This isn't a data decision per se, but it determines how reliably data quality experiments can be run. When a training run is non-deterministic, isolating whether a performance change came from the data or from floating-point noise becomes difficult. DeepSeek's commitment to determinism means that data ablation studies — changing one variable in the training corpus and measuring the effect — produce reliable signals.
The sandbox infrastructure (DSec) manages hundreds of thousands of concurrent sandbox instances for agentic AI training and evaluation. Each sandbox maintains a trajectory log recording every command and its result, enabling deterministic replay and fine-grained provenance. The data generated by these sandboxes — agent interaction traces, tool-use patterns, error recovery sequences — feeds directly into the agent specialist's training pipeline.
The preemptible rollout service preserves partial generations through a token-granular Write-Ahead Log. When a training run is interrupted and restarted, the model resumes from where it stopped rather than regenerating from scratch. The report notes that regenerating introduces length bias — shorter responses survive interruptions more often, skewing the training distribution. This is a subtle but important data integrity measure: the distribution of training examples remains unbiased regardless of infrastructure disruptions.
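The storage format isn't described, but the mechanism is a classic write-ahead log applied at token granularity. An illustrative sketch:

```python
import os

class TokenWAL:
    """Token-granular write-ahead log for preemptible rollouts (illustrative;
    the report describes the mechanism but not its on-disk format).

    Every generated token is flushed before it counts as produced, so an
    interrupted generation resumes mid-response instead of being regenerated,
    which would over-sample short responses and bias training toward brevity.
    """

    def __init__(self, path: str):
        self.path = path

    def append(self, token_id: int) -> None:
        with open(self.path, "a") as f:
            f.write(f"{token_id}\n")
            f.flush()
            os.fsync(f.fileno())   # survive process preemption

    def recover(self) -> list[int]:
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [int(line) for line in f if line.strip()]

# On restart: prefix = wal.recover(); generation continues from
# prompt_ids + prefix rather than from the prompt alone.
```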
What Are V4's Known Limitations and Where Does Data Quality Still Matter?
The V4 report is direct about its gaps. On knowledge benchmarks, V4-Pro-Max still trails Gemini-3.1-Pro on several measures — SimpleQA Verified shows 57.9% versus 75.6%, and GPQA Diamond shows 90.1% versus 94.3%. The report characterises V4's reasoning trajectory as "trailing state-of-the-art frontier models by approximately 3 to 6 months."
On agent tasks, V4-Pro-Max matches leading open-source models like Kimi-K2.6 and GLM-5.1 but falls short of frontier closed models. The architecture achieved competitive results on code-agent tasks (80.6% on SWE-Verified), but Terminal Bench 2.0 and GDPval-AA still show meaningful gaps relative to Claude Opus 4.6 and GPT-5.4.
Training stability remains partially unsolved. The report describes two stabilisation techniques — Anticipatory Routing (computing routing indices one step ahead to decouple backbone and routing updates) and SwiGLU Clamping — and acknowledges that "a comprehensive theoretical understanding of their underlying mechanisms remains an open question." They work empirically; why they work is unresolved.
Long-context retrieval performance degrades beyond 128K tokens. On the MRCR 8-needle benchmark, V4-Pro-Max scores 0.92 at 128K tokens but drops to 0.59 at 1M tokens. The compressed attention mechanisms that make million-token contexts feasible introduce an accuracy cost at the extremes.
V4-Flash's smaller parameter budget shows clear penalties on knowledge-intensive tasks — the gap between Flash and Pro is largest on factual recall benchmarks, where raw parameter count correlates with memorisation capacity. The report frames this as expected: "larger parameter counts facilitate greater knowledge retention during pre-training."
These limitations map to data opportunities. Knowledge gaps can be addressed with higher-quality factual data in fine-tuning. Agent performance depends on the diversity and difficulty of training interaction traces. Long-context degradation connects to the distribution of document lengths in pre-training data. In each case, the model's ceiling is set by the training data that reaches it.
Conclusion: The 32-Trillion-Token Argument
DeepSeek V4 makes an argument through its data pipeline that is more interesting than any single benchmark number. The argument is: domain expertise is separable, and the right way to combine it is through distillation rather than simultaneous training.
Training ten independent specialists, each with their own curated dataset and reward signal, then merging them through full-vocabulary distillation into a single model — this is an expensive, engineering-heavy approach to post-training. DeepSeek chose it because the data quality signal is better. Each specialist trains on an undiluted dataset. Each reward model evaluates against domain-appropriate criteria. The student model inherits all of it without the compromises of multi-objective optimisation.
For practitioners building on open-weight models, the implication is clear. V4-Pro's weights are downloadable. The architecture is documented. What distinguishes applications built on V4 is the quality of fine-tuning data — and the V4 report provides a blueprint for how that data should be structured. Domain-specific SFT datasets. Rubric-guided evaluation data. Multi-mode reasoning examples. Tool-interaction traces. Each of these represents a distinct data curation challenge requiring distinct expertise.
The models that perform best in production won't be those that start from the largest base model. They'll be those that bring the right domain specialists — human and model alike — to each stage of the training pipeline.
What Does Building Specialist Training Data Actually Require?
V4's specialist training pipeline works because each domain expert trains on data curated by people who understand what "correct" means in that domain. A math specialist needs proofs verified by mathematicians. A coding specialist needs solutions tested against real compilers and test suites. An agent specialist needs interaction traces evaluated by engineers who build the systems being tested.
This is the kind of work where a wrong label in the SFT dataset becomes a wrong behaviour in the specialist, which becomes a wrong logit distribution transmitted through distillation to the final model. The quality bar for specialist training data is higher than for general-purpose fine-tuning, because full-vocabulary distillation faithfully preserves both correct and incorrect signals.
If you're building domain-specific AI systems on open-weight foundations — financial analysis, medical reasoning, legal review, code generation — Kili Technology provides the combination of verified domain specialists and structured quality workflows that specialist training pipelines demand.
Resources
Technical Papers and Model Releases
- DeepSeek-V4 Technical Report and Model Checkpoints — Full technical documentation and downloadable weights
- https://huggingface.co/collections/deepseek-ai/deepseek-v4
- DeepSeek-V3 Technical Report — Predecessor architecture documenting the 14.8T token training pipeline
- https://arxiv.org/abs/2412.19437
- mHC: Manifold-Constrained Hyper-Connections — Training stability via Birkhoff Polytope projection, confirmed in V4
- https://arxiv.org/abs/2512.24880
- Engram: Conditional Memory via Scalable Lookup — Memory-compute separation research; not included in V4's final architecture
- https://arxiv.org/abs/2601.07372
- DeepSeek-R1: Incentivizing Reasoning in LLMs through Reinforcement Learning — Reasoning model whose techniques informed V4's specialist training
- https://doi.org/10.1038/s41586-025-09422-z
Infrastructure and Tools
- DeepGEMM / MegaMoE — Open-sourced CUDA kernels for communication-computation overlap in fine-grained expert parallelism
- https://github.com/deepseek-ai/DeepGEMM/pull/304
- TileLang — Domain-specific language used for V4's kernel development
- Referenced in Wang et al., 2026 (ICLR)
- V4-Pro Inference Implementation — Open-source reference implementation for hybrid attention
- https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/tree/main/inference
Earlier Kili Coverage
- Data Story: A Deep Dive into DeepSeek V4 (Pre-Release Analysis) — Our March 2026 analysis based on pre-release research papers
- https://kili-technology.com/blog/data-story-deepseek-v4