Data Story: How MiniMax M3 Reversed Its Own Engineering Decision — and What That Reveals About Training Data
MiniMax shipped four consecutive models on full attention, publicly stating that sparse attention wasn't ready for production. Then M3 arrived with a new sparse attention architecture, 100T+ pre-training tokens, and the top score among open-weight models. The reversal tells a data engineering story that matters more than the benchmarks.


















