Case study

AI Research Organization: Advancing Formal Theorem Proving Through Large-Scale Data Annotation

How a leading AI research organization partnered with Kili Technology to formalize approximately 2,500 mathematical problems across AMC, AIME, and AMO difficulty levels, enabling breakthrough advances in automated theorem proving with contributions from 68 international researchers.

Summary

Challenges

A leading AI research organization needed to formalize approximately 2,500 mathematical competition problems across AMC, AIME, and AMO difficulty levels to create training data for automated theorem proving models.

Results

The collaboration produced a comprehensive formal mathematics dataset with approximately 2,500 formalized problems completed in around 3,000 hours, demonstrating exceptional efficiency for complex mathematical work. This dataset enabled development of advanced theorem proving models that achieved 92.2% pass rate on industry-standard benchmarks—among the highest performance levels in automated theorem proving. The resulting research outputs now serve as foundational infrastructure for the formal mathematics research community worldwide.

Solution

The organization partnered with Kili Technology to connect them with Lean4 experts and researchers, and build a collaborative formalization infrastructure supporting distributed expert contribution at scale. Kili's project management team developed a motivated Lean4 community while its platform enabled seamless coordination among contributors across continents, managing multi-difficulty problem workflows while ensuring quality through integrated validation processes.

Challenge

A pioneering AI research organization faced a formidable challenge in advancing automated theorem proving capabilities. To train state-of-the-art AI models capable of solving complex mathematical problems, they needed to create an extensive dataset of formally verified mathematical proofs—a task requiring both mathematical expertise and rigorous formalization standards.

The mathematical formalization challenge was immense: Creating training data for formal theorem proving requires translating natural language mathematical problems into Lean 4, a formal verification language. This process demands deep mathematical knowledge, programming expertise, and meticulous attention to logical precision. Each problem must be not only correctly formalized but also verified to ensure mathematical soundness.

Traditional approaches weren't scalable. Manual formalization by a small team would take years to produce sufficient training data. The organization needed to coordinate efforts across multiple expert contributors while maintaining consistency in formalization standards, handling problems of varying difficulty levels, and ensuring quality control across thousands of mathematical statements.

The stakes were research leadership. The organization aimed to advance the state-of-the-art in automated theorem proving—a field where training data quality directly determines model performance. Their goal was to develop AI systems capable of solving problems at International Mathematical Olympiad difficulty, requiring training datasets that captured mathematical reasoning at the highest levels.

The challenge extended beyond volume to coordination: managing 68 contributors from 15+ countries, predominantly PhD students and active researchers, while ensuring consistency and quality across all formalizations. The organization needed infrastructure that could support distributed collaboration at scale.

Solutions

The organization partnered with Kili Technology to build a comprehensive mathematical formalization infrastructure, coordinating expert contributors across continents to create high-quality training data for breakthrough AI research.

Global Expert Network CoordinationKili Technology's platform enabled seamless collaboration among 68 international contributors, including PhD students and active researchers from 15+ countries. The platform's collaborative features allowed experts to work simultaneously on different problem sets while maintaining consistent formalization standards across the entire dataset.

Multi-Difficulty Problem ManagementThe project encompassed problems across three distinct difficulty levels—AMC (American Mathematics Competitions), AIME (American Invitational Mathematics Examination), and AMO (American Mathematical Olympiad). Kili's flexible workflow management enabled the team to organize problems by difficulty, assign appropriate experts to each category, and track progress across all complexity tiers.

Rigorous Quality Assurance ProcessGiven the critical importance of correctness in formal mathematics, the platform supported multiple validation stages. Contributors formalized problems in Lean 4, with each statement undergoing peer review and compiler verification. This multi-layered quality control ensured that all 2,490 formalized problems met the highest standards of mathematical rigor.

Efficient Time InvestmentThe distributed annotation infrastructure allowed the team to complete the formalization of approximately 2,500 problems in around 3,000 hours—a remarkable efficiency given the mathematical complexity involved. The platform's user-friendly interface reduced the technical overhead of collaboration, allowing experts to focus on mathematical formalization rather than coordination logistics.

Integrated Development WorkflowThe resulting dataset served as the foundation for training advanced AI models, with seamless integration between the annotation platform and the research team's machine learning pipeline. Problems were systematically extracted, validated, and incorporated into training datasets for multiple model variants.

Outcome

The partnership delivered transformational results, producing one of the largest publicly available formal mathematics datasets and enabling state-of-the-art advances in automated theorem proving.

Groundbreaking Dataset Creation The collaboration produced a comprehensive formal mathematics dataset containing approximately 2,500 formalized mathematical problems spanning AMC, AIME, and AMO difficulty levels.

Global Research Collaboration The project successfully coordinated 68 expert contributors from 15+ countries, with Kili's platform being the backbone supporting distributed research collaboration across international boundaries. The predominantly academic contributor base—featuring PhD students and active researchers—ensured deep mathematical expertise in every formalization.

State-of-the-Art AI Model Performance The training data enabled development of a breakthrough theorem proving model suite, achieving a 92.2% pass rate on the miniF2F benchmark—among the highest performance levels in automated theorem proving to date. The model suite includes multiple variants optimized for different computational requirements, each demonstrating strong formal reasoning capabilities.

Case studies

Learn more from our customers

Kili Technology helped these teams build high-quality data workflows for their high-performing models

AI Research Organization: Advancing Formal Theorem Proving Through Large-Scale Data Annotation

How a leading AI research organization partnered with Kili Technology to formalize approximately 2,500 mathematical problems across AMC, AIME, and AMO difficulty levels, enabling breakthrough advances in automated theorem proving with contributions from 68 international researchers.

Learn more →

Jellysmack: Scaling AI performance 5x to match exponential growth

How social media giant Jellysmack transformed their NLP and Video ML capabilities, shipping models 10x faster while reducing project management overhead by 50% to support 300+ content creators across 11 countries.

Learn more →

How Eidos-Montréal uses machine learning to leverage the Voice of the Customer and make better strategic decisions

Eidos-Montréal, a leading video game studio, wanted to better understand how players and critics perceived their games. The Data Science and Analytics team set out to turn thousands of customer reviews and articles into actionable insights to guide both business and game design decisions. To achieve this, they needed a scalable and reliable way to annotate large volumes of unstructured text data — and found the right partner in Kili Technology.

Learn more →