IMProofBench Leaderboard
Evaluating AI Systems on Uncontaminated Research-Level Mathematics
What is IMProofBench?
IMProofBench is a benchmark designed to measure whether AI systems can produce rigorous mathematical proofs at the level of professional mathematicians. We maintain a private, uncontaminated problem set sourced from active mathematical research, ensuring models are tested on genuinely novel problems they haven't seen during training.
Key Features: 🔒 Private problems prevent data contamination • 👨‍🏫 Grading by expert mathematicians • ✓ Focus on proof correctness, not just final answers • 🔄 Regular updates with new problems
43 Active Problems • 10 Models Evaluated
Complete Solution Rate
What this measures: The percentage of problems where each model produced a complete and correct mathematical proof. Problems are graded by expert mathematicians on a 0-3 scale, where a score of 3 indicates a complete solution.
- 3 (Complete Solution): The model provided a fully correct mathematical proof
- 2 (Major Progress): Significant progress with key insights, but incomplete
- 1 (Minor Progress): Some correct steps or partial understanding
- 0 (No Progress): No meaningful progress toward the solution
The chart below shows what percentage of problems each model achieved at each progress level.
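As a rough illustration (not the IMProofBench grading code itself), the chart's per-level percentages can be computed from a list of expert grades on the 0-3 rubric; the function and variable names here are hypothetical:

```python
from collections import Counter

def progress_distribution(grades):
    """Map each rubric level (0-3) to the percentage of problems
    graded at that level."""
    counts = Counter(grades)
    total = len(grades)
    return {level: 100 * counts.get(level, 0) / total for level in range(4)}

def complete_solution_rate(grades):
    """Percentage of problems graded 3 (complete solution)."""
    return progress_distribution(grades)[3]

# Hypothetical grades for one model across ten problems
grades = [3, 2, 0, 3, 1, 0, 2, 3, 0, 1]
print(complete_solution_rate(grades))  # → 30.0
```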
Verifiable Subproblem Performance
What this measures: How well models solve specific, automatically verifiable components of larger problems. These subproblems test precise mathematical calculations and logical reasoning that can be checked without human review.
Examples: Computing specific values, verifying formulas, checking special cases, or determining truth values of modified statements. Scores represent the percentage of available points earned across all subproblems.
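A minimal sketch of this scoring, assuming earned and available points are tracked per subproblem (the data layout is an assumption, not the benchmark's actual format):

```python
def subproblem_score(earned, available):
    """Percentage of available points earned across all verifiable
    subproblems. `earned` and `available` are aligned lists of points
    per subproblem."""
    return 100 * sum(earned) / sum(available)

# Hypothetical: 1.5 of 4.0 available points earned
print(subproblem_score([1.0, 0.5, 0.0], [1.0, 1.0, 2.0]))  # → 37.5
```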
Head-to-Head Win Rates
What this measures: Direct pairwise comparison between models. Each cell shows how often the row model outperformed the column model on problems they both attempted. Higher numbers indicate better relative performance.
How to read: Find one model in the rows (↓) and another in the columns (→). The number is the percentage of shared problems on which the row model achieved a higher weighted subquestion score than the column model. Green shading indicates the row model generally outperforms the column model; red indicates the opposite.
Total Subquestions: 102
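The head-to-head comparison above can be sketched as follows; this is an illustrative reading of the matrix, not the leaderboard's implementation, and the dictionary layout is an assumption:

```python
def win_rate(scores_row, scores_col):
    """Win rate of the row model over the column model: percentage of
    shared problems on which the row model's weighted subquestion score
    is strictly higher. `scores_*` map problem id -> weighted score;
    only problems both models attempted are compared."""
    shared = scores_row.keys() & scores_col.keys()
    wins = sum(1 for p in shared if scores_row[p] > scores_col[p])
    return 100 * wins / len(shared)

# Hypothetical weighted scores for two models
a = {"p1": 2.5, "p2": 1.0, "p3": 0.0}
b = {"p1": 1.0, "p2": 1.0, "p4": 3.0}
# Shared problems: p1 and p2; the row model wins only p1
print(win_rate(a, b))  # → 50.0
```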