IMProofBench Leaderboard

Evaluating AI Systems on Uncontaminated Research-Level Mathematics

What is IMProofBench?

IMProofBench is a benchmark designed to measure whether AI systems can produce rigorous mathematical proofs at the level of professional mathematicians. We maintain a private, uncontaminated problem set sourced from active mathematical research, ensuring models are tested on genuinely novel problems they haven't seen during training.

Key Features: 🔒 Private problems prevent data contamination • 👨‍🏫 Human expert grading by mathematicians • ✓ Focus on proof correctness, not just answers • 🔄 Regular updates with new problems

Active Problems: 43
Models Evaluated: 10

Complete Solution Rate

What this measures: The percentage of problems where each model produced a complete and correct mathematical proof. Problems are graded by expert mathematicians on a 0-3 scale, where a score of 3 indicates a complete solution.

  • 3 (Complete Solution): The model provided a fully correct mathematical proof
  • 2 (Major Progress): Significant progress with key insights, but incomplete
  • 1 (Minor Progress): Some correct steps or partial understanding
  • 0 (No Progress): No meaningful progress toward the solution

The chart below shows what percentage of problems each model achieved at each progress level.
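As a rough sketch of how those percentages relate to the raw grades (the actual aggregation pipeline is not described here, and the model names and grade lists below are made up), the per-level breakdown and complete solution rate can be computed like this:

```python
from collections import Counter

# Hypothetical grades: the expert-assigned 0-3 score for each problem, per model.
grades = {
    "model-a": [3, 2, 0, 3, 1],
    "model-b": [2, 2, 1, 0, 0],
}

def progress_distribution(scores):
    """Return the fraction of problems graded at each progress level (0-3)."""
    counts = Counter(scores)
    return {level: counts.get(level, 0) / len(scores) for level in range(4)}

for model, scores in grades.items():
    dist = progress_distribution(scores)
    # "Complete Solution Rate" is the share of problems graded 3.
    print(model, f"complete: {dist[3]:.0%}", dist)
```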

Verifiable Subproblem Performance

What this measures: How well models solve specific, automatically-verifiable components of larger problems. These subproblems test precise mathematical calculations and logical reasoning that can be checked without human review.

Examples: Computing specific values, verifying formulas, checking special cases, or determining truth values of modified statements. Scores represent the percentage of available points earned across all subproblems.
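For illustration only, here is a minimal sketch of the points-earned calculation; the subproblem point values and results are invented, and the real rubric may weight components differently:

```python
# Illustrative only: subproblem scoring as "points earned / points available".
subproblem_results = [
    {"earned": 2.0, "available": 2.0},  # e.g. computed a specific value correctly
    {"earned": 0.5, "available": 1.0},  # e.g. partially verified a formula
    {"earned": 0.0, "available": 3.0},  # e.g. wrong truth value for a modified statement
]

earned = sum(r["earned"] for r in subproblem_results)
available = sum(r["available"] for r in subproblem_results)
print(f"{100 * earned / available:.1f}% of available points")  # 41.7% in this toy example
```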

Color Scale: ≥80% • ≥60% • ≥40% • ≥20% • <20% • N/A

Head-to-Head Win Rates

What this measures: Direct performance comparison between models. Each cell shows how often the row model outperformed the column model on problems they both attempted. Higher numbers indicate better relative performance.

How to read: Find a model in the rows (↓) and another in the columns (→). The number in each cell is the count of problems on which the row model achieved a higher weighted subquestion score than the column model. Green shading indicates that the row model generally outperforms the column model; red indicates the opposite.
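A minimal sketch of how such a pairwise comparison could be computed, assuming each model has one weighted subquestion score per problem (the weighting scheme, model names, and scores here are placeholders, not IMProofBench's actual data):

```python
# Each model's weighted subquestion score per problem (hypothetical values).
problem_scores = {
    "model-a": {"P1": 0.9, "P2": 0.4, "P3": 0.7},
    "model-b": {"P1": 0.6, "P2": 0.8, "P3": 0.2},
}

def head_to_head(row_model, col_model, scores):
    """Count problems both models attempted where the row model scored strictly higher."""
    shared = set(scores[row_model]) & set(scores[col_model])
    return sum(scores[row_model][p] > scores[col_model][p] for p in shared)

print(head_to_head("model-a", "model-b", problem_scores))  # 2 of the 3 shared problems
```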

Total Questions: 43
Total Subquestions: 102

Frequently Asked Questions

Where do the problems come from?

Problems are contributed by professional mathematicians from active research areas. Each problem undergoes rigorous peer review to ensure it requires genuine mathematical reasoning and proof construction skills. We prioritize problems that test deep mathematical understanding rather than computational ability.

How are models evaluated?

IMProofBench evaluations are conducted through an automated internal system to maintain consistency and prevent problem leakage. Every model is given 24 hours per problem, with up to 300,000 output tokens for the main question and 100,000 tokens per subquestion. All models have access to Python, SageMath, and web search to ensure a fair comparison.
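For readers who want those limits at a glance, here they are restated as a small configuration sketch; the field names are hypothetical and do not reflect the internal harness's real schema:

```python
# Restates the published evaluation limits in machine-readable form (field names invented).
EVAL_LIMITS = {
    "wall_clock_per_problem_hours": 24,
    "max_output_tokens_main_question": 300_000,
    "max_output_tokens_per_subquestion": 100_000,
    "tools": ["python", "sagemath", "web_search"],
}
```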

How are solutions graded?

Each model's solution is graded by expert mathematicians who evaluate the correctness and completeness of the mathematical proof. Graders assess whether the logical steps are valid, the proof strategy is sound, and the conclusion correctly answers the question.

Why are the problems kept private?

Keeping problems private is essential to prevent data contamination. Once problems become public, they can be included in training data for future models, invalidating the benchmark. Our private problem set ensures that models are tested on genuinely novel problems, providing a true measure of their mathematical reasoning capabilities rather than memorization.

How often is the benchmark updated?

We continuously add new problems as they pass our review process, and models are re-evaluated periodically on the growing problem set. This ensures the benchmark remains challenging and relevant as AI capabilities advance. Check back regularly for updated results and newly evaluated models.

How is IMProofBench different from other benchmarks?

IMProofBench focuses specifically on proof generation at the research level, not just problem solving. Our problems require constructing rigorous mathematical arguments, not merely finding answers, and human expert grading ensures we evaluate mathematical correctness rather than pattern matching.

How can I contribute?

Mathematicians can contribute problems through our submission system (account creation and verification required). We're particularly interested in problems from active research areas that test deep mathematical reasoning. You can also join our Zulip community to discuss the benchmark and stay updated on developments.