Frequently Asked Questions

Common questions about IMProofBench and how to contribute

About the Benchmark

Understanding what IMProofBench is and how it works

What's the goal of the IMProofBench project?
IMProofBench aims to track the progress of AI on mathematical reasoning as it appears in research-level problems. We stay close to the open-ended nature of problems in modern mathematics research, provide equitable access to all AI companies for evaluation, and keep most questions private to avoid overfitting and benchmark gaming.
How does IMProofBench differ from existing benchmarks?
IMProofBench focuses on proof generation at research level rather than just finding correct answers. Here's how we differ from existing approaches:
  • MATH and FrontierMath: Focus on unique numerical answers, which can sometimes be reached through shortcuts without genuine understanding
  • miniF2F: Requires formalized mathematics as output, limiting scope to problems that can be reasonably formalized
  • Recent proof-focused efforts such as Math Olympiad evaluations and MathArena: Focus on olympiad mathematics at the high-school level rather than research-level problems
Our approach evaluates AI systems on their ability to produce complete, rigorous mathematical arguments that would satisfy peer-review standards at the graduate/research level, directly targeting AI weaknesses like hallucination while remaining authentic to modern research practice.
Which AI models are tested against the benchmark?
We test frontier AI models including ChatGPT o3, Claude Opus 4, Gemini 2.5 Pro, Grok 4, and other state-of-the-art reasoning systems. Models are evaluated in a multi-turn environment with access to advanced tools like SageMath and online search to simulate realistic research conditions. The specific models tested may evolve as new systems become available.
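As a rough illustration only (not our actual evaluation harness), a multi-turn loop with a SageMath tool might look like the sketch below. Here query_model is a hypothetical stand-in for a provider's API, and the tool-call format is an assumption made purely for the example; only the `sage -c` command-line invocation is real.

  import subprocess

  def run_sage(code: str, timeout: int = 60) -> str:
      """Execute a snippet of SageMath code via the `sage -c` CLI and return its output."""
      result = subprocess.run(["sage", "-c", code],
                              capture_output=True, text=True, timeout=timeout)
      return result.stdout + result.stderr

  def evaluate(problem: str, query_model, max_turns: int = 10) -> str:
      """Hypothetical multi-turn loop: the model may request SageMath computations
      before committing to a final written proof."""
      transcript = [{"role": "user", "content": problem}]
      for _ in range(max_turns):
          reply = query_model(transcript)           # assumed to return a dict
          transcript.append({"role": "assistant", "content": reply["text"]})
          if reply.get("tool") == "sagemath":       # the model asked for a computation
              transcript.append({"role": "tool", "content": run_sage(reply["code"])})
          else:                                     # no tool request: treat the reply as final
              return reply["text"]
      return transcript[-1]["content"]

In practice, the grading of whatever proof comes back from such a loop is a separate step; see the grading question below.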
How are the AI-generated proofs graded?
The grading methodology is still being finalized; we will publish details once the design has settled.

Privacy & Data Protection

How we protect your contributed problems

Why is most of the dataset kept private?
We maintain benchmark integrity by preventing AI companies from training on test problems or optimizing specifically against our benchmark. As Goodhart's law states: "When a measure becomes a target, it ceases to be a good measure." A private dataset ensures unbiased evaluation and reliable capability measurement over time.
How do you prevent AI companies from using submitted problems for training?
Models are tested via API calls, and the standard API terms of major AI providers already include non-training commitments. We also plan to negotiate additional zero-data-retention agreements with the different providers once we start evaluating questions at scale.

Contribution & Testing

Benefits and process for contributors

Will I get co-authorship credit for contributing problems?
Yes! All contributors have the option to become co-authors on resulting publications. Author order may be determined by our contribution tracking system. Depending on the project's scale, multiple publications may emerge covering different aspects (grading systems, analysis of AI mistakes, etc.), each offering co-authorship opportunities.
Can I test my own problems against AI models before submitting?
Yes! Currently a single model is available for such testing (o4-mini with high reasoning effort and access to web search and a code interpreter). This is the current leading model on the related FrontierMath benchmark.
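For the curious, a configuration along these lines could be expressed roughly as in the sketch below using the OpenAI Python SDK. The exact parameter and tool names are assumptions for illustration; contributors test their problems through our submission platform rather than calling the API directly.

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  response = client.responses.create(
      model="o4-mini",
      reasoning={"effort": "high"},                  # high reasoning effort
      tools=[
          {"type": "web_search_preview"},            # online search
          {"type": "code_interpreter",               # sandboxed code execution
           "container": {"type": "auto"}},
      ],
      input="Candidate problem statement goes here.",
  )
  print(response.output_text)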
Can I submit a problem where I currently don't know the answer?
Yes, add the tag "open problem". Please only do so if you are confident that you could recognize and validate a correct answer. For these problems it is particularly important that you, as the question author, are also willing to help grade the AI answers.

Project Vision

Goals, timeline, and broader impact

What are the goals of the pilot project?
Our summer 2025 pilot aims to:
  • Collect 25-50 high-quality problems by end of August 2025
  • Test against frontier AI models to establish baseline capabilities
  • Submit a proof-of-concept paper to ICLR 2026
  • Focus initially on algebraic geometry and related fields
  • Validate our evaluation methodology and refine the contribution process
When will benchmark results be published?
We plan to finalize the pilot phase by September 2025 and submit initial results to ICLR 2026. Results will include both quantitative performance metrics and qualitative analysis of AI strengths and weaknesses in mathematical reasoning. Ongoing results may be shared as the project scales beyond the pilot.
How will this benchmark help/harm AI development?
IMProofBench is designed to measure progress without accelerating capabilities. By maintaining a private dataset and focusing on evaluation rather than providing training data, we aim to offer unbiased measurement of AI mathematical reasoning without contributing to potentially concerning capability advances. Our focus is on understanding current limitations rather than providing optimization targets.
Is this connected to any specific research institution or company?
IMProofBench is an academic project led by researchers at ETH Zurich and Aarhus University. We maintain independence from AI companies while seeking future collaborations with academic research institutions for community outreach and support. The project aims to serve the broader mathematical research community rather than any specific commercial interest.

Have More Questions?

Join our community discussion or contact the project team directly.