IMProofBench

Informal Mathematical Proof Benchmark

IMProofBench evaluates the ability of AI systems to create research-level mathematical proofs. We maintain a curated, private repository of PhD-level problems across pure mathematics to measure genuine mathematical reasoning capabilities while preventing data contamination and benchmark overfitting.

Model Performance

Top models by percentage of benchmark questions with complete and correct solutions.

Benchmark Results
1. GPT-5.4: 46.4%
2. Gemini 3.1 Pro Preview: 45.7%
3. GPT-5.2 Pro (web search): 39.0%
4. Claude Opus 4.6: 33.6%
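For concreteness, the headline number is simply the fraction of benchmark questions a model solves with a complete and correct proof. A minimal sketch of that computation in Python, using a hypothetical per-question grade labeling (the labels and data layout are illustrative, not the benchmark's actual schema):

    # Sketch of the headline metric: percentage of benchmark questions
    # with a complete and correct solution. Grade labels are hypothetical.
    grades = {
        "q1": "correct",    # complete and correct proof
        "q2": "partial",    # right idea, gaps in the argument
        "q3": "incorrect",
        "q4": "correct",
    }

    solved = sum(1 for g in grades.values() if g == "correct")
    score = 100 * solved / len(grades)
    print(f"{score:.1f}% of questions solved")  # 50.0% of questions solved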

Become a Contributor!

Create a question and see what state-of-the-art models can do in your field of mathematics. If your question is included in the benchmark, you receive co-authorship on future papers. You retain full rights to your question and can retract it at any time.

Create a Question

Community

Connect with the project team and mathematical researchers.

About

Learn about the benchmark's goals and methodology.

Key Features

AI Model Testing: test your problems against frontier models and get immediate feedback.
Peer Review System: expert review ensures problem quality and appropriate difficulty.
Automated Grading: objectively gradable subquestions complement expert assessment of the full proof (see the sketch below).
Privacy Preservation: keeping the majority of the dataset private prevents overfitting and gaming.
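To illustrate the automated-grading idea: each problem can carry a few short-answer subquestions with known ground truth, which a script grades mechanically while human experts assess the full proof. Below is a minimal sketch under assumed conventions (case-insensitive exact match for strings, small tolerance for numbers); the function names and answer format are hypothetical, not IMProofBench's actual grading code.

    # Hypothetical automated grader for short-answer subquestions.
    # The full proof is still assessed separately by human experts.
    def grade_subanswer(submitted, expected, tol=1e-9):
        if isinstance(expected, (int, float)):
            try:
                return abs(float(submitted) - float(expected)) <= tol
            except (TypeError, ValueError):
                return False
        return str(submitted).strip().lower() == str(expected).strip().lower()

    def grade_question(submission, answer_key):
        # Returns the fraction of subquestions answered correctly,
        # plus a per-subquestion breakdown.
        results = {k: grade_subanswer(submission.get(k), v)
                   for k, v in answer_key.items()}
        return sum(results.values()) / len(results), results

    score, detail = grade_question(
        {"a": "42", "b": "abelian"},
        {"a": 42, "b": "Abelian"},
    )
    print(score, detail)  # 1.0 {'a': True, 'b': True}

Tying subquestions to objective checks like this gives a reproducible signal even when full proof grading requires expert judgment.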