IMProofBench

Informal Mathematical Proof Benchmark

IMProofBench evaluates the ability of AI systems to produce research-level mathematical proofs. We maintain a curated, largely private repository of PhD-level problems across pure mathematics, measuring genuine mathematical reasoning while preventing data contamination and benchmark overfitting.

Benchmark Status

Question Submission Pipeline

  Draft:          147
  Under Review:    30
  Accepted:        42
  Graded:          39

Participants:  156
Models:         10
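
A question moves through these stages in order. As a minimal sketch of that workflow, the pipeline could be modeled as below; the state names and the assumption of a strictly linear progression are illustrative, not taken from the benchmark's code.

    # Hypothetical model of the submission pipeline above; state names
    # and the linear progression are assumptions for illustration.
    from enum import Enum

    class QuestionStatus(Enum):
        DRAFT = "draft"                # being written by a contributor
        UNDER_REVIEW = "under_review"  # awaiting expert peer review
        ACCEPTED = "accepted"          # approved for inclusion
        GRADED = "graded"              # model attempts have been graded

    # Assumed progression: DRAFT -> UNDER_REVIEW -> ACCEPTED -> GRADED
    NEXT_STATE = {
        QuestionStatus.DRAFT: QuestionStatus.UNDER_REVIEW,
        QuestionStatus.UNDER_REVIEW: QuestionStatus.ACCEPTED,
        QuestionStatus.ACCEPTED: QuestionStatus.GRADED,
    }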

Questions

Create and review mathematical proof problems to test frontier AI models.

Community

Connect with mathematical researchers and track contributions.

Dashboard

Real-time statistics and benchmark performance metrics.

About

Learn about the benchmark's goals, methodology, and team.

Top Models

Percentage of questions with a complete and correct solution:

  1. GPT-5             23.1%
  2. Grok 4            17.9%
  3. Gemini 2.5 Pro     7.7%
  4. o4-mini            5.1%
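
For concreteness, here is a minimal sketch of how such a solve rate could be computed; the data layout and names are assumptions, not the benchmark's actual API. (The published figures are consistent with a denominator of 39 graded questions, e.g. 9/39 rounds to 23.1%, though that reading is an inference.)

    # Hypothetical sketch of the leaderboard metric: the share of graded
    # questions a model solved with a complete and correct proof.
    # All names here are illustrative assumptions, not the benchmark's API.
    from dataclasses import dataclass

    @dataclass
    class GradedAttempt:
        model: str
        question_id: str
        complete_and_correct: bool  # grader judged the full proof correct

    def solve_rate(attempts: list[GradedAttempt], model: str) -> float:
        """Percentage of a model's graded attempts judged complete and correct."""
        mine = [a for a in attempts if a.model == model]
        if not mine:
            return 0.0
        return 100.0 * sum(a.complete_and_correct for a in mine) / len(mine)

    # Example: 9 of 39 graded questions solved -> 23.1%
    demo = [GradedAttempt("GPT-5", f"q{i}", i < 9) for i in range(39)]
    print(f"{solve_rate(demo, 'GPT-5'):.1f}%")  # prints 23.1%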

Key Features

AI Model Testing
  Test problems against frontier models with immediate feedback.

Peer Review System
  Expert review ensures problem quality and appropriate difficulty.

Automated Grading
  Subquestions enable objective evaluation alongside proof assessment (see the sketch below).

Privacy Preservation
  A majority-private dataset prevents overfitting and gaming.
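
To illustrate the Automated Grading feature, here is a minimal sketch of how objectively checkable subquestions might sit alongside the main proof task; the schema and field names are assumptions, not the benchmark's actual format.

    # Hypothetical schema: a problem whose subquestions have short,
    # objectively checkable answers, graded independently of the proof.
    # Field names are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class Subquestion:
        prompt: str    # e.g. "What is the dimension of the moduli space?"
        expected: str  # exact short answer, e.g. "3"

    @dataclass
    class Problem:
        statement: str                       # the main proof task
        subquestions: list[Subquestion] = field(default_factory=list)

    def grade_subquestions(problem: Problem, answers: list[str]) -> float:
        """Fraction of subquestions answered exactly; the written proof
        is assessed separately."""
        if not problem.subquestions:
            return 0.0
        hits = sum(a.strip() == sq.expected
                   for sq, a in zip(problem.subquestions, answers))
        return hits / len(problem.subquestions)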