IMProofBench

Informal Mathematical Proof Benchmark

IMProofBench evaluates the ability of AI systems to produce research-level mathematical proofs. We maintain a curated, largely private repository of PhD-level problems across pure mathematics, measuring genuine mathematical reasoning while preventing data contamination and benchmark overfitting.

Benchmark Status

Question Submission Pipeline

  Draft:          147
  Under Review:    30
  Accepted:        42
  Graded:          39

Participants:  156
Models:         10
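
A question moves through these stages in order. As a minimal sketch of that workflow, the pipeline could be modeled as below; the state names and the assumption of a strictly linear progression are illustrative, not taken from the benchmark's code.

    # Hypothetical model of the submission pipeline above; state names
    # and the linear progression are assumptions for illustration.
    from enum import Enum

    class QuestionStatus(Enum):
        DRAFT = "draft"                # being written by a contributor
        UNDER_REVIEW = "under_review"  # awaiting expert peer review
        ACCEPTED = "accepted"          # approved for inclusion
        GRADED = "graded"              # model attempts have been graded

    # Assumed progression: DRAFT -> UNDER_REVIEW -> ACCEPTED -> GRADED
    NEXT_STATE = {
        QuestionStatus.DRAFT: QuestionStatus.UNDER_REVIEW,
        QuestionStatus.UNDER_REVIEW: QuestionStatus.ACCEPTED,
        QuestionStatus.ACCEPTED: QuestionStatus.GRADED,
    }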

Questions

Create and review mathematical proof problems to test frontier AI models.

Community

Connect with mathematical researchers and track contributions.

Dashboard

Real-time statistics and benchmark performance metrics.

About

Learn about the benchmark's goals, methodology, and team.

Top Models

Percentage of questions with a complete and correct solution:

  1. GPT-5             23.1%
  2. Grok 4            17.9%
  3. Gemini 2.5 Pro     7.7%
  4. o4-mini            5.1%
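
For concreteness, here is a minimal sketch of how such a solve rate could be computed; the data layout and names are assumptions, not the benchmark's actual API. (The published figures are consistent with a denominator of 39 graded questions, e.g. 9/39 rounds to 23.1%, though that reading is an inference.)

    # Hypothetical sketch of the leaderboard metric: the share of graded
    # questions a model solved with a complete and correct proof.
    # All names here are illustrative assumptions, not the benchmark's API.
    from dataclasses import dataclass

    @dataclass
    class GradedAttempt:
        model: str
        question_id: str
        complete_and_correct: bool  # grader judged the full proof correct

    def solve_rate(attempts: list[GradedAttempt], model: str) -> float:
        """Percentage of a model's graded attempts judged complete and correct."""
        mine = [a for a in attempts if a.model == model]
        if not mine:
            return 0.0
        return 100.0 * sum(a.complete_and_correct for a in mine) / len(mine)

    # Example: 9 of 39 graded questions solved -> 23.1%
    demo = [GradedAttempt("GPT-5", f"q{i}", i < 9) for i in range(39)]
    print(f"{solve_rate(demo, 'GPT-5'):.1f}%")  # prints 23.1%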

Key Features

AI Model Testing
  Test problems against frontier models with immediate feedback.

Peer Review System
  Expert review ensures problem quality and appropriate difficulty.

Automated Grading
  Subquestions enable objective evaluation alongside proof assessment (see the sketch below).

Privacy Preservation
  A majority-private dataset prevents overfitting and gaming.
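
To illustrate the Automated Grading feature, here is a minimal sketch of how objectively checkable subquestions might sit alongside the main proof task; the schema and field names are assumptions, not the benchmark's actual format.

    # Hypothetical schema: a problem whose subquestions have short,
    # objectively checkable answers, graded independently of the proof.
    # Field names are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class Subquestion:
        prompt: str    # e.g. "What is the dimension of the moduli space?"
        expected: str  # exact short answer, e.g. "3"

    @dataclass
    class Problem:
        statement: str                       # the main proof task
        subquestions: list[Subquestion] = field(default_factory=list)

    def grade_subquestions(problem: Problem, answers: list[str]) -> float:
        """Fraction of subquestions answered exactly; the written proof
        is assessed separately."""
        if not problem.subquestions:
            return 0.0
        hits = sum(a.strip() == sq.expected
                   for sq, a in zip(problem.subquestions, answers))
        return hits / len(problem.subquestions)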