I was deeply impressed by Google and OpenAI's mathematical reasoning capabilities, especially Google Deep Think winning IMO 2025. This achievement inspired me to train my own AI agent to tackle the IMO 2025 problems.
This repository is forked from lyang36/IMO25, created by Huang, Yichen and Yang, Lin F. from the Google team. I am incredibly grateful for their generosity in open-sourcing this work and making it accessible to the research community.
Special thanks to:
- Google Research Team (Huang, Yichen and Yang, Lin F.) for the original implementation
- OpenAI for developing
openai/gpt-oss-120b, the reasoning model powering my agent
I use openai/gpt-oss-120b for all mathematical reasoning tasks. This open-source model provides strong reasoning capabilities when configured with appropriate reasoning effort levels.
The original authors mentioned they resolved Problems 1-5 with Gemini 2.5 Pro. I also saw PR #18 claiming to solve problems with openai/gpt-oss-120b using low reasoning effort.
However, when I actually tested this claim, I discovered the verification system was broken and ZERO problems were actually resolved with low reasoning effort. This taught me an important lesson: don't be fooled by LLM outputs - always verify rigorously.
I systematically tested multiple AI reasoning approaches with increasing sophistication:
Why BFS is efficient:
- Parallel exploration: Generates 3 diverse solution attempts per cycle
- Best-of-N selection: Chooses solutions based on verification scores
- Natural diversity: Parallel generation provides variation without needing deduplication fixes
- Reliable convergence: Successfully finds correct solutions with high reasoning effort
- Cost-effective: Faster than iterative refinement for exploration-heavy problems
Performance with high reasoning:
- Successfully resolved all 5 problems (Problems 1-5)
- Average success rate: High with n=3 initial attempts
- Method: Systematic parallel exploration with rigorous verification
Why we don't use MCTS:
- Insufficient statistical evidence: Only n=1 sample tested (need n=64 for 80% statistical power)
- No clear advantage: MCTS showed correct answer once, but BFS eventually worked reliably
- Complexity overhead: Tree-guided exploration (UCB1) adds complexity without proven benefit
- Phase 1 mismatch: Deduplication/adaptive temperature fixes solve problems neither BFS nor MCTS actually have
- Recommendation from analysis: Expert panel recommended proper experiments before choosing MCTS
See docs/industry_reports/NETFLIX_DATA_SCIENCE_BFS_VS_MCTS_ANALYSIS.md for detailed statistical analysis.
Why we don't use RLAC:
- Fundamental architectural limitation: Can strengthen weak proofs of correct answers, but cannot recover when the answer itself is wrong
- Infinite loop problem: Enters endless cycle trying to prove wrong answer with different approaches rather than reconsidering the answer
- Counterexample validation failures: Generates abstract counterexamples without concrete numerical values, causing false positives
- Answer lock issues: Mechanism prevents legitimate corrections once an answer gets locked
- Low reasoning in defense mode: Reduces effectiveness of adversarial refinement
- 0% success rate: Timeout on all test cases despite completing 15 adversarial rounds
See docs/rlac_test_log_analysis_summary.md for failure analysis.
Why formula derivation is powerful:
- 10-100x speedup: For formula-based problems, derives answer in seconds vs. hours of search
- Mathematical elegance: LLM recognizes patterns from independently verified small cases
- Cost efficiency: $0.0001 vs $12-75 for full BFS runs
- Independent verification: Small cases verified via CP-SAT constraint solver (no circular reasoning)
- Perfect for Problem 6: Derived formula
n+2k-3from 3 verified cases, got correct answer2112instantly
Example (Problem 6):
Small cases (verified independently):
n=4 (k=2): 5 tiles
n=9 (k=3): 12 tiles
n=16 (k=4): 21 tiles
LLM derived pattern: f(n,k) = n + 2k - 3
Applied to n=2025 (k=45):
2025 + 2(45) - 3 = 2112 ✓
See logs/test_imo06_output.log and docs/validation/BFS_VALIDATION_FINAL_REPORT.md for details.
I successfully resolved all 6 IMO 2025 problems with high reasoning effort, confirmed against ground truth from official IMO solutions (papers/IMO-2025-notes.pdf):
| Problem | Answer | Method | Log Reference |
|---|---|---|---|
| Problem 1 | k ∈ {0, 1, 3} | BFS (high reasoning, n=3) | bfs_validate_high_n3_problem1 |
| Problem 2 | Proof (geometry) | BFS (high reasoning, n=3) | bfs_validate_high_n3_problem2 |
| Problem 3 | c = 4 | BFS (high reasoning, n=3) | bfs_validate_high_n3_problem3 |
| Problem 4 | a₁ = 12^e · 6 · ℓ | BFS (high reasoning, n=3) | bfs_validate_high_n3_problem4 |
| Problem 5 | λ > 1/√2 (Alice wins) | BFS (high reasoning, n=3) | bfs_validate_high_n3_problem5 |
| Problem 6 | 2112 | Formula derivation + verified small cases | logs/test_imo06_output.log |
Ground truth confirmation: All answers verified against Evan Chen's official IMO 2025 solution notes (papers/IMO-2025-notes.pdf).
Problem 6 was SIGNIFICANTLY HARDER THAN EXPECTED:
- Human success rate: ~1% (even expert mathematicians struggled)
- AI success rate: 0% (all frontier models failed)
- Model behavior:
openai/gpt-oss-120balways generates similar wrong constructions and wrong answers - Gemini 3 Pro advantage: With luchadore_lunchables's reddit prompt, only Gemini 3 Pro occasionally generates correct solutions; GPT-5 and Claude cannot
All traditional methods failed:
- ❌ BFS with high reasoning: Wrong answer
- ❌ MCTS exploration: Wrong answer
- ❌ RLAC adversarial refinement: Stuck proving wrong answer
- ❌ Direct prompting: Wrong constructions repeatedly
The problem: LLMs get stuck in local optima, generating plausible-but-wrong tile configurations.
As a strong engineer but not a mathematician, I couldn't craft correct constructions manually. Instead, I combined my engineering skills with AI's pattern recognition:
My approach:
-
Brute-force small cases using CP-SAT constraint programming solver:
- n=4 (k=2): Exhaustively verified → 5 tiles (100% confidence)
- n=9 (k=3): Official IMO answer → 12 tiles (ground truth)
- n=16 (k=4): Constraint solver → 21 tiles (verified independently)
-
Let LLM derive the formula from these verified cases:
- Pattern: 5, 12, 21 for k=2, 3, 4
- LLM recognized: f(n,k) = n + 2k - 3
- Verification: All cases match ✓
-
Apply to target problem:
- n=2025, k=45 (since 45² = 2025)
- f(2025, 45) = 2025 + 2(45) - 3 = 2112 ✓
Why this works:
- ✅ No circular reasoning (small cases verified independently via CP-SAT)
- ✅ Clean, simple, mathematically elegant
- ✅ Fast (38 seconds) and cheap ($0.0001)
- ✅ Confirmed by IMO official solution notes
See implementation in logs/test_imo06_output.log.
- Verification is critical - LLMs can confidently produce wrong answers. Always verify rigorously.
- High reasoning matters - Low/medium reasoning fails; high reasoning + BFS succeeds.
- Formula recognition works - For pattern-based problems, LLMs excel at deriving formulas from verified small cases.
- Engineering + AI - Combining constraint solvers (brute-force) with LLM pattern recognition solves problems neither can solve alone.
- Don't trust claims - Always test. PR #18 claimed success with low reasoning, but verification was broken.
This work builds upon extensive research and community contributions:
- Official IMO Solutions:
papers/IMO-2025-notes.pdfby Evan Chen - Research Papers: All papers in
papers/folder - Community Insights: luchadore_lunchables's reddit analysis on Problem 6
- Original Repository: lyang36/IMO25 by Google Research Team
- Agent Usage Guide:
agent_gpt_oss.md- Complete reference forcode/agent_gpt_oss.py - Architecture:
CLAUDE.md- System architecture and development guidelines - Source Code:
code/- All agent implementations and core modules - Problem Statements:
problems/- All 6 IMO 2025 problem files - Test Suite:
test/- Unit tests and integration tests - Test Logs:
logs/- Validation run logs and test results - Analysis Scripts:
scripts/- Shell and Python analysis scripts - BFS Analysis:
docs/bfs/- Comprehensive BFS performance reports - Formula Derivation:
docs/validation/- Formula derivation methodology and validation - Bug Fixes:
docs/bugs_rca/- Root cause analyses and fixes
# Install dependencies
pip install -r requirements.txt
# Set API keys
export GPT_OSS_API_URL=https://openrouter.ai/api/v1/chat/completions
export GPT_OSS_API_KEY=your_openrouter_api_key
export GPT_OSS_MODEL_NAME=openrouter/openai/gpt-oss-120b
# Run BFS with high reasoning (recommended)
python code/agent_gpt_oss.py problems/imo01.txt \
--num-initial-attempts 3 \
--solution-reasoning high \
--verification-reasoning high \
--log output.log
# Run formula derivation for Problem 6
python code/agent_gpt_oss.py problems/imo06.txt \
--use-formula-derivation \
--formula-reasoning high \
--log output.logSee agent_gpt_oss.md for complete usage documentation.
This was the last project I enjoyed in 2025 (sorry for a few days' delay!). I will continue exploring super-intelligence research in 2026.
Feel free to reach out or follow my work:
Note to researchers: This repository demonstrates that combining traditional engineering approaches (constraint solvers, brute-force verification) with modern LLM capabilities (pattern recognition, formula derivation) can solve problems that neither approach can solve alone. The key insight for Problem 6 was recognizing when to stop asking the LLM to construct solutions and instead ask it to recognize patterns in independently verified data.