SU-01 on arXiv: a 30B model reaches gold-medal level at IMO and USAMO

SU-01 is a new reasoning training methodology published on arXiv on May 14, 2026 (Yafu Li and 27 co-authors, corresponding author Runzhe Zhan). A model built on a 30B-parameter A3B backbone reaches gold-medal performance on the International Mathematical Olympiad (IMO) 2025, the USA Mathematical Olympiad (USAMO) 2026, and the International Physics Olympiad (IPhO) 2024-2025 through three sequential phases: reverse-perplexity curriculum SFT on 340K trajectories, two-stage RL, and test-time scaling. Reasoning chains exceed 100K tokens.

The authors describe SU-01 as a unified methodology for turning reasoning backbones into olympiad-level solvers. The resulting model, built on a 30B-parameter A3B backbone, reaches gold-medal performance on three elite competition benchmarks: IMO 2025, USAMO 2026, and IPhO 2024-2025.

How large is the model and how is its performance measured?

SU-01 uses a 30B-parameter A3B backbone (the A3B suffix conventionally denotes roughly 3B parameters active per token in a mixture-of-experts design), which is significantly smaller than many frontier models competing in the same space. Performance is reported as medal level on olympiad problem sets: gold at the International Mathematical Olympiad 2025 and the USA Mathematical Olympiad 2026. That result suggests that training methodology matters more than raw parameter scaling for long-horizon mathematical and physical reasoning. Reasoning chains exceed 100,000 tokens on individual problems, which indicates that the model builds detailed proof traces rather than guessing answers.

How do the three training phases work?

Phase 1: Reverse-perplexity curriculum SFT. The supervised fine-tuning (SFT) phase uses approximately 340,000 trajectories, each under 8K tokens. "Reverse-perplexity curriculum" means the training schedule moves from the trajectories the model finds most probable (lowest perplexity, easiest) toward the least probable (highest perplexity, hardest), gradually developing proof-search and verification behaviors.
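The ordering described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each trajectory already carries per-token log-probabilities scored by the base model, and the field names (`id`, `logprobs`), the helper `curriculum_stages`, and the equal-size staging are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one trajectory: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def curriculum_stages(trajectories: List[Dict], num_stages: int) -> List[List[Dict]]:
    """Sort trajectories from lowest to highest perplexity and cut them into stages.

    Assumption: stages of roughly equal size; the source does not say how
    the schedule is actually partitioned.
    """
    if not trajectories:
        return []
    ordered = sorted(trajectories, key=lambda t: perplexity(t["logprobs"]))
    size = math.ceil(len(ordered) / num_stages)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Toy data: log-probabilities are made up for illustration.
toy = [
    {"id": "a", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "b", "logprobs": [-2.0, -1.5, -2.5]},
    {"id": "c", "logprobs": [-0.8, -0.9, -0.7]},
    {"id": "d", "logprobs": [-0.3, -0.4, -0.2]},
]
stages = curriculum_stages(toy, num_stages=2)
stage_ids = [[t["id"] for t in stage] for stage in stages]
print(stage_ids)  # [['a', 'd'], ['c', 'b']]
```

Training would then iterate over the stages in order, so the model sees the trajectories it already finds likely before the ones it finds surprising.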

Phase 2: Two-stage RL pipeline. Reinforcement learning runs in two sub-stages: first with verifiable reward signals (a clear binary "correct/incorrect" for mathematical answers), then with proof-level optimization (a continuous reward for argument quality, not just the final answer).
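The difference between the two reward signals can be made concrete with a small sketch. Everything here is an assumption for illustration: the `Rollout` type, the exact-match check, and the idea that a grader supplies per-step scores in [0, 1] are hypothetical stand-ins, since the source does not specify how the proof-level reward is computed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rollout:
    final_answer: str
    # Hypothetical per-step grades in [0, 1] from some proof grader.
    step_scores: List[float] = field(default_factory=list)


def verifiable_reward(rollout: Rollout, reference_answer: str) -> float:
    """Stage 1: binary reward, 1.0 only when the final answer matches the reference."""
    return 1.0 if rollout.final_answer.strip() == reference_answer.strip() else 0.0


def proof_level_reward(rollout: Rollout) -> float:
    """Stage 2: continuous reward; here simply the mean of the per-step grades."""
    if not rollout.step_scores:
        return 0.0
    return sum(rollout.step_scores) / len(rollout.step_scores)


def reward(rollout: Rollout, reference_answer: str, stage: int) -> float:
    """Pick the reward signal for the current RL sub-stage."""
    if stage == 1:
        return verifiable_reward(rollout, reference_answer)
    if stage == 2:
        return proof_level_reward(rollout)
    raise ValueError(f"unknown stage: {stage}")


# A rollout with the right answer but a partly flawed argument.
r = Rollout(final_answer="42", step_scores=[1.0, 0.5, 0.0])
print(reward(r, "42", stage=1))  # 1.0
print(reward(r, "42", stage=2))  # 0.5
```

The toy rollout shows why the second stage matters: a correct final answer earns full reward in stage 1 even when the supporting argument is weak, while the stage 2 signal penalizes the flawed steps.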

Phase 3: Test-time scaling. At inference time, the model uses extended thinking and parallel sampling on competitive problem sets, allocating more compute to harder problems.
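One common way to combine parallel sampling with a difficulty-dependent budget is majority voting over sampled answers. The sketch below uses that technique as a stand-in; the source does not say how SU-01 aggregates samples or estimates difficulty, so `samples_for`, `solve_with_voting`, the linear budget rule, and the fake sampler are all hypothetical. Samples are drawn sequentially here for simplicity, whereas a real system would batch them in parallel.

```python
import itertools
from collections import Counter
from typing import Callable


def samples_for(difficulty: float, min_samples: int = 4, max_samples: int = 64) -> int:
    """Scale the sample budget linearly with a difficulty estimate in [0, 1]."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    return min_samples + round(difficulty * (max_samples - min_samples))


def solve_with_voting(sample_answer: Callable[[], str], difficulty: float) -> str:
    """Draw a difficulty-dependent number of answers and return the most frequent one."""
    n = samples_for(difficulty)
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


# Deterministic stand-in for a model: answers "7" twice as often as "3".
fake_model = itertools.cycle(["7", "7", "3"])
answer = solve_with_voting(lambda: next(fake_model), difficulty=0.5)
print(samples_for(0.5), answer)  # 34 7
```

Majority voting only applies when answers can be compared directly, as with numeric results; selecting among full proofs would need a different criterion, such as a grader score.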

What does SU-01 mean for reasoning models in general?

The paper positions the methodology as a portable recipe that can be applied to different reasoning backbones. If a 30B model can reach gold-medal level with SU-01 training, existing open-source models (Llama, Qwen, DeepSeek) may have untapped reasoning capacity that the right training pipeline could unlock. The paper also demonstrates generalization beyond mathematics: the IPhO (physics) results show transfer across STEM domains, not only within pure mathematics.

The approach builds on the wave of 2025-2026 papers arguing that training data quality and methodology are more decisive than scaling. It is complementary to arXiv:2605.10870 on memory optimization and arXiv:2605.11882 (FATE) on safety alignment.

Frequently Asked Questions

What is the architecture of the SU-01 model?

SU-01 uses a 30B-parameter A3B backbone, which is smaller than many frontier models achieving similar olympiad-level reasoning. This suggests that training methodology is more critical than model size for long-horizon math and physics problem solving.

How do the three training phases work?

Phase 1 uses reverse-perplexity curriculum SFT on 340K trajectories (each under 8K tokens) to develop proof-search and verification behaviors; Phase 2 is a two-stage RL pipeline from verifiable rewards to proof-level optimization; Phase 3 adds test-time scaling techniques for competitive problem sets.

arXiv:2605.13301 SU-01: 30B model reaches gold-medal level at IMO 2025, USAMO 2026, and IPhO through three-phase training
