MathNet: 30,676 olympiad problems from 47 countries, SOTA models still fall short
Why it matters
An MIT team published MathNet, a multimodal benchmark of 30,676 olympiad math problems from 47 countries and 17 languages. Gemini-3.1-Pro achieves 78.4%, GPT-5 reaches 69.3%, and embedding models struggle to retrieve mathematically equivalent problems.
An MIT research team led by Shaden Alshammari published MathNet, the largest multimodal benchmark of olympiad math problems to date. The paper was accepted at the ICLR 2026 conference.
What MathNet brings
MathNet contains 30,676 problems with expert-written solutions, collected from two decades of mathematics olympiads across 47 countries and 17 languages. The dataset is multimodal: it includes both the textual formulations and the diagrams, graphs, and geometric sketches that are unavoidable in olympiad mathematics. The benchmark evaluates three tasks: problem solving, mathematical retrieval accuracy, and retrieval-augmented problem solving. For the last two tasks, the researchers manually curated pairs of problems that are mathematically equivalent but formulated in structurally different ways.
Results from current models
SOTA models still show significant gaps. Gemini-3.1-Pro achieves 78.4% accuracy, while GPT-5 reaches 69.3%. These are impressive numbers for complex olympiad problems, but they confirm that top-level mathematical reasoning is not yet solved. An interesting finding is that embedding models (which map text into numerical vectors for similarity search) struggle to find mathematically equivalent problems when they are phrased with different vocabulary. This matters because retrieval-augmented approaches depend directly on retrieval quality.
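To illustrate why surface vocabulary matters for retrieval, here is a minimal, self-contained sketch. It uses plain bag-of-words cosine similarity, not any model from the paper: two formulations of the same Diophantine problem share almost no tokens, so their measured similarity is near zero even though they are mathematically equivalent.

```python
import math
from collections import Counter

def bow_vector(text):
    # Simple bag-of-words embedding stand-in: lowercase tokens and count them.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two phrasings of the same underlying problem (illustrative examples,
# not taken from the MathNet dataset).
p1 = "Find all integer solutions of x^2 - 2y^2 = 1"
p2 = "Determine every pair of whole numbers a b with a squared minus twice b squared equal to one"

sim = cosine(bow_vector(p1), bow_vector(p2))
print(f"similarity: {sim:.3f}")  # low despite mathematical equivalence
```

Learned embeddings do much better than this caricature, but the MathNet retrieval results suggest that even strong general-purpose embedding models suffer a version of the same failure mode on mathematically equivalent text.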
Why retrieval quality changes the game
The model DeepSeek-V3.2-Speciale improved by up to 12 percentage points when provided with high-quality retrieval of related problems. This suggests that future progress in mathematical AI will come not only from larger models but also from better embedding architectures tailored to mathematical semantics. Classical text embeddings are trained on general corpora, where two problems about Diophantine equations can look very different if they use different notation or languages. The need for specialized mathematical embeddings opens new research directions, and MathNet provides a standardized set of equivalent-problem pairs for evaluating them.

The dataset and benchmark are publicly available at mathnet.mit.edu under a Creative Commons BY 4.0 license. The authors are Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, and Antonio Torralba, and they expect an active community to maintain and expand the dataset.
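The retrieval-augmented setup the benchmark measures can be sketched in a few lines. This is a hypothetical illustration, not the paper's pipeline: rank a small corpus of solved problems against the query, keep the top k, and prepend them as worked examples to the prompt sent to the solver model. The `embed`, `retrieve`, and `build_prompt` names and the toy corpus are all assumptions for the sketch.

```python
import math
from collections import Counter

def embed(text):
    # Placeholder embedding: bag-of-words counts. A real system would use
    # a learned (ideally math-specialized) embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank solved problems by similarity to the query and keep the top k.
    qv = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(qv, embed(p["problem"])), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus, k=2):
    # Prepend retrieved worked examples to the new problem: the basic
    # shape of retrieval-augmented problem solving.
    examples = retrieve(query, corpus, k)
    parts = [f"Problem: {e['problem']}\nSolution: {e['solution']}" for e in examples]
    return "\n\n".join(parts) + "\n\nNow solve:\n" + query

# Toy corpus of solved problems (illustrative, not from MathNet).
corpus = [
    {"problem": "Find all integer solutions of x^2 - 2y^2 = 1",
     "solution": "This is a Pell equation; infinitely many solutions generated from (3, 2)."},
    {"problem": "Prove the sum of two odd numbers is even",
     "solution": "(2a + 1) + (2b + 1) = 2(a + b + 1)."},
]

prompt = build_prompt("Solve the Pell equation x^2 - 2y^2 = 1 over the integers", corpus, k=1)
```

The benchmark's finding is precisely that the quality of the `retrieve` step dominates: with a weak embedding, the wrong examples get prepended and the downstream solver gains little.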
This article was generated using artificial intelligence from primary sources.