MathNet: 30,676 olympiad problems from 47 countries, SOTA models still fall short
Why it matters
An MIT team published MathNet, a multimodal benchmark of 30,676 olympiad math problems from 47 countries and 17 languages. Gemini-3.1-Pro achieves 78.4%, GPT-5 reaches 69.3%, and embedding models struggle to retrieve mathematically equivalent problems.
An MIT research team led by Shaden Alshammari published MathNet, the largest multimodal benchmark of olympiad math problems to date. The paper was accepted at the ICLR 2026 conference.
What MathNet brings
MathNet contains 30,676 problems with expert-written solutions, collected from two decades of mathematics olympiads across 47 countries and 17 languages. The dataset is multimodal: it includes both the textual formulations and the diagrams, graphs, and geometric sketches that are unavoidable in olympiad mathematics. The benchmark evaluates three tasks: problem solving, mathematical retrieval accuracy, and retrieval-augmented problem solving. For the last two tasks, the researchers manually curated pairs of problems that are mathematically equivalent but formulated in structurally different ways.
Results from current models
SOTA models still show significant gaps. Gemini-3.1-Pro achieves 78.4% accuracy, while GPT-5 reaches 69.3%. These are impressive numbers for complex olympiad problems, but they confirm that top-level mathematical reasoning is not yet solved. An interesting finding is that embedding models (which map text into numerical vectors for similarity search) struggle to find mathematically equivalent problems when they are phrased with different vocabulary. This matters because retrieval-augmented approaches depend directly on retrieval quality.
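To illustrate why surface vocabulary matters for retrieval, here is a minimal, self-contained sketch. It uses plain bag-of-words cosine similarity, not any model from the paper: two formulations of the same Diophantine problem share almost no tokens, so their measured similarity is near zero even though they are mathematically equivalent.

```python
import math
from collections import Counter

def bow_vector(text):
    # Simple bag-of-words embedding stand-in: lowercase tokens and count them.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two phrasings of the same underlying problem (illustrative examples,
# not taken from the MathNet dataset).
p1 = "Find all integer solutions of x^2 - 2y^2 = 1"
p2 = "Determine every pair of whole numbers a b with a squared minus twice b squared equal to one"

sim = cosine(bow_vector(p1), bow_vector(p2))
print(f"similarity: {sim:.3f}")  # low despite mathematical equivalence
```

Learned embeddings do much better than this caricature, but the MathNet retrieval results suggest that even strong general-purpose embedding models suffer a version of the same failure mode on mathematically equivalent text.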
Why retrieval quality changes the game
The model DeepSeek-V3.2-Speciale improved by up to 12 percentage points when provided with high-quality retrieval of related problems. This suggests that future progress in mathematical AI will come not only from larger models but also from better embedding architectures tailored to mathematical semantics. Classical text embeddings are trained on general corpora, where two problems about Diophantine equations can look very different if they use different notation or languages. The need for specialized mathematical embeddings opens new research directions, and MathNet provides a standardized set of equivalent-problem pairs for evaluating them.

The dataset and benchmark are publicly available at mathnet.mit.edu under a Creative Commons BY 4.0 license. The authors are Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, and Antonio Torralba, and they expect an active community to maintain and expand the dataset.
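The retrieval-augmented setup the benchmark measures can be sketched in a few lines. This is a hypothetical illustration, not the paper's pipeline: rank a small corpus of solved problems against the query, keep the top k, and prepend them as worked examples to the prompt sent to the solver model. The `embed`, `retrieve`, and `build_prompt` names and the toy corpus are all assumptions for the sketch.

```python
import math
from collections import Counter

def embed(text):
    # Placeholder embedding: bag-of-words counts. A real system would use
    # a learned (ideally math-specialized) embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank solved problems by similarity to the query and keep the top k.
    qv = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(qv, embed(p["problem"])), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus, k=2):
    # Prepend retrieved worked examples to the new problem: the basic
    # shape of retrieval-augmented problem solving.
    examples = retrieve(query, corpus, k)
    parts = [f"Problem: {e['problem']}\nSolution: {e['solution']}" for e in examples]
    return "\n\n".join(parts) + "\n\nNow solve:\n" + query

# Toy corpus of solved problems (illustrative, not from MathNet).
corpus = [
    {"problem": "Find all integer solutions of x^2 - 2y^2 = 1",
     "solution": "This is a Pell equation; infinitely many solutions generated from (3, 2)."},
    {"problem": "Prove the sum of two odd numbers is even",
     "solution": "(2a + 1) + (2b + 1) = 2(a + b + 1)."},
]

prompt = build_prompt("Solve the Pell equation x^2 - 2y^2 = 1 over the integers", corpus, k=1)
```

The benchmark's finding is precisely that the quality of the `retrieve` step dominates: with a weak embedding, the wrong examples get prepended and the downstream solver gains little.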
This article was generated using artificial intelligence from primary sources.