arXiv:2605.19762: ICML 2026 paper claims code alone does not improve LLM mathematical reasoning
An arXiv preprint accepted at ICML 2026 reports, on the basis of controlled pre-training experiments, that executable code does not by itself improve the general reasoning capabilities of LLMs: code strongly improves programming but competes with mathematical tasks under standard pre-training mixtures. According to the paper, real progress in mathematics comes from cross-domain structured reasoning traces (code-text and math-text mixtures), and a mechanistic analysis of Mixture-of-Experts models shows these interactions in expert activation patterns.
This article was generated using artificial intelligence from primary sources.
The preprint arXiv:2605.19762, accepted at ICML 2026, uses controlled pre-training experiments to challenge a widespread assumption in the LLM community: that adding code to training data automatically improves a model’s general reasoning capabilities.
What is the main claim?
The researchers trained multiple variants of the same model on controlled mixtures of pre-training data, varying the proportions of code, plain text, and structured mathematical proofs. The results show that pure code strongly improves programming but not general mathematical reasoning. Moreover, under standard pre-training mixtures, code and mathematics compete for the same model capacity, so increasing the share of code can actually reduce performance on hard mathematical tasks.
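To make the idea of a controlled mixture under a fixed budget concrete, here is a minimal Python sketch. It is not the paper's pipeline: the function names, the domain labels ("code", "text", "math_traces"), and every number are hypothetical and chosen only for illustration. It splits a fixed token budget across domains and generates a sweep in which only the code share changes while plain text absorbs the difference.

```python
def allocate_tokens(weights: dict[str, int], total_tokens: int) -> dict[str, int]:
    """Split a fixed token budget across domains in proportion to integer weights."""
    if total_tokens < 0 or not weights:
        raise ValueError("need a non-negative budget and at least one domain")
    if any(w < 0 for w in weights.values()) or sum(weights.values()) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weight_sum = sum(weights.values())
    # Integer arithmetic avoids floating-point drift on very large budgets.
    counts = {d: total_tokens * w // weight_sum for d, w in weights.items()}
    # Flooring can leave a few tokens unassigned; give them to the
    # largest-weight domains so the budget is matched exactly.
    leftover = total_tokens - sum(counts.values())
    for domain in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        counts[domain] += 1
    return counts


def code_share_sweep(code_percents, math_percent):
    """Yield mixtures (in percent) that vary the code share; plain text absorbs the rest."""
    for code_percent in code_percents:
        text_percent = 100 - code_percent - math_percent
        if text_percent < 0:
            raise ValueError("code and math shares exceed 100 percent")
        yield {"code": code_percent, "text": text_percent, "math_traces": math_percent}


if __name__ == "__main__":
    budget = 1_000_000  # hypothetical token budget, identical for every variant
    for mixture in code_share_sweep([10, 30, 50], math_percent=10):
        print(mixture, allocate_tokens(mixture, budget))
```

Because every variant receives the same total number of tokens, any change in benchmark scores between variants can be attributed to the composition of the data rather than to its amount.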
What does the mechanistic analysis of Mixture-of-Experts models reveal?
The team tracked routing activations in Mixture-of-Experts (MoE) models, that is, which experts are activated for which types of tasks. In models trained with standard mixtures, a negative interaction emerged between coding and mathematical experts. The remedy is cross-domain structured traces (code-text and math-text mixtures), which produce synergistic activation patterns instead of competitive allocation.
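One simple way to look at such routing data is sketched below in Python. This is an illustrative assumption, not the paper's method: the log format, the function names, and the use of histogram intersection as an overlap measure are all hypothetical. The sketch counts how often each expert is selected for tokens from each task type, normalizes the counts into a usage distribution, and measures how much two task types share the same experts.

```python
from collections import Counter


def expert_usage(routing_log, num_experts):
    """Return, per task type, the fraction of routing decisions going to each expert.

    routing_log: iterable of (task_type, expert_ids), one entry per token, where
    expert_ids lists the experts the router selected for that token (hypothetical format).
    """
    counts = {}
    for task_type, expert_ids in routing_log:
        for expert_id in expert_ids:
            if not 0 <= expert_id < num_experts:
                raise ValueError(f"expert id {expert_id} out of range")
        counts.setdefault(task_type, Counter()).update(expert_ids)
    usage = {}
    for task_type, counter in counts.items():
        total = sum(counter.values())
        if total == 0:
            continue  # no routing decisions were recorded for this task type
        usage[task_type] = [counter[e] / total for e in range(num_experts)]
    return usage


def usage_overlap(p, q):
    """Histogram intersection: 0.0 means disjoint experts, 1.0 means identical usage."""
    return sum(min(a, b) for a, b in zip(p, q))


if __name__ == "__main__":
    # Toy log with top-2 routing over 4 experts; the values are made up.
    log = [
        ("code", [0, 1]),
        ("code", [0, 2]),
        ("math", [0, 3]),
        ("math", [3, 1]),
    ]
    usage = expert_usage(log, num_experts=4)
    print(usage_overlap(usage["code"], usage["math"]))
```

A high overlap between the code and math distributions only indicates that the two task types rely on the same experts. Whether that sharing is competitive or synergistic has to be judged together with downstream benchmark results, as the article describes.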
What are the practical implications for pre-training labs?
The recommendation is to increase the share of structured mathematical traces (pure-text proofs, step-by-step solutions, math-text mixtures) within a fixed pre-training budget. The team reports significant gains on hard mathematical benchmarks while retaining programming capabilities. This is relevant for labs training frontier models, such as Anthropic, OpenAI, Google DeepMind, Meta, Mistral, DeepSeek, and Qwen, and may influence their next pre-training recipes.
Frequently Asked Questions
- What is the main claim of the paper?
- The paper claims that simply adding code to pre-training improves programming ability but not general mathematical reasoning. Real progress in mathematics requires structured reasoning traces that combine code and text or mathematics and text: cross-domain mixing, not pure code.
- What does the mechanistic analysis show?
- In Mixture-of-Experts models, the researchers tracked routing activations, that is, which experts are activated for which types of tasks. They found that coding and mathematical experts partly compete for the same model capacity, which explains the negative interaction in standard pre-training.
- What is the practical recommendation?
- The team recommends increasing the share of structured mathematical traces (pure-text proofs, step-by-step solutions) within a fixed pre-training budget. The reported result is significant gains on hard mathematical benchmarks while retaining programming capabilities.