arXiv:2606.26935: CoT training gains land in stronger action prediction, not deeper agent reasoning
A study by Jingyu Liu and colleagues (arXiv:2606.26935) reports that gains from chain-of-thought (CoT) training in LLM agents show up as stronger direct action prediction rather than as a larger advantage from reasoning. Later checkpoints revise their initial action less often after generating a CoT, and masking supervision over action tokens on a subset of training examples improves out-of-domain generalization.
This article was generated using artificial intelligence from primary sources.
Where do CoT training gains actually land?
The study, titled "Where Do CoT Training Gains Land in LLM based Agents?" (arXiv:2606.26935, Jingyu Liu and colleagues, submitted June 25, 2026), argues that chain-of-thought training gains land in direct action prediction rather than in deeper reasoning. Chain-of-thought (CoT) is a technique in which the model generates reasoning steps before its final decision. The authors compare prompt actions (actions predicted directly, without CoT) with CoT actions across training checkpoints.
Checkpoint comparison method
The quality of prompt actions grew substantially during training, while the relative advantage of CoT over direct prediction remained stable. In other words, CoT training did not widen the chain-of-thought advantage; it improved the model’s ability to predict the correct action directly. At later checkpoints, models became less likely to revise the action after generating a CoT, indicating increasing reliance on the initial estimate.
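The comparison described above can be sketched as three per-checkpoint quantities: accuracy of prompt actions, accuracy of CoT actions, and how often the CoT action differs from the prompt action. This is a minimal illustration, not the authors' evaluation code; the `Episode` fields, the `checkpoint_metrics` helper, the exact-match scoring, and the toy data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluated step (hypothetical schema, not from the paper)."""
    gold_action: str    # reference action
    prompt_action: str  # action predicted directly, without CoT
    cot_action: str     # action predicted after generating a CoT


def checkpoint_metrics(episodes):
    """Summarize one checkpoint using exact-match scoring (an assumption)."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    prompt_acc = sum(e.prompt_action == e.gold_action for e in episodes) / n
    cot_acc = sum(e.cot_action == e.gold_action for e in episodes) / n
    # Revision: the CoT action differs from the direct (prompt) action.
    revision_rate = sum(e.cot_action != e.prompt_action for e in episodes) / n
    return {
        "prompt_acc": prompt_acc,
        "cot_acc": cot_acc,
        "cot_advantage": cot_acc - prompt_acc,
        "revision_rate": revision_rate,
    }


# Made-up toy data for two checkpoints; these are not results from the study.
EARLY = [
    Episode("open", "open", "open"),
    Episode("click", "type", "click"),
    Episode("type", "click", "open"),
    Episode("back", "open", "open"),
]
LATE = [
    Episode("open", "open", "open"),
    Episode("click", "click", "click"),
    Episode("type", "click", "type"),
    Episode("back", "open", "open"),
]

if __name__ == "__main__":
    for name, episodes in (("early", EARLY), ("late", LATE)):
        print(name, checkpoint_metrics(episodes))
```

On this made-up data, prompt accuracy rises from 0.25 to 0.5, the CoT advantage stays at 0.25, and the revision rate falls from 0.5 to 0.25, which mirrors the qualitative pattern the article describes.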
Supervision masking intervention
The authors test an intervention: masking supervision over action tokens on a subset of examples during training. This change improved out-of-domain generalization. The finding challenges the widespread assumption that CoT training teaches models to reason better through a problem; instead, the model becomes more reliable at predicting the outcome directly.
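One way to picture this intervention is as a change to the training labels: for a randomly chosen subset of examples, the labels of action tokens are replaced with an ignore value so that they contribute no loss. The sketch below is an assumption-laden illustration, not the authors' implementation; the `mask_action_labels` helper, the `mask_prob` parameter, the per-example random selection, and the token ids are hypothetical. The value -100 follows the common convention of PyTorch's cross-entropy loss, whose default `ignore_index` is -100.

```python
import random

# Label value that cross-entropy losses commonly skip
# (PyTorch's CrossEntropyLoss uses ignore_index=-100 by default).
IGNORE_INDEX = -100


def mask_action_labels(labels, is_action, mask_prob, rng):
    """Return a copy of `labels` for one training example.

    With probability `mask_prob` the example is selected, and every position
    flagged in `is_action` gets IGNORE_INDEX, removing supervision on the
    action tokens. Otherwise the labels are returned unchanged.
    """
    if len(labels) != len(is_action):
        raise ValueError("labels and is_action must have the same length")
    if rng.random() >= mask_prob:
        return list(labels)  # example not selected: keep full supervision
    return [IGNORE_INDEX if action else tok
            for tok, action in zip(labels, is_action)]


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [11, 12, 13, 21, 22]                   # hypothetical token ids
    is_action = [False, False, False, True, True]   # last two are action tokens
    print(mask_action_labels(labels, is_action, mask_prob=0.5, rng=rng))
```

Under this reading, a selected example still supervises its reasoning tokens while leaving the action tokens unsupervised; the fraction of masked examples and how they are chosen are not specified in this summary.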
Frequently Asked Questions
- What is CoT (chain of thought)?
- Chain-of-thought (CoT) is a technique in which the model generates reasoning steps before the final action or answer.
- What does the study reveal about CoT training?
- That training gains primarily strengthen direct action prediction, while the advantage of CoT over direct prediction does not grow during training.