arXiv:2606.26935: CoT training gains land in stronger action prediction, not deeper agent reasoning

A study by Jingyu Liu and colleagues (arXiv:2606.26935) finds that the gains from chain-of-thought (CoT) training in LLM agents show up as stronger direct action prediction rather than as a larger reasoning advantage. At later checkpoints the models revise their initial action less often, and masking supervision over action tokens improves out-of-domain generalization.

Where do CoT training gains actually land?

The study, titled "Where Do CoT Training Gains Land in LLM based Agents?" (arXiv:2606.26935, Jingyu Liu and colleagues, submitted June 25, 2026), argues that chain-of-thought training gains land in direct action prediction rather than in deeper reasoning. CoT (chain of thought) is a technique in which the model generates reasoning steps before committing to a final decision. The authors compare prompt actions (actions predicted without CoT) with CoT actions across training checkpoints.

Checkpoint comparison method

The quality of prompt actions grew substantially during training, while the relative advantage of CoT over direct prediction remained stable. In other words, CoT training did not widen the chain-of-thought advantage; it improved the model’s ability to guess the correct action directly. At later checkpoints, models became less likely to revise the action after generating CoT, which indicates growing reliance on the initial estimate.
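The comparison described above can be sketched as three per-checkpoint quantities: accuracy of the prompt (direct) action, accuracy of the CoT action, and how often the CoT action differs from the direct one. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the `Episode` record, the `checkpoint_metrics` function, and exact string matching against a gold action are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """One evaluated decision step (hypothetical record layout)."""
    gold_action: str    # reference action for this step
    direct_action: str  # action predicted without CoT (the "prompt action")
    cot_action: str     # action produced after generating CoT


def checkpoint_metrics(episodes: List[Episode]) -> Dict[str, float]:
    """Summarize one checkpoint; assumes exact-match scoring of actions."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    direct_acc = sum(e.direct_action == e.gold_action for e in episodes) / n
    cot_acc = sum(e.cot_action == e.gold_action for e in episodes) / n
    # A "revision" here means the CoT action differs from the direct action.
    revision_rate = sum(e.cot_action != e.direct_action for e in episodes) / n
    return {
        "direct_acc": direct_acc,
        "cot_acc": cot_acc,
        "cot_advantage": cot_acc - direct_acc,
        "revision_rate": revision_rate,
    }
```

Computed at each checkpoint, the pattern the article reports would appear as `direct_acc` rising, `cot_advantage` staying roughly flat, and `revision_rate` falling over training.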

Supervision masking intervention

The authors test an intervention: masking supervision over action tokens on a subset of examples during training. This change improved out-of-domain generalization. The finding challenges the widespread assumption that CoT training teaches models to reason better through a problem. Instead, the model simply guesses the outcome more reliably.
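One plausible way to implement such masking is to replace the label positions of the action tokens with an ignore value so they contribute nothing to the loss, and to do so only for a randomly chosen share of examples. The sketch below assumes that setup; the article does not describe the authors' actual implementation. The function name `mask_action_labels`, the `action_span` argument, and the `mask_prob` parameter are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random
from typing import List, Tuple

# Label value that cross-entropy losses commonly skip
# (the default ignore_index in PyTorch's CrossEntropyLoss).
IGNORE_INDEX = -100


def mask_action_labels(
    labels: List[int],
    action_span: Tuple[int, int],
    mask_prob: float,
    rng: random.Random,
) -> List[int]:
    """Return a copy of `labels` with the action tokens possibly masked.

    `action_span` is a half-open (start, end) range of positions holding
    the action tokens. With probability `mask_prob` those positions are
    set to IGNORE_INDEX, so only the reasoning tokens are supervised for
    that example. Both parameters are assumptions of this sketch.
    """
    start, end = action_span
    if not 0 <= start <= end <= len(labels):
        raise ValueError("action_span must lie within labels")
    masked = list(labels)
    if rng.random() < mask_prob:
        for i in range(start, end):
            masked[i] = IGNORE_INDEX
    return masked
```

Drawing the decision per example keeps action supervision on the remaining examples, which matches the article's description of masking on only a subset.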

Frequently Asked Questions

What is CoT (chain of thought)?

CoT (chain of thought) is a technique in which the model generates reasoning steps before the final action or answer.

What does the study reveal about CoT training?

It shows that training gains primarily strengthen direct action prediction, while the advantage of CoT over direct prediction does not grow during training.
