arXiv:2606.26502: reasoning models spend more tokens on tasks they fail, opposite to humans who disengage
A study by Han-yu Wang (arXiv:2606.26502) finds that large reasoning models (LRMs) spend more tokens on tasks they ultimately get wrong than on those they solve correctly, the opposite of humans, who disengage on tasks they get wrong. The gap is large (Cohen's d 1.47–3.13 on the H-ARC benchmark), and all five tested models showed the opposite pattern to humans.
This article was generated using artificial intelligence from primary sources.
Why don’t models disengage when they are wrong?
The study, titled Humans Disengage, Reasoning Models Persist (arXiv:2606.26502, Han-yu Wang, submitted June 25, 2026), shows that large reasoning models (LRMs), that is, models that generate long reasoning chains, spend more tokens on tasks they ultimately get wrong than on those they solve correctly. Humans do the opposite: they spend less time on tasks they get wrong because they disengage.
Registering difficulty versus allocating effort
The author separates two mechanisms: registration (how response time correlates with difficulty across different tasks) and allocation (whether effort increases on misses or on hits). Humans and LRMs register difficulty similarly across tasks but diverge within the same task. The within-task gap is large: Cohen's d (a measure of effect size) ranges from 1.47 to 3.13 on the H-ARC benchmark, and all five tested models showed the opposite pattern to humans.
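To make the effect-size figure concrete, here is a minimal Python sketch of Cohen's d for two groups of token counts, using the common pooled-standard-deviation form. The function name `cohens_d` and the token counts are made up for illustration; they are not data from the paper, and the paper may use a different variant of the statistic.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical token counts (illustration only, not from the study).
tokens_wrong = [5200, 6100, 4800, 7000, 5600]
tokens_right = [3100, 2900, 3600, 2500, 3300]

d = cohens_d(tokens_wrong, tokens_right)
print(f"Cohen's d = {d:.2f}")
```

A positive d here means the incorrect answers used more tokens on average; by the usual rule of thumb, values above about 0.8 already count as large.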
What this means for inference efficiency
The dissociation holds across multiple datasets and with task fixed effects, which rules out task difficulty alone as the explanation. The interpretation is that LRMs, driven by uncertainty, extend their reasoning chains precisely when the probability of failure rises. The practical consequence is that a longer response is not a reliable signal of correctness; it may instead be a sign that the model is stuck on a problem.
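One way to see why controlling for the task matters is to subtract each task's own mean token count before comparing wrong and correct attempts, so that differences between easy and hard tasks cancel out. The sketch below is a rough stand-in for a fixed-effects analysis, not the paper's actual method; the function `within_task_gap`, the record layout, and all numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def within_task_gap(records):
    """Mean token residual on wrong attempts minus that on correct attempts,
    after subtracting each task's own mean token count."""
    by_task = defaultdict(list)
    for task_id, _correct, tokens in records:
        by_task[task_id].append(tokens)
    task_mean = {task_id: mean(vals) for task_id, vals in by_task.items()}

    # Residuals measure effort relative to the typical effort on the same task.
    wrong = [tok - task_mean[t] for t, ok, tok in records if not ok]
    right = [tok - task_mean[t] for t, ok, tok in records if ok]
    return mean(wrong) - mean(right)


# Hypothetical attempts: (task_id, answered_correctly, tokens_used).
records = [
    ("easy", True, 1000), ("easy", True, 1200), ("easy", False, 1700),
    ("hard", True, 4000), ("hard", False, 5200), ("hard", False, 5000),
]

gap = within_task_gap(records)
print(f"within-task gap (wrong minus correct): {gap:.0f} tokens")
```

If the gap stays positive after this adjustment, the extra tokens on failures cannot be explained just by failures clustering on harder tasks.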
Frequently Asked Questions
- What is the key difference between humans and reasoning models?
- Humans disengage and spend less time on tasks they get wrong, while reasoning models extend their reasoning chain precisely when the probability of failure is higher.
- What is Cohen's d?
- Cohen's d is a measure of effect size; values of 1.47–3.13 indicate a very large gap between token use on correct and incorrect answers.