arXiv:2606.26502: models keep going when wrong

A study by Han-yu Wang (arXiv:2606.26502) finds that large reasoning models (LRMs) spend more tokens on tasks they ultimately get wrong than on those they solve correctly. Humans do the opposite: they disengage and spend less time on tasks they get wrong. The gap is large (Cohen's d of 1.47–3.13 on the H-ARC benchmark), and all five tested models showed the reverse of the human pattern.

Why don’t models disengage when they are wrong?

The study, titled "Humans Disengage, Reasoning Models Persist" (arXiv:2606.26502, Han-yu Wang, submitted June 25, 2026), shows that large reasoning models (LRMs), meaning models that generate long reasoning chains, spend more tokens on tasks they ultimately get wrong than on those they solve correctly. Humans do the opposite: they spend less time on tasks they get wrong because they disengage.

Registering difficulty versus allocating effort

The author separates two mechanisms: registration (how closely response time tracks difficulty across different tasks) and allocation (whether, within a task, more effort goes into misses or into hits). Humans and LRMs register difficulty across tasks in a similar way, but they diverge in how they allocate effort within the same task. The gap is large: Cohen’s d (a measure of effect size) ranges from 1.47 to 3.13 on the H-ARC benchmark, and all five tested models showed the reverse of the human pattern.
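The allocation comparison can be illustrated with a minimal sketch. This is not the paper's analysis code: the record layout, the `allocation_gap` helper, and the token counts are all hypothetical, chosen only to show what "within the same task" means. Each attempt's token count is centered on its task's mean, so differences in difficulty between tasks drop out before misses and hits are compared.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (task_id, tokens_used, solved).
# The numbers are invented for illustration, not taken from the paper.
records = [
    ("t1", 100, True), ("t1", 140, False),
    ("t2", 300, True), ("t2", 380, False),
    ("t3", 200, True), ("t3", 260, False), ("t3", 180, True),
]

def allocation_gap(records):
    """Mean within-task token deviation of misses minus that of hits.

    Each token count is centered on its own task's mean, so a positive
    result means more tokens went into failed attempts than into
    successful ones on the same task.
    Requires at least one hit and one miss overall.
    """
    by_task = defaultdict(list)
    for task, tokens, solved in records:
        by_task[task].append((tokens, solved))

    hit_dev, miss_dev = [], []
    for attempts in by_task.values():
        task_mean = mean(tokens for tokens, _ in attempts)
        for tokens, solved in attempts:
            (hit_dev if solved else miss_dev).append(tokens - task_mean)

    return mean(miss_dev) - mean(hit_dev)

gap = allocation_gap(records)
print(f"within-task gap (misses minus hits): {gap:.1f} tokens")
```

On this toy data the gap is positive, which is the direction the study reports for LRMs; a negative gap would correspond to the human pattern of disengaging on misses.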

What this means for inference efficiency

The dissociation holds across multiple datasets and when controlling for task fixed effects, which rules out the explanation that it simply reflects task difficulty. The author's interpretation is that LRMs, driven by uncertainty, extend their reasoning chain precisely when the probability of failure rises. The practical consequence is that a longer response is not a reliable signal of correctness. It may instead be a sign that the model is stuck on the problem.

Frequently Asked Questions

What is the key difference between humans and reasoning models?

Humans disengage and spend less time on tasks they get wrong, while reasoning models extend their reasoning chain precisely when the probability of failure is higher.

What is Cohen's d?

Cohen's d is a measure of effect size; values of 1.47–3.13 indicate a very large gap between token use on correct and incorrect answers.
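For readers who want to see how such a value is computed, here is a minimal sketch of Cohen's d using the common pooled-standard-deviation form. The summary above does not say which variant the paper uses, so the formula choice is an assumption, and the `cohens_d` function name and sample numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d for two independent samples, pooled-SD variant (assumed).

    d = (mean(a) - mean(b)) / pooled standard deviation.
    Each sample needs at least two values.
    """
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Invented token counts (in thousands) for failed and solved attempts.
tokens_failed = [2, 4, 6]
tokens_solved = [1, 3, 5]
print(cohens_d(tokens_failed, tokens_solved))
```

A d of 1.0 means the two group means sit one pooled standard deviation apart, so the reported range of 1.47–3.13 corresponds to means roughly one and a half to three standard deviations apart.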
