The reasoning structure of large language models

Researchers at ETH Zurich have presented a benchmark of logic puzzles and a pipeline that turns reasoning traces into measurable graphs of claims and dependencies. A new metric quantifies reasoning efficiency and reveals differences that accuracy and token count alone do not capture.

Reasoning models today are mostly evaluated on two numbers: whether they gave the correct answer and how many tokens they used. A team at ETH Zurich (Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, and Roger Wattenhofer) argues that those two numbers hide a great deal, and offers a tool that examines the structure of the reasoning itself.

How is the structure of reasoning measured?

The authors build a scalable benchmark of logic puzzles and a pipeline that turns an unstructured reasoning trace (the sequence of a model’s thinking steps) into a verifiable graph. In that graph, the nodes are individual claims and the edges are logical dependencies between them. This makes it visible whether the model builds a tidy, connected argument or wanders into dead ends, that is, claims that nothing later in the trace builds on.
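To make the idea concrete, here is a minimal Python sketch of such a graph. It is an illustration only, not the authors' pipeline: the claim ids, the claim texts, the `deps` field, and the `ancestors` helper are all hypothetical, and the graph is written by hand rather than extracted from a real trace. A claim counts as a dead end here if the final answer does not depend on it, directly or indirectly.

```python
# Hypothetical hand-written reasoning graph for a three-house puzzle.
# Each claim lists the ids of the claims it logically depends on.
claims = {
    "c1": {"text": "Alice does not live in the red house", "deps": []},
    "c2": {"text": "Bob lives in the blue house", "deps": []},
    "c3": {"text": "Alice lives in the green house", "deps": ["c1", "c2"]},
    "c4": {"text": "Carol lives in the red house", "deps": ["c2", "c3"]},
    "c5": {"text": "Bob and Alice have different houses", "deps": ["c2", "c3"]},
}
final_answer = "c4"


def ancestors(graph, target):
    """Return the target plus every claim it transitively depends on."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node]["deps"])
    return seen


# Claims the answer rests on, and claims that lead nowhere.
supporting = ancestors(claims, final_answer)
dead_ends = sorted(set(claims) - supporting)

print(sorted(supporting))  # ['c1', 'c2', 'c3', 'c4']
print(dead_ends)           # ['c5']
```

In this toy trace, claim `c5` is valid but unused: the model derived it and never built on it, which is the kind of behavior a graph view exposes and a token count does not.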

What does the new metric reveal?

Alongside the graph, the authors introduce a metric that quantifies reasoning efficiency — how concentrated the model’s logical flow is. The main finding of their analysis over open-source systems is that structural measurements separate behaviors that accuracy and token count merge into one. Two models can have the same accuracy and similar response length, yet entirely different reasoning structures.
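The article does not give the formula for the authors' metric, so the following Python sketch uses a stand-in chosen purely for illustration: the fraction of claims in a trace that the final answer transitively depends on. The function name `supporting_fraction` and both example traces are hypothetical. The point is only to show how two traces with the same answer and the same number of claims can still differ in structure.

```python
def supporting_fraction(deps, answer):
    """Illustrative stand-in, NOT the paper's metric: the share of claims
    that the answer transitively depends on (the answer itself included)."""
    seen, stack = set(), [answer]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(deps[node])
    return len(seen) / len(deps)


# Two hypothetical traces: each maps a claim id to the claims it depends on.
# Both contain five claims and end in the same "answer" node.
focused = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "answer": ["d"]}
wandering = {"a": [], "b": ["a"], "c": ["a"], "d": ["c"], "answer": ["b"]}

print(supporting_fraction(focused, "answer"))    # 1.0
print(supporting_fraction(wandering, "answer"))  # 0.6
```

Under this toy measure, every claim in the first trace feeds the answer, while two of the five claims in the second trace are a side branch the answer never uses.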

Why does this matter?

The practical value lies in diagnostics. The structural view makes it possible to identify the ways a model goes wrong and to compare how reasoning changes with puzzle difficulty. The paper was accepted at ICML 2026 and also presented at an ICLR 2026 workshop on reasoning in large models, suggesting growing community interest in measuring how models think, not just what they answer.

Frequently Asked Questions

What is the reasoning graph in this paper?

A graph in which the nodes are the model's claims and the edges are logical dependencies between them, obtained by converting an unstructured reasoning trace.

Why are accuracy and token count not enough?

Structural measurements separate behaviors that accuracy and token count merge, revealing how the model builds an argument, not just whether it reaches a solution.

arXiv:2606.03883: What does the reasoning structure of large language models actually look like?
