LLMs Learn the Shortest Path on Graphs — But Fail When the Task Horizon Grows
Why it matters
A new arXiv paper systematically investigates LLM generalization on the shortest-path problem across two dimensions: spatial transfer to unseen maps works well, but horizon-length scaling consistently fails due to recursive instability. The findings have direct implications for autonomous agents — training data coverage defines the boundary of capability, RL improves stability but does not extend that boundary, and inference-time scaling helps but does not solve the length-scaling problem.
What Was Tested?
A research team of Tong, Ye, Borovykh, and Shokri published a paper on arXiv that systematically analyzes whether an LLM can achieve systematic generalization on a classic algorithmic problem — finding the shortest path in a graph. Testing covered two independent dimensions of generalization:
- Spatial transfer — can a model trained on one set of graphs solve problems on unseen maps with different topologies?
- Horizon scaling — can a model trained on shorter paths (say, 5–10 steps) correctly solve longer paths (50+ steps)?
This methodology is deliberately broader than standard benchmark types — it does not only measure whether questions are new, but whether the structural demands are harder than what the model saw during training.
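The paper's exact data-generation pipeline is not reproduced here, but a minimal sketch of what such a benchmark setup could look like is simple to write: random graphs, breadth-first search as the ground-truth shortest-path oracle, and a train/test split by path length. The graph generator and the 5-hop split threshold below are illustrative assumptions, not details from the paper.

```python
from collections import deque
import random

def bfs_shortest_path(adj, start, goal):
    """Ground-truth shortest path via breadth-first search."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # no path exists

def random_graph(n_nodes, n_extra_edges, rng):
    """Random undirected graph as an adjacency dict.
    A shuffled spanning chain guarantees connectivity."""
    adj = {i: set() for i in range(n_nodes)}
    nodes = list(range(n_nodes))
    rng.shuffle(nodes)
    for a, b in zip(nodes, nodes[1:]):
        adj[a].add(b); adj[b].add(a)
    for _ in range(n_extra_edges):
        a, b = rng.sample(range(n_nodes), 2)
        adj[a].add(b); adj[b].add(a)
    return adj

rng = random.Random(0)
g = random_graph(20, 10, rng)
path = bfs_shortest_path(g, 0, 19)
# Split by hop count: short paths go to training, long ones to a
# held-out set — this is the "horizon scaling" axis of the study.
horizon_split = "train" if len(path) - 1 <= 5 else "test"
```

Spatial transfer would then correspond to holding out entire graph topologies rather than path lengths.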
What Did They Find?
The results are consistent and noteworthy:
- Spatial transfer: successful. Models that learn to find paths in a set of graphs successfully generalize to unseen topologies of the same size. This means "learning an algorithm" is possible to some degree.
- Length scaling: consistent failure. When path length extends beyond the training range, models fail due to recursive instability — small errors in one step accumulate exponentially by the end.
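The compounding effect behind recursive instability can be illustrated with back-of-envelope arithmetic (an illustrative model, not the paper's analysis): if each step succeeds independently with probability p, whole-path success decays geometrically with path length.

```python
def path_success_prob(per_step_accuracy: float, n_steps: int) -> float:
    """Probability of a fully correct path, assuming each step
    succeeds independently with the same per-step accuracy."""
    return per_step_accuracy ** n_steps

# Even 99% per-step accuracy collapses on long horizons:
print(round(path_success_prob(0.99, 10), 3))   # 0.904
print(round(path_success_prob(0.99, 100), 3))  # 0.366
```

Under this toy model, a 50-step path would need roughly 99.9% per-step reliability to succeed most of the time — which is why small per-step errors dominate at long horizons.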
Additionally, three interventions were tested:
- Data coverage defines the capability boundary — a model can only handle what its training data covered; extrapolating beyond that coverage does not work.
- RL (reinforcement learning) improves stability within the training range, but does not extend generalization boundaries.
- Inference-time scaling (larger token budget, chain-of-thought) helps, but does not solve the fundamental length-scaling problem.
Why Does This Matter for Autonomous Agents?
Many practical agent tasks require a long horizon: multi-step planning, research, software engineering projects that span days, iterative debugging of complex systems. If LLMs structurally cannot scale with length — which this paper suggests — then agent autonomy is fundamentally bounded by the size of problems seen in training.
This aligns with earlier findings (e.g., the LongCoT benchmark, where GPT scores 9.8% on long chain-of-thought reasoning): even the seemingly strongest models collapse as the problem grows longer.
What Does This Mean in Practice?
The researchers do not claim the problem is unsolvable, but they identify three hard constraints:
- Synthetic dataset coverage must explicitly include long paths — otherwise the model will never learn how to handle them.
- RL and inference-time scaling are not magic wands — they improve what the model already learned, but do not add new systematic capability.
- Architectural changes (hierarchical agents, planning with explicit state management) may be necessary for true length generalization.
For AI news readers, the takeaway is: the next time you read that a model performs “autonomous research projects,” ask how deep that horizon actually is, and whether the problem is within or outside that model’s training range.
This article was generated using artificial intelligence from primary sources.