LLMs Learn the Shortest Path on Graphs — But Fail When the Task Horizon Grows
Why it matters
A new arXiv paper systematically investigates LLM generalization on the shortest-path problem across two dimensions: spatial transfer to unseen maps works well, but horizon-length scaling consistently fails due to recursive instability. The findings have direct implications for autonomous agents — training data coverage defines the boundary of capability, RL improves stability but does not extend that boundary, and inference-time scaling helps but does not solve the length-scaling problem.
What Was Tested?
A research team of Tong, Ye, Borovykh, and Shokri published a paper on arXiv that systematically analyzes whether an LLM can achieve systematic generalization on a classic algorithmic problem — finding the shortest path in a graph. Testing covered two independent dimensions of generalization:
- Spatial transfer — can a model trained on one set of graphs solve problems on unseen maps with different topologies?
- Horizon scaling — can a model trained on shorter paths (say, 5–10 steps) correctly solve longer paths (50+ steps)?
This methodology is deliberately broader than standard benchmark types — it does not only measure whether questions are new, but whether the structural demands are harder than what the model saw during training.
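The paper's exact data-generation pipeline is not reproduced here, but a minimal sketch of what such a benchmark setup could look like is simple to write: random graphs, breadth-first search as the ground-truth shortest-path oracle, and a train/test split by path length. The graph generator and the 5-hop split threshold below are illustrative assumptions, not details from the paper.

```python
from collections import deque
import random

def bfs_shortest_path(adj, start, goal):
    """Ground-truth shortest path via breadth-first search."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # no path exists

def random_graph(n_nodes, n_extra_edges, rng):
    """Random undirected graph as an adjacency dict.
    A shuffled spanning chain guarantees connectivity."""
    adj = {i: set() for i in range(n_nodes)}
    nodes = list(range(n_nodes))
    rng.shuffle(nodes)
    for a, b in zip(nodes, nodes[1:]):
        adj[a].add(b); adj[b].add(a)
    for _ in range(n_extra_edges):
        a, b = rng.sample(range(n_nodes), 2)
        adj[a].add(b); adj[b].add(a)
    return adj

rng = random.Random(0)
g = random_graph(20, 10, rng)
path = bfs_shortest_path(g, 0, 19)
# Split by hop count: short paths go to training, long ones to a
# held-out set — this is the "horizon scaling" axis of the study.
horizon_split = "train" if len(path) - 1 <= 5 else "test"
```

Spatial transfer would then correspond to holding out entire graph topologies rather than path lengths.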
What Did They Find?
The results are consistent and noteworthy:
- Spatial transfer: successful. Models that learn to find paths in a set of graphs successfully generalize to unseen topologies of the same size. This means "learning an algorithm" is possible to some degree.
- Length scaling: consistent failure. When path length extends beyond the training range, models fail due to recursive instability — small errors in one step accumulate exponentially by the end.
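The compounding effect behind recursive instability can be illustrated with back-of-envelope arithmetic (an illustrative model, not the paper's analysis): if each step succeeds independently with probability p, whole-path success decays geometrically with path length.

```python
def path_success_prob(per_step_accuracy: float, n_steps: int) -> float:
    """Probability of a fully correct path, assuming each step
    succeeds independently with the same per-step accuracy."""
    return per_step_accuracy ** n_steps

# Even 99% per-step accuracy collapses on long horizons:
print(round(path_success_prob(0.99, 10), 3))   # 0.904
print(round(path_success_prob(0.99, 100), 3))  # 0.366
```

Under this toy model, a 50-step path would need roughly 99.9% per-step reliability to succeed most of the time — which is why small per-step errors dominate at long horizons.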
Additionally, three interventions were tested:
- Data coverage defines the capability boundary — a model can only handle what its training data covered; extrapolating beyond that coverage does not work.
- RL (reinforcement learning) improves stability within the training range, but does not extend generalization boundaries.
- Inference-time scaling (larger token budget, chain-of-thought) helps, but does not solve the fundamental length-scaling problem.
Why Does This Matter for Autonomous Agents?
Many practical agent tasks require a long horizon: multi-step planning, research, software engineering projects that span days, iterative debugging of complex systems. If LLMs structurally cannot scale with length — which this paper suggests — then agent autonomy is fundamentally bounded by the size of problems seen in training.
This aligns with earlier findings (e.g., the LongCoT benchmark, where GPT scores 9.8% on long chain-of-thought reasoning): even the seemingly strongest models collapse as the problem grows longer.
What Does This Mean in Practice?
The researchers do not claim the problem is unsolvable, but they identify three hard constraints:
- Synthetic dataset coverage must explicitly include long paths — otherwise the model will never learn how to handle them.
- RL and inference-time scaling are not magic wands — they improve what the model already learned, but do not add new systematic capability.
- Architectural changes (hierarchical agents, planning with explicit state management) may be necessary for true length generalization.
For AI news readers, the takeaway is: the next time you read that a model performs “autonomous research projects,” ask how deep that horizon actually is, and whether the problem is within or outside that model’s training range.
This article was generated using artificial intelligence from primary sources.