Long Horizons Destabilize LLM Training — ICML 2026

A paper accepted at ICML 2026 shows empirically that longer task horizons make LLM training markedly less stable, and it traces the instability to exploration and credit assignment problems. The proposed remedy is to shorten the horizon during training and rely on an explicit 'horizon generalization' mechanism at inference. The paper also establishes the first empirical scaling rules for task horizon in frontier model training.

A new arXiv preprint (2605.02572), accepted for ICML 2026, establishes the first systematic empirical rules for one of the key challenges in training agentic and reasoning frontier models: the instability that emerges as the task horizon (the number of steps taken before the reward signal arrives) grows. The main finding is that long horizons destabilize training through two separate mechanisms: exploration and credit assignment.

What are the empirical mechanisms of destabilization?

The authors structure the paper around two independent ablations. The first isolates the exploration problem: as the horizon grows, the probability that the model stumbles on a successful trajectory by chance drops exponentially. The reward signal therefore becomes sparse and the gradient carries little information, so the model receives few informative updates per training step.
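To make the exponential drop concrete, here is a minimal sketch under a simplifying assumption that does not come from the paper: every step succeeds independently with the same probability, and a trajectory is rewarded only if all of its steps succeed. The function names and the 0.95 per-step success rate are illustrative choices.

```python
def success_probability(p_step: float, horizon: int) -> float:
    """Chance that a randomly sampled trajectory succeeds end to end,
    assuming independent per-step success (a toy assumption)."""
    return p_step ** horizon


def expected_rollouts_for_one_success(p_step: float, horizon: int) -> float:
    """Mean number of rollouts until the first success (geometric distribution)."""
    return 1.0 / success_probability(p_step, horizon)


if __name__ == "__main__":
    for horizon in (10, 50, 100, 500):
        p = success_probability(0.95, horizon)
        n = expected_rollouts_for_one_success(0.95, horizon)
        print(f"H={horizon:4d}  P(success)={p:.3e}  rollouts per success ~ {n:.3e}")
```

Under this assumption the number of rollouts needed to see a single rewarded trajectory grows exponentially with the horizon, which is one way to read the sparse-reward effect described above.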

The second ablation focuses on the credit assignment problem: when a reward arrives only at the end of a long horizon, the learning signal has to be attributed across all the steps that preceded it. Per-step gradient variance grows with horizon length, so beyond a certain length the noise overwhelms the signal and the model stops converging or begins to oscillate.
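The variance effect can be reproduced in a toy setting. The sketch below is not the paper's experimental setup: it assumes a one-parameter Bernoulli policy, a single terminal reward, and the plain REINFORCE estimator with no baseline. All names are hypothetical.

```python
import math
import random
import statistics


def reinforce_gradient_sample(theta: float, horizon: int, rng: random.Random) -> float:
    """One REINFORCE estimate for a one-parameter Bernoulli policy.

    Toy setup (an assumption, not the paper's): at each step the policy picks
    action 1 with probability sigmoid(theta); the only reward arrives at the
    end and equals the fraction of steps where action 1 was chosen.
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    actions = [1 if rng.random() < p else 0 for _ in range(horizon)]
    terminal_reward = sum(actions) / horizon
    # d/dtheta log pi(a) = a - p for a Bernoulli policy with logit theta.
    score_sum = sum(a - p for a in actions)
    return terminal_reward * score_sum


def gradient_variance(theta: float, horizon: int, n_samples: int = 2000, seed: int = 0) -> float:
    """Sample variance of the gradient estimate across independent rollouts."""
    rng = random.Random(seed)
    samples = [reinforce_gradient_sample(theta, horizon, rng) for _ in range(n_samples)]
    return statistics.variance(samples)


if __name__ == "__main__":
    for horizon in (10, 100, 1000):
        print(horizon, round(gradient_variance(0.0, horizon), 3))
```

At theta = 0 the true gradient of the expected reward in this toy is 0.25 regardless of horizon, so any growth in the estimator's variance translates directly into a worse signal-to-noise ratio.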

Individually, both problems are known in the RL literature. The paper’s contribution is in the empirical quantification — the authors provide scaling rules that predict when a specific LLM training run will start to destabilize depending on model size and horizon length.

What is the ‘horizon generalization’ solution?

The proposed solution is methodologically minimal but conceptually important: train the model on shorter horizons, where credit assignment is less noisy, and then at inference activate an explicit horizon generalization mechanism — the model’s ability to apply the same reasoning pattern to longer trajectories than it encountered during training. This is analogous to length generalization in sequence-to-sequence learning, but applied to multi-step reasoning and agentic sequences.

Practical implications: teams training agentic models (Anthropic, OpenAI, Google DeepMind) may not need to train directly on 1,000-step sequences; instead, they can train on 50–100 steps and use horizon generalization as an inference-time technique.
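The article does not describe the paper's horizon generalization mechanism in enough detail to reproduce it, so the following is only a toy illustration of the train-short, evaluate-long idea. It assumes a small cyclic task with a terminal reward and a policy that depends only on the current state; `HIDDEN_TARGET`, `rollout`, and `train_short` are hypothetical names.

```python
import random

N_STATES = 4
N_ACTIONS = 4
HIDDEN_TARGET = [2, 0, 3, 1]  # hypothetical correct action for each state


def rollout(policy, horizon, rng, explore=False):
    """Run one episode; return (success, list of (state, action) pairs).

    The reward is terminal only: success means every one of the `horizon`
    steps chose the correct action for the current state.
    """
    state, trace = 0, []
    for _ in range(horizon):
        known = policy.get(state)
        # Explore randomly only in states the policy has not learned yet.
        action = rng.randrange(N_ACTIONS) if (explore and known is None) else known
        trace.append((state, action))
        if action != HIDDEN_TARGET[state]:
            return False, trace
        state = (state + 1) % N_STATES
    return True, trace


def train_short(horizon, n_rollouts, seed=0):
    """Random exploration at a fixed horizon; copy actions from rewarded rollouts."""
    rng = random.Random(seed)
    policy = {}
    for _ in range(n_rollouts):
        success, trace = rollout(policy, horizon, rng, explore=True)
        if success:
            policy.update(trace)
    return policy


if __name__ == "__main__":
    short_policy = train_short(horizon=4, n_rollouts=5000)
    long_policy = train_short(horizon=40, n_rollouts=5000)
    eval_rng = random.Random(1)
    print("trained at H=4, solves H=400:", rollout(short_policy, 400, eval_rng)[0])
    print("trained at H=40, solves H=400:", rollout(long_policy, 400, eval_rng)[0])
```

In this toy the short-horizon run finds rewarded trajectories and its policy carries over to much longer episodes, while the long-horizon run almost never sees a reward at all. The transfer works here only because the task is stationary and the policy ignores the step index; whether real agentic tasks behave this way is the question the paper investigates.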

Why does this matter for frontier model design?

The paper addresses a question that is becoming more pressing as agentic sessions in real deployments grow longer: Claude Code, Devin, OpenAI Codex and similar tooling routinely execute 200–500 steps in a single agentic session. If the paper’s findings are confirmed, frontier labs will likely shift part of agentic scaling from “train on ever-longer horizons” toward a “train short, generalize long” approach.

Limitations: the paper is primarily empirical (no closed-form theoretical bound for where exactly destabilization occurs), and focused on specific RL setups. Validation of these rules in the context of large commercial frontier model training pipelines is the logical next step — one that likely remains unpublished for competitive reasons.

Frequently Asked Questions

What is 'task horizon' in LLM training?

Task horizon is the number of steps a model must take before receiving a reward signal — for example, the number of agentic actions before a task succeeds or fails. The longer the horizon, the harder it is for the model to learn which steps contributed to success (the credit assignment problem).

Why do long horizons destabilize training?

The paper's ablations point to two causes: the exploration problem (the model rarely encounters successful sequences) and the credit assignment problem (when success does arrive, the learning signal must be attributed across many earlier steps, which makes the gradient noisy). Gradient variance grows with horizon length.

What is the 'horizon generalization' solution?

The approach is to train the model on shorter horizons, where credit assignment is less noisy, and then at inference explicitly activate 'horizon generalization' — the model's ability to apply the same reasoning pattern to longer sequences than it encountered during training.

