ICML 2026 Research: SFT and RL Agents Suffer Dramatic Performance Drops Outside Controlled Benchmarks

A paper accepted at ICML 2026 systematically tests LLM tool-use agents under environment shifts at four levels: Perception, Interaction, Reasoning, and Internalization. It finds that both SFT- and RL-trained agents degrade significantly under modest distribution shifts, and that accuracy on controlled benchmarks does not predict real-world robustness. The authors propose PAFT (Perturbation-Augmented Fine-Tuning) as a mitigation.

This article was generated using artificial intelligence from primary sources.

The paper “Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use” by Song-Lin Lv, Weiming Wu, Rui Zhu, Zi-Jian Cheng, and Lan-Zhe Guo was accepted at ICML 2026 and published on July 1, 2026. The research directly challenges an assumption that underlies many evaluation practices: that good benchmark accuracy means a robust agent in deployment.

A controlled sandbox for open-world stress-testing

The research team developed a reproducible sandbox that enables systematic testing of distribution shifts across four hierarchical levels:

  • Perception — shifts in how the agent receives and interprets input information
  • Interaction — changes in the interface and behavior of the tools the agent works with
  • Reasoning — changes in the logical inference requirements within a task
  • Internalization — domain shifts that require adaptation of learned knowledge

Each level models a specific type of variation that realistically arises in actual deployment but is rarely present in standard training and evaluation datasets.
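
As a rough illustration of what shifts at two of these levels could look like, the sketch below perturbs a toy observation (Perception) and a toy tool schema (Interaction) without changing the underlying task. The function names, the schema layout, and the `get_weather` tool are assumptions made for this example; they are not taken from the paper's sandbox.

```python
def perturb_observation(observation: dict) -> str:
    """Perception-style shift (hypothetical): same facts, different surface format.

    The agent was shown a dict during training; here it receives a flat string.
    """
    return "; ".join(f"{key}={value}" for key, value in sorted(observation.items()))


def perturb_tool_schema(schema: dict, renames: dict) -> dict:
    """Interaction-style shift (hypothetical): rename parameters, keep semantics.

    Returns a new schema and leaves the original untouched.
    """
    params = {renames.get(name, name): spec
              for name, spec in schema["parameters"].items()}
    return {**schema, "parameters": params}


# Toy tool definition, invented for this example.
weather_tool = {"name": "get_weather",
                "parameters": {"city": "string", "unit": "string"}}

shifted_tool = perturb_tool_schema(weather_tool, {"city": "location"})
shifted_obs = perturb_observation({"temp": 21, "city": "Oslo"})
```

In both cases the task itself is unchanged; only the way it is presented to the agent differs, which is the kind of variation the levels above are meant to isolate.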

Key findings: Static training creates fragility

Why does benchmark accuracy not predict robustness?

The central finding of the research is that agents trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) show significant performance degradation at all four levels of distribution shift — even when those shifts are modest.

The critical implication: controlled benchmark accuracy does not predict actual robustness. The gap between benchmark performance and performance under real-world conditions is large and systematically underestimated. An agent that achieves excellent results in a controlled environment can drop dramatically in performance when any aspect of tool interaction changes — even without any change to the task itself.

This directly challenges the assumption that SFT- or RL-trained tool-use agents will reliably generalize to new tooling, new APIs, or new application domains.
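
One simple way to put a number on the gap between benchmark and shifted performance, assuming accuracy is measured once on the controlled benchmark and once per shift, is the relative drop. The helper names and the accuracies below are placeholders invented for illustration; they are not metrics or results from the paper.

```python
def relative_drop(benchmark_acc: float, shifted_acc: float) -> float:
    """Fraction of benchmark accuracy lost under a shift (0.0 means no loss)."""
    if benchmark_acc <= 0:
        raise ValueError("benchmark accuracy must be positive")
    return (benchmark_acc - shifted_acc) / benchmark_acc


def robustness_report(benchmark_acc: float,
                      shifted_accs: dict[str, float]) -> dict[str, float]:
    """Relative drop for each named shift."""
    return {name: relative_drop(benchmark_acc, acc)
            for name, acc in shifted_accs.items()}


# Placeholder accuracies, made up purely to show the calculation.
report = robustness_report(1.0, {"shift_a": 0.75, "shift_b": 0.5})
```

Reporting the drop per shift, rather than a single benchmark score, makes it visible when an agent with a high headline accuracy is fragile in one specific dimension.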

PAFT: Perturbations as part of training

As a mitigation, the researchers propose PAFT (Perturbation-Augmented Fine-Tuning) — a fine-tuning strategy that explicitly incorporates environmental perturbations into the training process. Rather than the agent learning only from static examples of correct tool use, PAFT trains on modified versions that simulate the distribution shifts that will appear in deployment.

The approach is conceptually close to data augmentation methods in computer vision — but adapted to the specific structure of variations in agentic tool-use scenarios.
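
The general idea of perturbation-based augmentation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' PAFT implementation: the example layout (`prompt`, `tool`, `target_call`), the `rename_parameter` perturbation, and the `augment` helper are all hypothetical.

```python
import copy
import random
from typing import Callable

Example = dict
Perturbation = Callable[[Example, random.Random], Example]


def rename_parameter(example: Example, rng: random.Random) -> Example:
    """Hypothetical perturbation: rename one tool parameter.

    The rename is applied to the schema AND the gold call, so the
    training label stays correct for the perturbed environment.
    """
    shifted = copy.deepcopy(example)
    params = shifted["tool"]["parameters"]
    old = rng.choice(sorted(params))
    new = f"{old}_v2"
    params[new] = params.pop(old)
    args = shifted["target_call"]["arguments"]
    if old in args:
        args[new] = args.pop(old)
    return shifted


def augment(dataset: list, perturbations: list,
            copies_per_example: int = 1, seed: int = 0) -> list:
    """Keep every original example and add perturbed copies of it."""
    rng = random.Random(seed)
    augmented = []
    for example in dataset:
        augmented.append(example)
        for _ in range(copies_per_example):
            perturb = rng.choice(perturbations)
            augmented.append(perturb(example, rng))
    return augmented


# Toy training example, invented for this sketch.
dataset = [{
    "prompt": "What is the weather in Oslo?",
    "tool": {"name": "get_weather", "parameters": {"city": "string"}},
    "target_call": {"name": "get_weather", "arguments": {"city": "Oslo"}},
}]
train_set = augment(dataset, [rename_parameter])
```

The resulting `train_set` would then be passed to an ordinary fine-tuning pipeline. The key design point in this sketch is that each perturbation keeps the environment and the target consistent, so the model is rewarded for reading the current tool description rather than recalling a memorized one.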

Infrastructure contribution

Beyond the findings, the paper offers a concrete infrastructure contribution: a reproducible sandbox for open-world stress-testing of tool-use agents that can be applied independently of a specific model architecture. This is particularly valuable because it enables researchers and practitioners to verify the robustness of their own agents in a standardized way — rather than relying exclusively on benchmark accuracy.
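
To show what "independent of model architecture" can mean in practice, the sketch below treats an agent as any callable that maps a task to a proposed tool call, and scores it with and without a shift. The interface, the `rename_tool` shift, and both toy agents are assumptions made for this example and do not describe the paper's actual sandbox API.

```python
from typing import Callable, Optional

Task = dict
Agent = Callable[[Task], dict]  # any model wrapped as: task -> proposed tool call
Shift = Callable[[Task], Task]  # returns a perturbed copy of a task


def accuracy(agent: Agent, tasks: list, shift: Optional[Shift] = None) -> float:
    """Share of tasks where the agent's call equals the expected call."""
    if not tasks:
        raise ValueError("need at least one task")
    cases = [shift(task) if shift else task for task in tasks]
    hits = sum(agent(case) == case["expected_call"] for case in cases)
    return hits / len(cases)


def rename_tool(task: Task) -> Task:
    """Hypothetical shift: the same tool is exposed under a new name."""
    new_name = task["tool_name"] + "_v2"
    return {**task,
            "tool_name": new_name,
            "expected_call": {**task["expected_call"], "name": new_name}}


def brittle_agent(task: Task) -> dict:
    """Toy agent that memorized the tool name seen in training."""
    return {"name": "get_weather", "arguments": {"city": task["city"]}}


def schema_reading_agent(task: Task) -> dict:
    """Toy agent that reads the tool name from the task it is given."""
    return {"name": task["tool_name"], "arguments": {"city": task["city"]}}


# Toy task, invented for this sketch.
tasks = [{
    "city": "Oslo",
    "tool_name": "get_weather",
    "expected_call": {"name": "get_weather", "arguments": {"city": "Oslo"}},
}]

baseline = accuracy(brittle_agent, tasks)
shifted = accuracy(brittle_agent, tasks, shift=rename_tool)
```

Because the harness only depends on the callable interface, the same tasks and shifts can be reused to compare differently trained agents.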

Acceptance at ICML 2026 suggests that the community treats this kind of evaluation infrastructure as a methodological priority. With agentic systems now being deployed in production environments, understanding how far statically trained agents generalize is critical for responsible development.

Frequently Asked Questions

Why doesn't high benchmark accuracy guarantee robustness in the real world?
The research shows that standard benchmarks do not model the distribution shifts that occur in real deployments. Small changes in perception, interaction, reasoning, or domain are sufficient to cause significant performance drops in agents trained exclusively on static datasets.

What is PAFT and how does it help?
PAFT (Perturbation-Augmented Fine-Tuning) is a fine-tuning method that explicitly incorporates environmental perturbations into training, making the agent more robust to distribution shifts that arise in real tool-use scenarios.

At which levels is agent robustness tested in this research?
The sandbox covers four hierarchical levels: Perception (how the agent perceives information), Interaction (how it communicates with tools), Reasoning (logical inference), and Internalization (adapting to domain changes).