arXiv:2605.06642: StraTA agentic RL framework

The StraTA framework introduces a hierarchical GRPO rollout design for RL agent training: the model first generates a high-level strategy, then executes actions conditioned on that strategy. Reported results: ALFWorld 93.1%, WebShop 84.2%, SciWorld 63.5%. According to the authors, the SciWorld result surpasses closed-source frontier systems, which suggests that trajectory abstraction addresses key weaknesses of reactive agents.

The research paper “StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction” (Xue et al., arXiv:2605.06642), published 7 May 2026, presents a new approach to RL training of LLM agents: explicit strategy planning before action execution. The team from Shanghai AI Lab and Oxford reports results that, on one benchmark (SciWorld), surpass even closed-source frontier systems.

How does hierarchical GRPO work?

GRPO (Group Relative Policy Optimization) is an RL algorithm that samples a group of outputs for the same input and scores each one relative to the rest of the group, which removes the need for a separate value model. StraTA applies it hierarchically through three components: Strategy Sampling generates a compact strategic plan from the initial state, Conditioned Action Execution carries out actions conditioned on that plan, and Joint Training simultaneously optimises both strategy generation and action selection.
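The two-level idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the nested-list reward layout, and the choice to score a strategy by the mean reward of its rollouts are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def hierarchical_advantages(rewards_by_strategy):
    """Compute advantages at two levels (assumed layout, not from the paper).

    rewards_by_strategy[k][m] is the episode reward of rollout m
    executed under sampled strategy k.
    """
    # Strategy level: score each strategy by the mean reward of its
    # rollouts, then compare the strategies with one another.
    strategy_scores = [mean(rs) for rs in rewards_by_strategy]
    strategy_adv = group_relative_advantages(strategy_scores)

    # Action level: compare only rollouts that share the same strategy.
    action_adv = [group_relative_advantages(rs) for rs in rewards_by_strategy]
    return strategy_adv, action_adv

# Two strategies, two rollouts each: the first strategy succeeds once.
strategy_adv, action_adv = hierarchical_advantages([[1.0, 0.0], [0.0, 0.0]])
print(strategy_adv)
print(action_adv)
```

In this toy case the first strategy receives a positive advantage and the second a negative one, while the rollouts under the second strategy all receive zero advantage because their rewards are identical.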

What do the benchmarks show?

On the ALFWorld benchmark (text-based household tasks), StraTA achieves a 93.1% success rate. The WebShop benchmark (simulated online shopping) yields 84.2%, while SciWorld (scientific experiments) reaches 63.5% overall. The authors highlight that SciWorld results “surpass closed-source frontier models”, a rare achievement for an open RL approach.

Why does trajectory abstraction matter?

Explicit trajectory-level planning addresses two fundamental weaknesses of reactive LLM agents: limited exploratory capacity and poor credit assignment across long decision sequences. Rather than having the model “wander” through action space, a strategy anchors it to a coherent plan. Additional mechanisms for diverse strategy exploration and critical self-evaluation further increase robustness. For agentic system development, StraTA suggests that hierarchical decomposition is not merely an architectural improvement, but a foundation for efficient RL training.

Frequently Asked Questions

What is GRPO?

GRPO (Group Relative Policy Optimization) is an RL algorithm that optimises a policy by comparing groups of samples within the same batch, without needing a separate value model. The hierarchical variant in StraTA applies GRPO at two levels: strategy and action.
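For reference, the group-relative advantage in the standard (non-hierarchical) GRPO formulation can be written as follows, where G is the group size and r_i is the reward of the i-th sample in the group. How StraTA adapts this at the strategy and action levels is not specified in this summary.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}, \qquad i = 1, \dots, G
```

Because the baseline is the group mean, no learned value model is needed to estimate it.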

What problem does StraTA solve?

Classic LLM agents act reactively — they select the next action without a plan. This makes credit assignment difficult across long decision sequences. StraTA introduces an abstraction layer: the model first generates a strategic plan, then executes steps within that plan.
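The difference from a purely reactive agent can be sketched as a rollout loop in which every action prompt carries the strategy generated at the start. This is only an illustrative sketch: the prompt wording, the `rollout` signature, and the toy policy and environment are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: a policy maps a prompt to text, and an
# environment step maps (observation, action) to (observation, reward, done).
Policy = Callable[[str], str]
StepFn = Callable[[str, str], Tuple[str, float, bool]]

def rollout(policy: Policy, initial_obs: str, step_env: StepFn,
            max_steps: int = 5) -> Tuple[str, List[str], float]:
    # 1) Strategy sampling: one plan generated from the initial state.
    strategy = policy(f"Task: {initial_obs}\nWrite a short strategy:")
    obs, total, actions = initial_obs, 0.0, []
    for _ in range(max_steps):
        # 2) Conditioned execution: each action prompt includes the strategy.
        action = policy(f"Strategy: {strategy}\nObservation: {obs}\nNext action:")
        actions.append(action)
        obs, reward, done = step_env(obs, action)
        total += reward
        if done:
            break
    return strategy, actions, total

# Toy stand-ins so the sketch runs without a model or a benchmark.
def toy_policy(prompt: str) -> str:
    if prompt.endswith("Write a short strategy:"):
        return "find the mug, then put it on the desk"
    return "put mug on desk" if "holding mug" in prompt else "take mug"

def toy_env(obs: str, action: str) -> Tuple[str, float, bool]:
    if action == "take mug":
        return "holding mug", 0.0, False
    if action == "put mug on desk" and obs == "holding mug":
        return "mug on desk", 1.0, True
    return obs, 0.0, False

strategy, actions, total = rollout(toy_policy, "mug is on the shelf", toy_env)
print(strategy, actions, total)
```

A reactive baseline would correspond to dropping the `Strategy:` line from the action prompt, so each step is chosen from the current observation alone.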

Which benchmarks were used?

The team evaluated the framework on three benchmarks: ALFWorld (text-based household tasks), WebShop (online shopping), and SciWorld (scientific experiments). The reported results are 93.1%, 84.2%, and 63.5% respectively, with the SciWorld score surpassing closed-source frontier models.
