ArXiv AEM: Adaptive entropy modulation for RL agents

AEM (Adaptive Entropy Modulation) is a supervision-free training method that dynamically modulates entropy across multi-turn interactions to balance exploration and exploitation in RL-trained agentic LLMs. Tested on models from 1.5B to 32B parameters, it delivers a 1.4% absolute improvement on SWE-bench Verified when integrated into a state-of-the-art baseline.

A team of authors including Haotian Zhao, Yuxin Zhang, Songlin Zhou, and collaborators has published AEM (Adaptive Entropy Modulation), a supervision-free method for training agentic LLMs with reinforcement learning (RL) that directly targets the instability of multi-turn training.

What problem does AEM solve?

Standard RL methods for multi-turn agentic tasks suffer from unstable training because the agent must balance exploration and exploitation differently in early and late turns of a conversation. In early turns, the agent is still discovering what the task looks like; in later turns it already has signal and should exploit the best solutions. Fixed RL hyperparameters fail to capture this dynamic.

Standard token-level entropy bonuses perform poorly because the entropy of an individual token is a weak proxy for “how much the system is exploring” in a multi-turn sense.

How does adaptive modulation work?

AEM analyzes entropy at the response level, not the individual token level. The authors derive a practical proxy that enables a natural transition from exploration to exploitation, guided by two signals:

Advantage — a score of how much better the response is than the baseline policy
Relative response surprisal — how “unexpected” the response is under the current model

This system is not supervised — it requires no manual annotation of “when to explore,” instead measuring the training state directly.
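To make the two signals concrete, here is a minimal sketch of how they could be measured for a group of sampled responses. It is not the authors' implementation. It assumes that the baseline is the mean reward of the group, that response surprisal is the summed negative log-probability of the response tokens, and that "relative" means relative to the group mean. The function names `response_surprisal` and `group_signals` are hypothetical, and how AEM combines the two signals into an entropy adjustment is not described in this summary, so the sketch stops at computing them.

```python
from statistics import mean


def response_surprisal(token_logprobs):
    """Negative log-probability of a whole response under the current policy.

    Assumption: surprisal is summed over tokens, with no length normalization.
    """
    return -sum(token_logprobs)


def group_signals(rewards, responses_token_logprobs):
    """Return one (advantage, relative_surprisal) pair per sampled response.

    Assumption: both signals are centered on the mean of the sampled group.
    """
    baseline = mean(rewards)
    surprisals = [response_surprisal(lp) for lp in responses_token_logprobs]
    mean_surprisal = mean(surprisals)
    return [
        (reward - baseline, surprisal - mean_surprisal)
        for reward, surprisal in zip(rewards, surprisals)
    ]


# Two sampled responses: the first succeeded and was likely under the policy,
# the second failed and was unlikely under the policy.
rewards = [1.0, 0.0]
token_logprobs = [[-0.1, -0.2], [-1.0, -2.0]]

for advantage, rel_surprisal in group_signals(rewards, token_logprobs):
    print(f"advantage={advantage:+.2f}  relative_surprisal={rel_surprisal:+.2f}")
```

Both quantities come directly from rollouts and their rewards, which is why no manual annotation is needed.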

Which models and benchmark?

Experiments cover models from 1.5 to 32 billion parameters. The primary evaluation is on SWE-bench Verified, the industry standard for agentic LLMs on programming tasks.

Result: a 1.4% absolute improvement when AEM is integrated into a state-of-the-art baseline. The gain is solid rather than dramatic, but it is notable because it arrives without additional supervision or changes to the base RL formulation.

What does this say about the broader trend?

AEM is the fifth paper on RL training of agentic systems to appear on ArXiv in the last two weeks, alongside the previously covered Latent-GRPO (May 2) and Exploration Hacking (May 2). The field is intensely focused on stabilizing multi-turn training, which is a prerequisite for reliable production agents. AEM’s supervision-free approach is particularly attractive for labs that cannot collect manually annotated training data at the required scale.

Frequently Asked Questions

What is entropy in the context of RL training for LLMs?

A measure of uncertainty in next-token or response selection — higher entropy means more exploration of different options, lower entropy means exploiting already-learned patterns.
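As a small illustration of the token-level case, the sketch below computes the Shannon entropy of a next-token distribution. The two example distributions are made up for illustration and do not come from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy, in nats, of a discrete probability distribution."""
    # Terms with zero probability contribute nothing and are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]  # undecided among four tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # strongly prefers one token

print(entropy(uniform))  # equals log(4), the maximum for four options
print(entropy(peaked))   # much lower than the uniform case
```

The undecided distribution has the highest possible entropy for four options, while the confident one is close to zero, matching the exploration versus exploitation reading above.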

Why does AEM modulate entropy at the response level rather than the token level?

Token-level entropy correlates poorly with multi-turn agent behavior quality; response-level analysis offers a more precise proxy for when exploration should transition to exploitation.

What is SWE-bench Verified?

An industry-standard benchmark for evaluating agentic LLMs on software engineering tasks — verifying solutions to real GitHub issues.
