AWS SageMaker Guide: In Multi-Turn RL, Reward and Evaluation Matter More Than the Algorithm

The AWS SageMaker AI guide for multi-turn reinforcement learning places reward function quality and evaluation independence ahead of algorithmic choice. Dense rewards prevent variance collapse, and reward hacking occurs when the agent optimizes the metric without solving the actual task. On the SOP-Bench benchmark, correctly configured training achieved a 13 percent better Task Success Rate and around 16 percent better per-field accuracy.

This article was generated using artificial intelligence from primary sources.

AWS has published a comprehensive guide for multi-turn reinforcement learning on the Amazon SageMaker AI platform. The focus is not on exotic algorithms or infrastructure scaling. The document's central thesis is simple and runs counter to common assumptions: the quality of the reward function and the independence of evaluation determine whether training produces a useful agent, far more than the choice of RL algorithm or hyperparameter configuration.

Reward and evaluation matter more than the algorithm

Multi-turn reinforcement learning differs from standard RL in that the agent must make sequential decisions across multiple turns, and the context grows with each interaction. SageMaker AI provides a modular agent-and-environment interface for this, asynchronous rollout collection with controlled off-policy staleness, and native algorithms: PPO, CISPO, and importance-sampling losses. The platform also offers sequence-extension training for managing long trajectories, plus MLflow integration for per-turn tracking.

But the guide makes it clear: no algorithmic shortcut compensates for a poorly designed reward or insufficiently independent evaluation. Both elements must be correctly set up before training even begins. AWS defines a clear priority hierarchy: collect and split representative data, build a hermetic environment, define an independent test set, establish a performance baseline — and only then design the reward and start training.

What are the most common pitfalls in reward function design?

The first pitfall is Goodhart’s Law in the context of RL: an agent that optimizes the reward metric without solving the actual task. The AWS document gives concrete indicators of reward hacking — if the training reward rises while the validation reward remains flat, or if the base model achieves a higher reward on the training set than external evaluation shows, this signals that the reward parser is letting through cases that the evaluation criterion grades more strictly. The fix is to tighten the parser and conduct an offline review of new rollouts.
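The first indicator, a rising training reward next to a flat validation reward, can be checked automatically from logged curves. The following is a minimal sketch, not something taken from the AWS guide: the function names and both slope thresholds are illustrative assumptions, and it only compares least-squares trends over a window of recent steps.

```python
def linear_slope(values):
    """Least-squares slope of values against their step index."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def looks_like_reward_hacking(train_rewards, val_rewards,
                              min_train_slope=0.005, max_val_slope=0.001):
    """Flag a window where training reward climbs but validation does not.

    Both thresholds are hypothetical defaults; tune them to the reward scale.
    """
    train_rising = linear_slope(train_rewards) >= min_train_slope
    val_flat = linear_slope(val_rewards) <= max_val_slope
    return train_rising and val_flat


# Hypothetical logged values for five consecutive evaluation points.
train = [0.20, 0.30, 0.40, 0.50, 0.60]
val_flat = [0.20, 0.21, 0.20, 0.19, 0.20]
val_healthy = [0.20, 0.28, 0.37, 0.45, 0.55]

suspicious = looks_like_reward_hacking(train, val_flat)
healthy = looks_like_reward_hacking(train, val_healthy)
```

A positive flag is only a prompt for the offline rollout review described above, not proof of hacking on its own.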

The second pitfall is the binary reward. If all rollouts in a group receive an identical score — all zeros or all ones — the gradient vanishes and training stagnates. The guide recommends dense reward functions that give partial credit for progress toward the solution even when the final answer is wrong. For diagnostics, track rollout/reward/zero_frac — the fraction of trajectories with a zero reward — and reduce group_size from 8 to 4 if the fraction is too high.
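The vanishing gradient is easy to see in a small example. The sketch below assumes group-relative advantage normalization (each reward minus the group mean, divided by the group standard deviation), which the guide's group-based terminology suggests but which is an assumption here. The field names and the per-field partial-credit rule are hypothetical.

```python
from statistics import mean, pstdev


def binary_reward(predicted, expected):
    """1.0 only when every field matches exactly, otherwise 0.0."""
    return 1.0 if predicted == expected else 0.0


def dense_reward(predicted, expected):
    """Partial credit: the fraction of expected fields predicted correctly."""
    if not expected:
        return 0.0
    correct = sum(1 for key, value in expected.items()
                  if predicted.get(key) == value)
    return correct / len(expected)


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages; all zero when every reward is identical."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical task with three output fields and a group of four rollouts,
# none of which is fully correct.
expected = {"part_id": "A-17", "status": "fail", "action": "replace"}
rollouts = [
    {"part_id": "A-17", "status": "fail", "action": "repair"},
    {"part_id": "A-17", "status": "pass", "action": "none"},
    {},
    {"part_id": "A-17", "status": "fail"},
]

binary_rewards = [binary_reward(r, expected) for r in rollouts]
dense_rewards = [dense_reward(r, expected) for r in rollouts]

binary_adv = group_advantages(binary_rewards)  # no learning signal
dense_adv = group_advantages(dense_rewards)    # rollouts are now ranked
```

With the binary reward every rollout scores zero, so every advantage is zero and the group contributes nothing to the update. The dense reward separates the near misses from the empty answer.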

The third pitfall is self-evaluation: a system that measures its own success cannot detect its own reward hacking. AWS stresses the requirement for independent external evaluation on a held-out test set, with criteria stricter than those used in the training reward. The guide explicitly distinguishes between measuring generalization and measuring independence from reward hacking.

Managing context across multiple turns

Multi-turn agents face a specific problem that does not exist in single-turn RL: context grows with each interaction and can become computationally expensive or semantically stale. AWS recommends setting max_turns = ceil(N × 1.5) where N corresponds to the typical number of turns a skilled human needs for the same task. If more than 5 percent of responses hit the per-turn token limit, sampling_max_tokens should be increased, as clustering of responses at the boundary indicates a structural constraint.
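Both rules of thumb reduce to simple arithmetic. The sketch below uses hypothetical helper names and assumes that a response counts as hitting the limit when its token count is at least sampling_max_tokens.

```python
import math


def recommended_max_turns(typical_human_turns):
    """max_turns = ceil(N x 1.5), with N the turns a skilled human needs."""
    return math.ceil(typical_human_turns * 1.5)


def truncation_fraction(response_token_counts, sampling_max_tokens):
    """Fraction of responses that reached the per-turn token limit."""
    if not response_token_counts:
        return 0.0
    hits = sum(1 for n in response_token_counts if n >= sampling_max_tokens)
    return hits / len(response_token_counts)


def should_raise_token_limit(response_token_counts, sampling_max_tokens,
                             threshold=0.05):
    """True when more than 5 percent of responses hit the limit."""
    fraction = truncation_fraction(response_token_counts, sampling_max_tokens)
    return fraction > threshold


# A task that a skilled human typically completes in 5 turns.
max_turns = recommended_max_turns(5)

# Hypothetical token counts: 6 of 100 responses reach a 512-token limit.
token_counts = [512] * 6 + [100] * 94
raise_limit = should_raise_token_limit(token_counts, 512)
```

Here max_turns comes out as 8 (7.5 rounded up), and 6 percent of responses reach the limit, which is above the 5 percent threshold.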

Four metrics are key for monitoring training health: the fraction of trajectories with a zero reward (zero_frac), the fraction of rollout groups discarded because all their scores are identical (zero_adv_groups), and the pass rate on the validation set at one attempt (pass_k_1) and at eight attempts (pass_k_8). A drop or stagnation in pass_k_1 while zero_adv_groups remains high is a signal to reduce group_size or increase rollout diversification.
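To make these definitions concrete, the following sketch computes the metrics from raw rollout data. It is an illustration only: the platform reports these values itself, and the exact definitions used here are assumptions, in particular that pass_k counts a prompt as passed when at least one of its first k attempts succeeds.

```python
def zero_frac(groups):
    """Fraction of all trajectories whose reward is exactly zero."""
    flat = [r for group in groups for r in group]
    if not flat:
        return 0.0
    return sum(1 for r in flat if r == 0.0) / len(flat)


def zero_adv_group_frac(groups):
    """Fraction of groups in which every rollout got the same reward."""
    if not groups:
        return 0.0
    uniform = sum(1 for group in groups if len(set(group)) <= 1)
    return uniform / len(groups)


def pass_at_k(attempts_per_prompt, k):
    """Fraction of prompts with at least one success in the first k attempts."""
    if not attempts_per_prompt:
        return 0.0
    passed = sum(1 for attempts in attempts_per_prompt if any(attempts[:k]))
    return passed / len(attempts_per_prompt)


# Hypothetical rewards for four rollout groups of size 4.
groups = [
    [0.0, 0.0, 0.0, 0.0],    # uniform: discarded
    [1.0, 1.0, 1.0, 1.0],    # uniform: discarded
    [0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.25, 0.0],
]

# Hypothetical validation outcomes: 8 attempts for each of 3 prompts.
attempts = [
    [True] + [False] * 7,
    [False, False, True] + [False] * 5,
    [False] * 8,
]

metrics = {
    "zero_frac": zero_frac(groups),
    "zero_adv_groups": zero_adv_group_frac(groups),
    "pass_k_1": pass_at_k(attempts, 1),
    "pass_k_8": pass_at_k(attempts, 8),
}
```

In this toy data half of the groups are uniform, and the gap between pass_k_1 and pass_k_8 shows that the model can solve a prompt it does not solve on the first try.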

A special danger is policy collapse: a sudden drop of the reward toward zero after 40 to 80 training steps. AWS recommends setting async_config.max_steps_off_policy = 0 and potentially switching from CISPO to PPO. Stabilization usually occurs within 25 to 50 steps of the intervention.
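A collapse of this kind can be caught early by watching a moving average of the training reward. The sketch below is one possible detector, not a method from the guide; the window size and the drop ratio are hypothetical values chosen for illustration.

```python
def detect_collapse(rewards, window=10, drop_ratio=0.2):
    """Return the first step at which the moving-average reward falls below
    drop_ratio times the best moving average seen so far, or None.

    window and drop_ratio are illustrative defaults, not recommended values.
    """
    best = 0.0
    for end in range(window, len(rewards) + 1):
        avg = sum(rewards[end - window:end]) / window
        if best > 0.0 and avg < drop_ratio * best:
            return end - 1  # index of the last step in the offending window
        best = max(best, avg)
    return None


# Hypothetical run: a stable reward for 50 steps, then a drop to zero.
collapsing_run = [0.5] * 50 + [0.0] * 20
stable_run = [0.5] * 70

collapse_step = detect_collapse(collapsing_run)
no_collapse = detect_collapse(stable_run)
```

The detector only raises the alarm. The intervention itself remains the one described above: setting async_config.max_steps_off_policy = 0 and considering a switch from CISPO to PPO.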

Concrete results and tooling

The AWS guide illustrates the principles through the SOP-Bench benchmark for aircraft inspection. Initial training attempts — with parallel tasks, misaligned one-shot examples, and incorrect output tag formatting — produced unstable and poor results. After targeted corrections (focus on a single task, aligned examples, correct output tags), the fine-tuned model improved Task Success Rate by 13 percent and per-field accuracy by around 16 percent.

For implementation, SageMaker provides MultiTurnRLTrainer and MultiTurnRLEvaluator as high-level abstractions, the SOP-Bench dataset for standardized benchmarking, and MLflow integration for tracking trajectories at the level of each individual turn. For production deployment of trained agents, Bedrock AgentCore is the recommended path.

The guide is aimed at ML engineers building agents for real-world tasks — from customer request resolution to content moderation. The core conclusion holds regardless of domain: investing in a properly hermetic training environment and genuinely independent evaluation yields far more than iterating on algorithms and hyperparameters.

Frequently Asked Questions

What is reward hacking and how do you recognize it?
Reward hacking occurs when an agent optimizes the reward metric without actually solving the task (Goodhart's Law in RL). Reliable signals: the training reward rises while the validation reward remains flat, or the base model achieves a higher training reward than external evaluation shows.
Why are binary rewards problematic in multi-turn RL?
If all rollouts in a group receive an identical score (all zeros or all ones), the gradient vanishes and training stagnates. Dense reward functions that give partial credit for progress toward the solution effectively resolve this problem.
How should max_turns be determined for a multi-turn agent?
AWS recommends max_turns = ceil(N × 1.5) where N corresponds to the typical number of turns a skilled human needs for the same task. If more than 5 percent of responses hit the per-turn token limit, sampling_max_tokens should be increased.