FATE on arXiv: 33.5% lower attack success rate for LLM agents

FATE is a new approach to safety alignment for LLM agents, published on arXiv on 12 May 2026 by Bo Yin, Qi Li and Xinchao Wang. Instead of scoring individual responses, as classical RLHF does, FATE converts verifier-scored failure trajectories into on-policy repair supervision and trains on it with Pareto-Front Policy Optimization. The authors report a 33.5% reduction in attack success rate and an 82.6% reduction in harmful compliance.

On 12 May 2026, Bo Yin, Qi Li and Xinchao Wang published an arXiv paper addressing a key limitation of existing safety alignment methods for tool-using LLM agents: their focus on individual responses rather than entire execution trajectories. The proposed FATE framework (Failure-Trajectory Adversarial Training Evolution) captures failure types that response-level signals miss and reports significant safety gains.

What problem do classical safety methods miss?

Tool-using agents do not fail only in the final response. Failures can appear anywhere in the trajectory: unsafe tool calls, instruction injection, harmful compliance, and over-refusal. Existing safety signals are response-level or off-policy, which creates a trade-off between security and utility. A verifier that blocks an agent at the response level often also blocks legitimate use cases.
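To make the difference concrete, here is a minimal sketch of a response-level check next to a trajectory-level check. It is an illustration only, not the paper's verifier: the `Step` type, the tool names, the `UNSAFE_TOOLS` set and the phrase list are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str  # "tool_call" or "response"
    name: str  # tool name, or "" for a response
    text: str  # tool arguments or response text


# Hypothetical policy: tools the agent must not call in this task.
UNSAFE_TOOLS = {"send_money", "delete_files"}
# Hypothetical phrases a response-level filter might look for.
HARMFUL_PHRASES = ("here is the password",)


def response_level_flag(trajectory: list[Step]) -> bool:
    """Inspect only the final step, as a response-level signal would."""
    final = trajectory[-1]
    return any(phrase in final.text.lower() for phrase in HARMFUL_PHRASES)


def trajectory_level_flag(trajectory: list[Step]) -> bool:
    """Inspect every step, so unsafe tool calls are visible."""
    return any(s.kind == "tool_call" and s.name in UNSAFE_TOOLS for s in trajectory)


# An injected instruction triggers a payment, but the final answer looks harmless.
trajectory = [
    Step("tool_call", "read_email", "inbox"),
    Step("tool_call", "send_money", "to=attacker, amount=500"),
    Step("response", "", "Here is a summary of your inbox."),
]

print(response_level_flag(trajectory))    # False
print(trajectory_level_flag(trajectory))  # True
```

In this toy case the response-level check passes the run because the final text is harmless, while the trajectory-level check catches the unsafe tool call in the middle.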

How does FATE turn failure into repair supervision?

FATE operates in three steps. First, a verifier scores complete agent trajectories and identifies failures by dimension (security, utility, over-refusal control, trajectory validity). Second, in an on-policy self-evolution step, the same policy proposes repair candidates for those failures, and verifiers score the candidates again. Third, Pareto-Front Policy Optimization (PFPO) combines a supervised warm-up with Pareto-aware optimization: it seeks a direction in policy space that increases safety without losing utility.
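The Pareto idea behind the third step can be sketched as a dominance filter over verifier scores. This is a simplified illustration, not the paper's PFPO algorithm: the `Scores` type, the candidate names and all score values are assumptions, and only the four dimension names come from the description above.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    # Hypothetical verifier scores in [0, 1]; higher is better on every dimension.
    security: float
    utility: float
    over_refusal_control: float
    trajectory_validity: float


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: dict[str, Scores]) -> list[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    ]


# Made-up scores for one failed trajectory and three proposed repairs.
candidates = {
    "failed_trajectory": Scores(0.2, 0.9, 0.9, 1.0),
    "blanket_refusal": Scores(1.0, 0.1, 0.1, 1.0),
    "repair_a": Scores(0.9, 0.9, 0.9, 1.0),
    "repair_b": Scores(0.7, 0.6, 0.8, 1.0),
}

print(pareto_front(candidates))  # ['blanket_refusal', 'repair_a']
```

Here `repair_a` dominates both the original failure and `repair_b`, so it is the kind of candidate that raises security without giving up utility. The blanket refusal also stays on the front because nothing beats it on security alone, which shows that a dominance filter by itself does not choose among trade-offs; how PFPO makes that choice is specified in the paper, not in this sketch.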

How large are the concrete benchmark gains?

On AgentDojo, AgentHarm and ATBench, the authors report a 33.5% reduction in attack success rate, an 82.6% reduction in harmful compliance, and a 6.5% improvement in external trajectory-safety diagnosis. The results are reported to hold across different models and scales with useful behavior preserved, which suggests that the Pareto-front approach avoids much of the classical safety-utility trade-off.

The contribution of the paper lies in shifting verification from the response level to the trajectory level and in using the failure dataset itself as a training signal. This suggests that agents may learn safety better from their own mistakes than from external labeling.

Frequently Asked Questions

What is new about the FATE approach?

FATE operates at the level of the entire agent trajectory rather than individual responses. A verifier scores failure trajectories, and FATE uses those records for on-policy repair: the same policy proposes repair candidates, which verifiers then score again.

What are the concrete benchmark results?

Tests on AgentDojo, AgentHarm and ATBench showed a 33.5% reduction in attack success rate, 82.6% reduction in harmful compliance, and a 6.5% improvement in external trajectory-safety diagnosis, while task utility was maintained across different models and scales.

arXiv:2605.11882: FATE framework reduces agent attack success rate by 33.5% through on-policy self-evolution
