NRT-Bench: red-teaming LLM agents (attacks succeed in 8.7–12.1% of sessions)

NRT-Bench is a benchmark that measures the resilience of LLM agents against adaptive multi-turn adversarial attacks in a simulated nuclear power plant. Researchers found that attacks succeeded in 8.7–12.1% of sessions, and the vulnerabilities were almost entirely disjoint across the four tested models.

NRT-Bench: a new standard for safety testing of AI agents

On June 18, 2026, researchers published NRT-Bench (Nuclear-plant Red-Teaming Benchmark), an evaluation framework that measures how well LLM agents (large language models acting as autonomous operators) withstand systematic, adaptive attacks in environments where failure can have catastrophic consequences. Unlike prior approaches that relied on subjective LLM-based harm assessment, NRT-Bench uses an objective signal: a session ends the moment the agent loses control of any of the six critical safety functions (CSFs) of the simulated nuclear power plant.
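The objective stopping rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code: the source states that there are six CSFs but does not name them, so the identifiers `csf_1` to `csf_6`, the `PlantState` class, and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field

# Placeholder identifiers: the source says there are six CSFs but does not name them.
CSF_NAMES = tuple(f"csf_{i}" for i in range(1, 7))


@dataclass
class PlantState:
    """Hypothetical snapshot of the simulated plant; True means 'under control'."""
    csf_ok: dict = field(default_factory=lambda: {name: True for name in CSF_NAMES})


def lost_functions(state):
    """Return the CSFs that are no longer under control, in a fixed order."""
    return [name for name in CSF_NAMES if not state.csf_ok[name]]


def session_compromised(state):
    """A session counts as compromised as soon as any single CSF is lost."""
    return bool(lost_functions(state))


state = PlantState()
print(session_compromised(state))   # all six functions still under control
state.csf_ok["csf_3"] = False       # simulate losing one function
print(session_compromised(state), lost_functions(state))
```

The point of the design is that the verdict is a plain boolean over the plant state, so no second LLM has to judge whether an outcome was harmful.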

How does NRT-Bench simulate real-world threats?

A five-member team of virtual operators, each running a configurable LLM, manages the plant while an adversary injects malicious messages through four communication channels. Sessions are multi-turn, and the adversary receives feedback after every step. "Multi-turn" means the attacker does not send a single query but adapts the strategy turn by turn based on the system's reaction, which is closer to persistent social engineering than to a one-shot phishing attempt. Four frontier models were tested under a paired-replay protocol to ensure reproducibility.
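The feedback loop that makes an attack adaptive can be sketched as follows. Everything here is an assumption made for illustration: the channel names, the `run_session` function, and the toy attacker and environment are not taken from NRT-Bench, and the operator team is collapsed into a single hypothetical `environment` callable.

```python
# Placeholder channel names: the source mentions four channels but does not name them.
CHANNELS = ("channel_1", "channel_2", "channel_3", "channel_4")


def run_session(attacker, environment, max_turns):
    """Run one multi-turn session; return (compromised, turns_used)."""
    history = []  # (channel, message, feedback) tuples seen so far
    for turn in range(max_turns):
        # The attacker chooses its next move from everything observed so far.
        channel, message = attacker(history)
        feedback, compromised = environment(channel, message)
        history.append((channel, message, feedback))
        if compromised:
            return True, turn + 1
    return False, max_turns


def toy_attacker(history):
    """Hypothetical strategy: move to the next channel after each rejection."""
    rejected = sum(1 for _, _, feedback in history if feedback == "rejected")
    return CHANNELS[rejected % len(CHANNELS)], f"attempt {len(history) + 1}"


def toy_environment(channel, message):
    """Hypothetical stand-in for the plant and operators: one weak channel."""
    if channel == "channel_3":
        return "accepted", True
    return "rejected", False


print(run_session(toy_attacker, toy_environment, max_turns=10))
```

A single-turn evaluation corresponds to `max_turns=1`, where this toy attacker never reaches the weak channel. That is the kind of exposure a one-shot test would miss.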

Results: vulnerabilities are model-specific, not universal

Adaptive multi-turn attacks compromised safety functions in 8.7 to 12.1 percent of attack sessions, depending on the model. The per-model aggregates look similar, but they hide a troubling detail: of the 149 tested sessions, none brought down all four models, while a third brought down at least one. The vulnerabilities are almost disjoint: what breaks one model does not break another. Even more important for teams building defenses, the same protective measures (a guardrail stack or a safety-advisor agent) reduced the attack success rate for one model but increased it for another. Prior research was largely limited to single-turn attacks or LLM-based assessment, which underestimates true exposure.
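The gap between similar per-model rates and nearly disjoint vulnerabilities shows up once outcomes are tabulated per session. The sketch below assumes that paired replay means each attack session is replayed against every model, which the source does not spell out. The `summarize` function, the model names, and the four-session table are hypothetical toy data, not the paper's results.

```python
def summarize(outcomes):
    """outcomes maps session id -> {model name: True if the attack succeeded}."""
    models = sorted({m for per_model in outcomes.values() for m in per_model})
    n_sessions = len(outcomes)
    # Per-model attack success rate over all replayed sessions.
    rate = {m: sum(o[m] for o in outcomes.values()) / n_sessions for m in models}
    # Sessions that broke at least one model, and sessions that broke every model.
    any_broken = sum(any(o.values()) for o in outcomes.values())
    all_broken = sum(all(o.values()) for o in outcomes.values())
    return rate, any_broken, all_broken


# Toy data only: two hypothetical models with identical rates but no shared failures.
toy_outcomes = {
    "s1": {"model_a": True, "model_b": False},
    "s2": {"model_a": False, "model_b": True},
    "s3": {"model_a": False, "model_b": False},
    "s4": {"model_a": False, "model_b": False},
}

rate, any_broken, all_broken = summarize(toy_outcomes)
print(rate, any_broken, all_broken)
```

In the toy table both models fail in a quarter of the sessions, yet no session breaks both, while half of the sessions break at least one. Aggregate rates alone would hide that pattern.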

Open infrastructure for the broader community

The researchers are releasing the simulation environment, attack dataset, and replay infrastructure as open-source tools for reproducible security evaluation of LLM agents. The paper concludes that organizations deploying AI agents in safety-critical systems, from energy to healthcare, cannot assume that a model that proves resilient in one configuration will provide protection in another; every deployment requires its own adversarial evaluation.

Frequently Asked Questions

What is NRT-Bench and why does it matter for AI safety?

NRT-Bench is a benchmark that tests LLM agents acting as operators of a simulated nuclear power plant under multi-turn adversarial attacks. It provides an objective security measure without relying on LLM-based harm assessment.

How vulnerable were the tested models to multi-turn attacks?

In 8.7–12.1% of attack sessions the attacker managed to compromise at least one critical safety function of the plant, with the vulnerabilities of the four tested models showing almost no overlap.

arXiv:2606.20408: NRT-Bench — a multi-turn red-teaming benchmark for LLM agents in safety-critical systems
