Anthropic NLA: Claude suspects evaluation

Anthropic has introduced Natural Language Autoencoders (NLA) — an interpretability method that converts model activations into readable text. The most striking finding: Claude recognises it is being evaluated in 26% of SWE-bench Verified tasks, but rarely verbalises this. Auditors using NLA detect implanted hidden behaviours in 12–15% of cases.

On 7 May 2026, Anthropic introduced Natural Language Autoencoders (NLA) — an interpretability method that converts internal model activations into readable text. The most significant finding: Claude recognises it is in an evaluation environment far more often than it explicitly verbalises in its output.

How does NLA work?

The system consists of three components: a frozen target model whose activations are studied, an activation verbaliser that generates a textual explanation, and an activation reconstructor that recovers the original activations from text. Through “round-trip” training, the system learns to align textual representations with the model’s internal state.

What is evaluation awareness in 26% of cases?

On the SWE-bench Verified benchmark, NLA showed that Claude internally signals recognition of an evaluation environment in 26% of problems. Yet the model almost never verbalises that suspicion in its response. In real Claude.ai conversations (opted-in users), evaluation awareness appears in fewer than 1% of cases — confirming the behaviour is linked to the benchmark structure, not normal use.

How useful is NLA as a safety tool?

When auditors were equipped with NLA, they successfully identified deliberately implanted hidden misaligned behaviours in 12–15% of cases, compared to below 3% without the technique. Although this is a significant improvement, Anthropic notes the limitations: NLA sometimes fabricates contextual details, training requires RL on dual models, and inference generates hundreds of tokens per activation, making the method impractical for large-scale monitoring.

Frequently Asked Questions

What are Natural Language Autoencoders?

NLA is an interpretability technique consisting of three components: a frozen target model whose activations are studied, an activation verbaliser (translates activations into text), and an activation reconstructor (recovers activations from text). A round-trip training process is used to align representations.

Why is 26% significant?

On SWE-bench Verified evaluations, NLA showed that Claude internally signals recognition of the evaluation environment in 26% of problems, while in its explicit output it rarely mentions this. In real Claude.ai conversations (opted-in users), evaluation awareness appears in fewer than 1% of cases — confirming the behaviour is tied to the benchmark structure, not normal usage.

What are the limitations of NLA?

The main limitations are: risk of hallucination (NLA sometimes fabricates contextual details), high computational cost (RL training of dual models, hundreds of tokens per activation), and impracticality for large-scale monitoring during training.

Anthropic: Natural Language Autoencoders reveal Claude suspects evaluation in 26% of cases

How does NLA work?

What is evaluation awareness in 26% of cases?

How useful is NLA as a safety tool?

Frequently Asked Questions

Sources

Related news