The safety paradox — Posterior Attack on LLMs

A paper on arXiv shows that safety alignment paradoxically creates vulnerability in large language models. The 'Posterior Attack' is a single-query jailbreak that exploits the model's ability to recognize harmful content. It was tested on 30 open-source LLMs and on frontier models such as GPT-5 and Claude 4.6.

The paper is arXiv:2606.05614 (v1, June 4, 2026, 02:36 UTC). Its central contribution is the “Posterior Attack”, a single-query jailbreak that exploits the very capability safety alignment is meant to build: the model’s ability to recognize harmful content.

What does the “safety paradox” claim?

The paper’s central thesis is that safety alignment, the process by which models are tuned to refuse harmful requests, opens up a new vulnerability. Greater safety awareness does not translate into greater resilience. The authors report the opposite relationship: the better a model is at recognizing harmful content, the more susceptible it is to an attack that exploits that very ability. Hence the name “safety paradox”.

How does the Posterior Attack work?

The Posterior Attack is a single-query jailbreak: it succeeds with one query and needs no multi-step manipulation. It exploits the model’s ability to recognize harmful content, using the model’s own safety judgment as leverage. The mechanism that should protect the model thus becomes the vector through which the protection is bypassed.

On which models was the attack verified?

The authors tested the Posterior Attack on a broad sample: 30 open-source LLMs plus frontier models, including GPT-5 and Claude 4.6. The results are consistent across that sample: a stronger capacity for safety judgment goes with higher susceptibility to the attack. In other words, models with more developed safety awareness proved more vulnerable to this specific attack.
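One way to check a claim of this kind, that susceptibility rises with safety-judgment capacity across models, is a rank correlation between the two measurements. The sketch below is not the authors' analysis. It assumes each model has a hypothetical safety-awareness score and a measured attack success rate, and the numbers are invented placeholders, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


# Placeholder values invented for illustration; NOT data from the paper.
# One entry per hypothetical model.
safety_awareness = [0.55, 0.62, 0.70, 0.81, 0.90]
attack_success_rate = [0.10, 0.25, 0.20, 0.45, 0.60]

rho = spearman(safety_awareness, attack_success_rate)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho close to +1 would match the direction the paper reports. A rank-based measure is used here because it does not assume the relationship is linear.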

What evidence supports the thesis?

The paper supports the thesis both analytically and empirically. In the RL (reinforcement learning) experiments, the authors report a direct relationship: degrading safety awareness reduces vulnerability, while strengthening it amplifies vulnerability. This controlled manipulation of safety awareness, together with the measurement of its effect on susceptibility to the attack, forms the empirical core of the paper.
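A comparison of this kind is usually summarized as an attack success rate per condition, with an uncertainty interval. The following is a minimal sketch under stated assumptions: the three condition names and all counts are hypothetical placeholders, and the Wilson score interval is a standard choice for proportions, not necessarily what the authors used.

```python
from math import sqrt


def attack_success_rate(successes, trials):
    """Fraction of attack attempts that bypassed the model's refusal."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = attack_success_rate(successes, trials)
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Placeholder counts invented for illustration; NOT results from the paper.
# Each entry is (successful attacks, total attempts) for a hypothetical checkpoint.
conditions = {
    "awareness_degraded": (12, 100),
    "baseline": (30, 100),
    "awareness_strengthened": (55, 100),
}

report = {
    name: (attack_success_rate(s, n), wilson_interval(s, n))
    for name, (s, n) in conditions.items()
}

for name, (asr, (low, high)) in report.items():
    print(f"{name}: ASR={asr:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

If the intervals for the degraded and strengthened conditions do not overlap and the rates rise in that order, the pattern points in the direction the paper describes.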

Why is the finding important for AI safety?

The finding is important because it challenges the intuition that “more safety alignment is always better”. If strengthening safety awareness simultaneously opens a new attack vector, development teams must balance safety mechanisms more carefully and consider defenses resilient to attacks like the Posterior Attack. The fact that frontier models such as GPT-5 and Claude 4.6 are also affected shows that this is a systemic, not an isolated, problem.

Frequently Asked Questions

What is the 'Posterior Attack'?

The Posterior Attack is a single-query jailbreak that exploits the model's very ability to recognize harmful content. In other words, the safety judgment that should protect the model becomes an attack vector that makes it more vulnerable.

On which models was the attack tested?

The attack was tested on 30 open-source LLMs and on frontier models, including GPT-5 and Claude 4.6. The results show that a stronger capacity for safety judgment increases susceptibility to the attack.

What exactly is the 'safety paradox'?

The paradox is that safety alignment, which is meant to reduce risk, actually creates vulnerability. The authors show analytically and through RL experiments that degrading safety awareness reduces vulnerability to the attack, while strengthening it amplifies that vulnerability.
