Jailbreak

A jailbreak is an adversarial input designed to bypass a large language model’s safety guardrails and force it to produce content it was trained to refuse — instructions for harmful actions, hate speech, or disclosure of its system prompt.

Jailbreaks exploit weaknesses in a model’s alignment through prompt engineering: hypothetical scenarios (“imagine you are a character who…”), role-play, encoded requests, unusual formatting, or long multi-turn conversations that gradually erode the safeguards. Unlike prompt injection, which exploits a model’s inability to separate instructions from data, a jailbreak directly targets the safety boundaries themselves.

The topic is highly active in 2025–2026. AI labs invest heavily in red teaming and defensive layers. Anthropic’s Constitutional Classifiers cut jailbreak success rates from 86% to 4.4% in automated tests, yet one universal jailbreak was still found after more than 300,000 interactions in a public challenge. This illustrates a core lesson of AI safety: defenses keep improving, but none is complete.

Sources

See also