Safety

Guardrails

Safety controls and filters that constrain an AI model's inputs and outputs — content classifiers, policy filters, and attack detectors placed around the model.

Guardrails are controls that constrain the inputs and outputs of an AI system so it stays within permitted boundaries. Unlike safety baked into the model through training, guardrails are a separate layer that sits around the model — between the user and the large language model.

They typically combine input and output checks: content classifiers (hate, violence, self-harm), personally identifiable information (PII) detection, topic filters, and detectors for jailbreaks and prompt injection. A failed check blocks or rewrites the response before it reaches the user.

Through 2025–2026, guardrails have become a standard product. Anthropic’s “Constitutional Classifiers” filter the overwhelming majority of jailbreaks with minimal over-refusals, while OpenAI ships a configurable Guardrails framework with checks for moderation, PII, and prompt injection. Because guardrails are probabilistic and bypassable, they complement — rather than replace — AI safety work and rigorous evaluation.

Sources