Guardrails — Glossary | 24 AI

Guardrails are controls that constrain the inputs and outputs of an AI system so it stays within permitted boundaries. Unlike safety baked into the model through training, guardrails are a separate layer that sits around the model — between the user and the large language model.

They typically combine input and output checks: content classifiers (hate, violence, self-harm), personally identifiable information (PII) detection, topic filters, and detectors for jailbreaks and prompt injection. A failed check blocks or rewrites the response before it reaches the user.

Through 2025–2026, guardrails have become a standard product. Anthropic’s “Constitutional Classifiers” filter the overwhelming majority of jailbreaks with minimal over-refusals, while OpenAI ships a configurable Guardrails framework with checks for moderation, PII, and prompt injection. Because guardrails are probabilistic and bypassable, they complement — rather than replace — AI safety work and rigorous evaluation.

Sources