Safety

Red team (AI)

Structured adversarial testing of AI systems — prompt injection, jailbreaks, misuse — designed to surface vulnerabilities before production launch.

A red team is the practice of deliberately attacking your own AI system with skilled testers in order to surface scenarios in which the system does something dangerous, misaligned, or undesired — before a malicious outsider or a journalist on a deadline finds it first.

The term is borrowed from military and cybersecurity practice, but AI red teaming has its own specifics:

  • Prompt injection — getting the model to ignore its system prompt or reveal hidden instructions
  • Jailbreak — bypassing safety training so the model produces content it would normally refuse (weapons, infrastructure attack, illegal advice)
  • Capability elicitation — testing whether the model can perform a dangerous task at all when carefully scaffolded
  • Misuse scenarios — phishing emails, malicious code, disinformation
  • Bias and fairness — generating stereotypes or discriminatory decisions

Major labs (OpenAI, Anthropic, Google DeepMind) now routinely publish “system cards” detailing the red-team protocol for each frontier model. The EU AI Act and UK AISI evaluations require red teaming for GPAI with systemic risk.

Red teaming complements rather than replaces AI safety and alignment techniques — what red teamers miss, real-world users or attackers will find. An entire industry of specialized red-team firms and bug-bounty programs has grown up since 2023.

Sources

See also