AI safety

AI safety is an interdisciplinary field concerned with the risks of AI systems — from concrete present-day failures and misuse to longer-term existential concerns about highly capable systems.

A typical risk taxonomy:

Misuse: generation of deepfakes, disinformation, malware code, biochemical instructions; deliberate abuse
Accidents: hallucinations, systems doing the wrong thing in good faith (autonomous cars, medical advice)
Structural: concentration of power, labor-market impact, surveillance, state-level abuse
Catastrophic / existential: future systems beyond human oversight

Practical tools include: capability and dangerous-behavior evaluations, red teaming, interpretability (mechanistic analysis of internal representations), production monitoring, access controls, model cards and system cards, responsible scaling policies (Anthropic), preparedness frameworks (OpenAI), and frontier safety frameworks (Google DeepMind).

Institutionally, 2024–2026 brought a wave of national AI safety institutes (UK, US, Singapore, EU, with Croatia indirectly via the EU AISB). The EU AI Act bakes some safety requirements into law — mandatory evaluations for GPAI with systemic risk, transparency obligations, human oversight on high-risk systems.

AI safety is often conflated with AI alignment — alignment is narrower (the model’s internal objectives), while safety encompasses the entire sociotechnical system around the AI.

Sources

See also