Safety

AI alignment

The research field that aims to ensure AI systems follow human intent, values, and safety goals rather than pursuing unintended objectives.

AI alignment is the research field focused on building AI systems whose goals and behavior reliably match human intent and values. The problem is commonly split into outer alignment (correctly specifying the objective we actually want) and inner alignment (ensuring the trained model genuinely pursues that objective rather than some proxy of it).

Classic problems alignment tries to address:

  • Reward hacking — the model finds shortcuts that maximize the stated metric while betraying the spirit of the task (see the sketch after this list)
  • Specification gaming — the system follows the letter of instructions, not the meaning
  • Deceptive alignment — more capable models might learn to feign alignment during training while diverging in deployment
  • Hallucinations — the model confidently fabricates information that is hard for users to verify
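
To make reward hacking concrete, here is a minimal, self-contained sketch; the proxy metric and all names are invented for illustration, not taken from any paper:

```python
# Toy illustration of reward hacking: an optimizer that sees only a
# proxy reward (word count) picks a degenerate output over an honest one.

def proxy_reward(summary: str) -> int:
    """What the optimizer is scored on: longer summaries score higher."""
    return len(summary.split())

def true_quality(summary: str, facts: set[str]) -> int:
    """What we actually want: how many key facts the summary covers."""
    return sum(fact in summary.lower() for fact in facts)

facts = {"revenue rose 10%", "ceo resigned"}
honest = "Revenue rose 10% and the CEO resigned."
hacked = "great news " * 25  # maximizes word count, conveys nothing

# The optimizer sees only the proxy and dutifully maximizes it.
best = max([honest, hacked], key=proxy_reward)

print(best is hacked)             # True: the metric was gamed
print(true_quality(best, facts))  # 0: the spirit of the task is lost
```

Scaled up, the same dynamic appears whenever a benchmark score or learned reward model is an imperfect proxy for the task we actually care about.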

The dominant practical tool is reinforcement learning from human feedback (RLHF) and its variants (RLAIF, DPO, Constitutional AI). Anthropic’s founding thesis is that alignment is the central problem of AI development; OpenAI ran a dedicated Superalignment team (announced in 2023, disbanded in 2024); Google DeepMind publishes ongoing work on scalable oversight and debate.
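
As a concrete example of the preference-tuning variants above, here is a minimal sketch of the DPO loss (Rafailov et al., 2023); it assumes summed per-sequence log-probabilities under the policy and a frozen reference model have already been computed, and all tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) completion pairs.

    Each argument holds summed per-sequence log-probs under the
    trainable policy or the frozen reference model.
    """
    # Implicit reward: how far the policy has moved from the reference
    # on each completion, scaled by the temperature beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin: push chosen above rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the preference signal is folded directly into this loss, DPO needs no separate reward model or RL loop, which is much of its practical appeal.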

Alignment overlaps with broader AI safety but is narrower in focus: it concerns the model’s internal objectives, whereas safety also covers misuse, system security, and societal risks. Alignment evaluations are increasingly a de facto requirement for frontier model releases.
