Constitutional AI
Anthropic's method for aligning models using a written set of principles (a constitution) plus AI feedback (RLAIF), rather than human labels of harmful outputs.
Constitutional AI (CAI) is a language-model alignment method developed by Anthropic. Instead of relying on humans to hand-label harmful outputs, the model is steered by a written set of principles — a “constitution” — together with feedback supplied by the AI itself.
The process runs in two phases. In the supervised phase, the model generates responses, critiques and revises them against the constitution’s principles, and is then fine-tuned on the revisions. In the reinforcement-learning phase, the model compares pairs of responses and picks the one that better follows the constitution; those AI-generated preferences train a reward model, which then supplies the reward signal for RL training of the model itself. This phase is called RLAIF (RL from AI Feedback): it mirrors RLHF, but replaces human raters with the model when judging harmlessness.
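The data flow of the two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Anthropic's implementation: `Model`, `toy_model`, the prompt templates, and the two principles in `CONSTITUTION` are all hypothetical stand-ins, the sketch applies every principle in turn rather than sampling one, and the actual fine-tuning and RL steps are left out. It shows only how the supervised phase yields (prompt, revision) pairs and how the RL phase yields (prompt, chosen, rejected) triples for a reward model.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion out.
Model = Callable[[str], str]

# Illustrative principles only; not the text of any real constitution.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest.",
]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> str:
    """Supervised phase: draft a response, then critique and revise it per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response: {response}\nCritique:"
        )
        response = model(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response: {response}\nCritique: {critique}\nRevision:"
        )
    return response


def build_sft_dataset(
    model: Model, prompts: List[str], principles: List[str]
) -> List[Tuple[str, str]]:
    """(prompt, final revision) pairs that the model would be fine-tuned on."""
    return [(p, critique_and_revise(model, p, principles)) for p in prompts]


def ai_preference(model: Model, prompt: str, a: str, b: str, principle: str) -> int:
    """RL phase: ask the model which response better follows the principle.

    Returns 0 if it picks A, otherwise 1.
    """
    verdict = model(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer with A or B:"
    )
    return 0 if verdict.strip().upper().startswith("A") else 1


def build_preference_dataset(
    model: Model, prompts: List[str], principles: List[str]
) -> List[Tuple[str, str, str]]:
    """(prompt, chosen, rejected) triples that would train the reward model."""
    rows = []
    for i, prompt in enumerate(prompts):
        a, b = model(prompt), model(prompt)  # two candidate responses
        # Assumption: rotate through principles deterministically.
        principle = principles[i % len(principles)]
        winner = ai_preference(model, prompt, a, b, principle)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        rows.append((prompt, chosen, rejected))
    return rows


def toy_model(prompt: str) -> str:
    """Deterministic stub so the sketch runs without a real model."""
    tail = prompt.rstrip()
    if tail.endswith("Critique:"):
        return "The response could be more careful."
    if tail.endswith("Revision:"):
        return "A more careful response."
    if tail.endswith("Answer with A or B:"):
        return "A"
    return "A first-draft response."


if __name__ == "__main__":
    prompts = ["Example user request"]
    print(build_sft_dataset(toy_model, prompts, CONSTITUTION))
    print(build_preference_dataset(toy_model, prompts, CONSTITUTION))
```

In a real pipeline, `toy_model` would be replaced by calls to an actual language model, the first dataset would feed a fine-tuning job, and the second would train the reward model used during RL.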
The constitution draws principles from sources such as the UN Universal Declaration of Human Rights. CAI is used in training Claude and remains an active AI safety topic for two reasons: it is an approach to scalable oversight, since AI feedback can cover far more outputs than human raters could review, and it makes the rules governing model behavior explicit and inspectable.