Training
Reinforcement Learning from Human Feedback (RLHF)
A training technique in which human raters rank model responses, and those rankings are used to fine-tune an LLM toward helpfulness and safety.
RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human preferences to steer a large language model toward more helpful, safer, and more appropriate responses. The process typically has three stages:
- Supervised fine-tuning (SFT): the base model is shown a curated set of high-quality prompt-response pairs and fine-tuned to imitate them.
- Reward model training: human raters rank multiple responses to the same prompt; from those rankings a separate reward model is trained to predict how much a human would prefer a given response (a loss sketch follows this list).
- RL optimization (usually PPO): the main LLM is further trained to maximize the reward model's score, with a KL penalty that keeps it from drifting too far from the SFT checkpoint (see the reward-shaping sketch below).
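To make the reward-model stage concrete, here is a minimal sketch of the standard pairwise (Bradley-Terry) ranking loss, assuming a PyTorch reward model that already maps a prompt-response pair to a scalar score. The function and variable names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: push the score of the human-preferred
    response above the score of the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Usage: hypothetical scalar scores for a batch of (chosen, rejected) pairs
chosen = torch.tensor([1.3, 0.2, 2.1])
rejected = torch.tensor([0.7, 0.5, 1.0])
loss = reward_model_loss(chosen, rejected)  # low when chosen > rejected
```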
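And a sketch of how the RL stage typically shapes the reward that PPO maximizes, assuming per-token log-probabilities are available from both the policy being trained and the frozen SFT reference; `beta` and the tensor shapes are assumptions for illustration.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,     # reward-model score per sequence
                  logp_policy: torch.Tensor,  # per-token log-probs, current policy
                  logp_ref: torch.Tensor,     # per-token log-probs, SFT reference
                  beta: float = 0.1) -> torch.Tensor:
    """Reward maximized by PPO: the reward-model score minus a KL penalty
    that keeps the policy close to the SFT checkpoint."""
    kl_per_token = logp_policy - logp_ref     # sample-based KL estimate
    return rm_score - beta * kl_per_token.sum(dim=-1)
```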
OpenAI used RLHF in 2022 to produce InstructGPT and ChatGPT, turning raw base models into useful assistants. Anthropic developed related variants, RLAIF (RL from AI Feedback) and Constitutional AI, in which the preference feedback comes from an AI model guided by a written set of principles rather than from human raters.
RLHF is the workhorse of modern AI alignment, but it has known weaknesses: it is expensive, reward models are easy to game (reward hacking), and human rankings carry rater bias. Newer methods such as DPO (Direct Preference Optimization) optimize the policy directly on preference pairs, skipping the explicit reward model and RL loop (a loss sketch follows).
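For contrast, here is a minimal sketch of the DPO loss, which replaces the reward model and PPO loop with a single classification-style objective over preference pairs; the names and the `beta` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_policy_chosen: torch.Tensor,
             logp_policy_rejected: torch.Tensor,
             logp_ref_chosen: torch.Tensor,
             logp_ref_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected) for a full response."""
    policy_margin = logp_policy_chosen - logp_policy_rejected
    ref_margin = logp_ref_chosen - logp_ref_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

The key design point is that the reward is defined implicitly by the policy's log-probability ratio against the reference model, so no separate reward network or RL rollout is needed.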