Training
Reinforcement Learning from Human Feedback (RLHF)
A training technique in which human raters rank model responses, and those rankings are used to fine-tune an LLM toward helpfulness and safety.
RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human preferences to steer a large language model toward more helpful, safer, and more appropriate responses. The process typically has three stages:
- Supervised fine-tuning (SFT): the base model is shown a curated set of high-quality prompt-response pairs and fine-tuned to imitate them.
- Reward model training: human raters rank multiple responses to the same prompt; from those rankings a separate reward model is trained to predict how much a human would prefer a given response (a loss sketch follows this list).
- RL optimization (usually PPO): the main LLM is further trained to maximize the reward model's score, with a KL penalty that keeps it from drifting too far from the SFT checkpoint (see the reward-shaping sketch below).
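To make the reward-model stage concrete, here is a minimal sketch of the standard pairwise (Bradley-Terry) ranking loss, assuming a PyTorch reward model that already maps a prompt-response pair to a scalar score. The function and variable names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: push the score of the human-preferred
    response above the score of the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Usage: hypothetical scalar scores for a batch of (chosen, rejected) pairs
chosen = torch.tensor([1.3, 0.2, 2.1])
rejected = torch.tensor([0.7, 0.5, 1.0])
loss = reward_model_loss(chosen, rejected)  # low when chosen > rejected
```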
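And a sketch of how the RL stage typically shapes the reward that PPO maximizes, assuming per-token log-probabilities are available from both the policy being trained and the frozen SFT reference; `beta` and the tensor shapes are assumptions for illustration.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,     # reward-model score per sequence
                  logp_policy: torch.Tensor,  # per-token log-probs, current policy
                  logp_ref: torch.Tensor,     # per-token log-probs, SFT reference
                  beta: float = 0.1) -> torch.Tensor:
    """Reward maximized by PPO: the reward-model score minus a KL penalty
    that keeps the policy close to the SFT checkpoint."""
    kl_per_token = logp_policy - logp_ref     # sample-based KL estimate
    return rm_score - beta * kl_per_token.sum(dim=-1)
```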
OpenAI used RLHF in 2022 to produce InstructGPT and ChatGPT, turning raw base models into useful assistants. Anthropic developed related variants, RLAIF (RL from AI Feedback) and Constitutional AI, in which the preference feedback comes from an AI model guided by a written set of principles rather than from human raters.
RLHF is the workhorse of modern AI alignment, but it has known weaknesses: it is expensive, reward models are easy to game (reward hacking), and human rankings carry rater bias. Newer methods such as DPO (Direct Preference Optimization) optimize the policy directly on preference pairs, skipping the explicit reward model and RL loop (a loss sketch follows).
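For contrast, here is a minimal sketch of the DPO loss, which replaces the reward model and PPO loop with a single classification-style objective over preference pairs; the names and the `beta` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_policy_chosen: torch.Tensor,
             logp_policy_rejected: torch.Tensor,
             logp_ref_chosen: torch.Tensor,
             logp_ref_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected) for a full response."""
    policy_margin = logp_policy_chosen - logp_policy_rejected
    ref_margin = logp_ref_chosen - logp_ref_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

The key design point is that the reward is defined implicitly by the policy's log-probability ratio against the reference model, so no separate reward network or RL rollout is needed.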