Training
Reinforcement Learning
A training paradigm where an agent learns to make decisions by interacting with an environment, guided by reward signals; the basis of RLHF and reasoning-model training.
Reinforcement Learning (RL) is a machine-learning paradigm in which an agent learns to make decisions through trial and error within an environment. At each step the agent observes a state, takes an action, and receives a scalar reward signal; the goal is to learn a policy (a mapping from states to actions) that maximizes cumulative reward over time. Unlike supervised learning, there is no labeled “correct answer”: the agent learns only from the consequences of its own actions.
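The observe-act-reward loop described above can be sketched with tabular Q-learning, one classic RL algorithm among many. The corridor environment, the `step` function, and all hyperparameters below are assumptions made up for illustration; this is a minimal sketch, not how any particular system is trained.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (hypothetical toy task)
ACTIONS = (-1, +1)    # action index 0 = move left, 1 = move right

def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # scalar reward only when the goal is reached
    return next_state, reward, done

def greedy_action(q, state, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a in range(len(ACTIONS)) if q[state][a] == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action] value estimates
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit, occasionally explore.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy_action(q, state, rng)
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted future value.
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][a] += alpha * (reward + future - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy policy for the non-terminal cells: the action index with the highest value.
policy = [max(range(len(ACTIONS)), key=lambda a: q_table[s][a])
          for s in range(N_STATES - 1)]
```

No cell is ever labeled with a "correct" action; the preference for moving right emerges only because reward propagates backward from the goal through the value estimates.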
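RL has driven advances in robotics and game playing for decades (notable examples include AlphaGo and Atari-playing agents), but today it is also central to large language models. In RLHF (reinforcement learning from human feedback), RL turns a base model into a helpful assistant by optimizing the model's outputs against a reward model learned from human preferences.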
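To make "a reward model learned from human preferences" more concrete: one common formulation (an assumption here, since the entry does not name a specific loss) is a pairwise Bradley-Terry objective, which penalizes the reward model whenever it scores the human-rejected response above the human-preferred one. The function name and the scalar inputs below are hypothetical; in practice the scores would come from a neural network.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably.

    Uses the identity -log(sigmoid(x)) = softplus(-x)
                                       = max(-x, 0) + log1p(exp(-|x|)).
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss shrinks as the preferred response is scored further above the rejected one.
loss_undecided = preference_loss(0.0, 0.0)   # equals log(2)
loss_correct = preference_loss(2.0, -1.0)    # preferred response ranked higher
loss_wrong = preference_loss(-1.0, 2.0)      # preferred response ranked lower
```

Once trained, the reward model's scalar output plays the role of the reward signal in the RL loop, standing in for a human rater.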
Since 2024, RL has also become the engine of reasoning models: trained on verifiable tasks (math, code) with a reward for a correct final answer, systems like OpenAI o1/o3 and DeepSeek-R1 learn to produce long chains of thought without human-labeled reasoning examples. Reward hacking and training instability remain key challenges.
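A verifiable reward of the kind described above can be as simple as a function that checks the final answer against a known reference. The sketch below assumes a hypothetical output convention in which the model ends its response with a line of the form `Answer: <number>`; the pattern and function name are illustrative and not taken from any specific system.

```python
import re

# Hypothetical convention: the completion must end with "Answer: <number>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$")

def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Only the final answer is checked; the reasoning text before it is
    neither rewarded nor penalized.
    """
    match = ANSWER_PATTERN.search(completion.strip())
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if float(match.group(1)) == float(reference) else 0.0

samples = [
    "6 * 7 = 42.\nAnswer: 42",          # correct final answer
    "I think it is 41.\nAnswer: 41",    # wrong final answer
    "The answer is probably 42.",       # right number, wrong format
]
rewards = [verifiable_reward(s, "42") for s in samples]  # [1.0, 0.0, 0.0]
```

Because the reward depends only on what the checker can see, any gap between the check and the intended behavior is a potential opening for reward hacking, for example a response that satisfies the format without sound reasoning.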