Training
Self-supervised learning
Training approach where a model learns from unlabeled data by creating its own targets, such as predicting hidden tokens within a sentence.
Self-supervised learning (SSL) is a machine-learning paradigm in which a model learns from unlabeled data by constructing its own supervision signal. Instead of relying on human-curated labels, part of the input is hidden or corrupted and the model is trained to predict the missing piece from the surrounding context.
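To make this concrete, here is a minimal, library-free sketch of how a single unlabeled sentence becomes a supervised training pair by hiding one token. All names here are illustrative, not taken from any specific framework:

```python
import random

MASK = "[MASK]"

def make_masked_example(tokens: list[str]) -> tuple[list[str], int, str]:
    """Corrupt one position and return (corrupted input, position, target)."""
    pos = random.randrange(len(tokens))   # pick a token to hide
    target = tokens[pos]                  # the label comes from the data itself
    corrupted = tokens.copy()
    corrupted[pos] = MASK
    return corrupted, pos, target

sentence = "the cat sat on the mat".split()
x, pos, y = make_masked_example(sentence)
print(x, "-> predict", repr(y), "at position", pos)
# e.g. ['the', 'cat', '[MASK]', 'on', 'the', 'mat'] -> predict 'sat' at position 2
```

No human ever labeled this example; the supervision signal is manufactured from the raw text.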
The best-known examples are masked language modeling in BERT (hide a token, predict it from the surrounding context) and next-token prediction in the GPT family (predict each token from the ones before it). In vision, models such as SimCLR and DINO learn by encouraging different augmented views of the same image to map to similar representations.
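As a hedged illustration of the next-token objective, the PyTorch sketch below builds targets by shifting the input one position to the right. The tiny embedding-plus-linear "model" is a stand-in for a real transformer, chosen only to keep the example self-contained:

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 100, 32
embed = torch.nn.Embedding(vocab_size, dim)   # toy stand-in for a transformer
head = torch.nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))   # one unlabeled token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets = inputs shifted by one

logits = head(embed(inputs))                     # shape (1, 15, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # gradients flow to both modules
print(float(loss))
```

The key point is the second-to-last block: the targets are just the input sequence offset by one position, so any raw text corpus supplies both inputs and labels.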
Why it matters:
- Eliminates the bottleneck of manual labeling — the internet, books, and code already exist at scale
- Underlies the pretraining of nearly every modern large language model and foundation model
- The resulting representations are then fine-tuned for specific tasks using far fewer labeled examples (a minimal sketch follows this list)
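A minimal sketch of that fine-tuning step, assuming a stand-in module in place of a real pretrained checkpoint: freeze the encoder's weights and train only a new, task-specific head on a small labeled batch.

```python
import torch
import torch.nn.functional as F

dim, num_classes = 32, 3
# Stand-in "pretrained" encoder; in practice this would be a loaded checkpoint.
encoder = torch.nn.Sequential(torch.nn.Linear(16, dim), torch.nn.ReLU())
head = torch.nn.Linear(dim, num_classes)  # new head for the downstream task

for p in encoder.parameters():
    p.requires_grad = False  # keep the pretrained representations fixed

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# A tiny labeled set: 8 examples is enough to train the head in this sketch.
x, y = torch.randn(8, 16), torch.randint(0, num_classes, (8,))

optimizer.zero_grad()
logits = head(encoder(x))
loss = F.cross_entropy(logits, y)
loss.backward()        # only the head receives gradients
optimizer.step()
```

Freezing the encoder is one common choice; in practice the encoder is often unfrozen and trained at a lower learning rate once the head has stabilized.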
Yann LeCun calls SSL “the dark matter of intelligence” because humans and animals largely learn this way — by observing the world without explicit labels. SSL is the reason today’s AI systems can scale from millions to trillions of parameters without a proportional rise in labeling cost.