Training

Self-supervised learning

Training approach where a model learns from unlabeled data by creating its own targets, such as predicting hidden tokens within a sentence.

Self-supervised learning (SSL) is a machine-learning paradigm in which a model learns from unlabeled data by constructing its own supervision signal. Instead of relying on human-curated labels, part of the input is hidden or corrupted and the model is trained to predict the missing piece from the surrounding context.
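The target-construction step can be sketched in a few lines. This is a minimal illustration, not any library's real preprocessing: the function name, mask token, and masking rate are all assumptions made for the example. Positions that are hidden keep their original token as the prediction target; everywhere else there is nothing to predict.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Build a self-supervised training pair from unlabeled tokens:
    hide a random subset and use the hidden originals as targets."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)    # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets
```

No human annotation appears anywhere: the "labels" are simply the tokens that were hidden, so any raw text corpus becomes training data.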

The best-known examples are masked language modeling in BERT (mask a word, predict it from context) and next-token prediction in the GPT family (predict each token from the ones before it). In vision, models like SimCLR and DINO learn by making the representations of different augmented views of the same image agree.
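Next-token prediction manufactures its targets even more directly: each token's label is simply the token that follows it. A toy sketch (the helper name is illustrative, not part of any GPT pipeline):

```python
def next_token_pairs(tokens):
    """Turn a raw token sequence into (input, target) training pairs:
    the target at each position is just the next token in the sequence."""
    return list(zip(tokens[:-1], tokens[1:]))

pairs = next_token_pairs(["the", "cat", "sat"])
# → [("the", "cat"), ("cat", "sat")]
```

A sequence of N tokens yields N-1 supervised pairs for free, which is why raw text corpora translate directly into training signal at scale.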

Why it matters:

  • Eliminates the bottleneck of manual labeling — raw text from the web, books, and code already exists at scale
  • Underlies the pretraining of nearly every modern large language model and foundation model
  • The resulting representations are then fine-tuned for specific tasks using far fewer labeled examples

Yann LeCun calls SSL “the dark matter of intelligence” because humans and animals largely learn this way — by observing the world without explicit labels. SSL is the reason today’s AI systems can scale from millions to trillions of parameters without a proportional rise in labeling cost.
