
Knowledge distillation

Compression technique in which a smaller student model learns to mimic the outputs of a larger teacher model, shrinking size while preserving most of the teacher's accuracy.

Knowledge distillation is a model-compression technique in which a smaller “student” network is trained to imitate a larger “teacher” network. Rather than learning only from the hard labels in a dataset, the student is trained on the soft probability distributions produced by the teacher, which carry far richer information about how the teacher generalizes than a single correct class does.
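
The difference between a hard label and a soft target is easy to see numerically. Below is a minimal illustrative sketch (not taken from any source); the class names, logits, and temperature are invented, and the only point is that the teacher's softened distribution also encodes which wrong classes it considers plausible.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [6.0, 3.5, 0.5]               # invented classes: cat, dog, car

hard_label   = np.array([1.0, 0.0, 0.0])       # dataset label says only "cat"
soft_targets = softmax(teacher_logits, temperature=4.0)

print(hard_label)     # [1. 0. 0.]
print(soft_targets)   # roughly [0.56 0.30 0.14]: "dog" is far more plausible than "car"
```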

The technique was popularized by Hinton, Vinyals, and Dean in 2015. A typical pipeline:

  • A large language model (the teacher) generates outputs or logits over a large prompt set
  • A smaller student is trained to minimize the divergence between its outputs and the teacher’s, often using a “temperature softmax” for a smoother distribution
  • Optionally combined with classic fine-tuning on ground-truth labels (a combined loss is sketched below)
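
One common way to write this objective is sketched below, assuming PyTorch. The temperature T, the mixing weight alpha, and the function name are illustrative choices, not values prescribed by the original paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft-target term: KL divergence between the temperature-softened
    # student and teacher distributions. Scaling by T**2 keeps gradient
    # magnitudes comparable across temperatures (Hinton et al., 2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * soft + (1.0 - alpha) * hard

# Typical use: the teacher runs frozen and only the student receives gradients.
# student_logits = student(batch)
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student_logits, teacher_logits, labels)
```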

Distillation is why practical small versions of powerful models exist today — e.g. DistilBERT (40% smaller than BERT at 97% of the performance), Llama 3.2 1B/3B, Gemma 2B, and numerous local distillations of GPT-4 and Claude. Apple Intelligence and on-device phone models lean heavily on distillation to fit large-model capability into a few gigabytes of RAM.

The limitation is that students rarely match teachers on edge cases and complex reasoning, and quality depends heavily on the diversity of prompts used during transfer.
