Training
Knowledge distillation
Compression technique where a smaller student model learns to mimic the outputs of a larger teacher model, shrinking model size while preserving most of its accuracy.
Knowledge distillation is a model-compression technique in which a smaller “student” network is trained to imitate a larger “teacher” network. Rather than learning only from the hard labels in a dataset, the student is trained on the soft probability distributions produced by the teacher — these carry much richer information about how the teacher generalizes.
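To make the difference concrete, here is a small, hypothetical example of what a teacher's soft targets look like next to a hard label; the logits and temperature below are arbitrary illustrative values, not from any real model:

```python
import torch
import torch.nn.functional as F

# Made-up teacher logits for a 4-class problem.
teacher_logits = torch.tensor([4.0, 2.5, 1.0, -1.0])

hard_label = teacher_logits.argmax()                    # tensor(0): just "class 0"
soft_targets = F.softmax(teacher_logits / 2.0, dim=-1)  # roughly [0.56, 0.27, 0.13, 0.05]

# The soft distribution also tells the student that class 1 is a plausible
# alternative and class 3 is not, information the hard label throws away.
```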
The technique was popularized by Hinton, Vinyals, and Dean in 2015. A typical pipeline:
- A large language model (the teacher) generates outputs or logits over a large prompt set
- A smaller student is trained to minimize the divergence between its outputs and the teacher’s, often using a temperature-scaled softmax that smooths both distributions (see the sketch after this list)
- Optionally combined with classic fine-tuning on ground-truth labels
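Put together, the training objective is usually a weighted sum of a soft-target term and a hard-label term. The following is a minimal PyTorch sketch under common assumptions; the function name, `temperature`, and `alpha` weighting are illustrative choices, not part of any particular library:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine soft-target KL divergence with hard-label cross-entropy.

    student_logits, teacher_logits: (batch, num_classes) raw scores
    labels: (batch,) ground-truth class indices
    temperature: softens both distributions; higher means smoother targets
    alpha: weight on the distillation term vs. the hard-label term
    """
    # Soften teacher and student distributions with the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions; scaling by T^2 keeps
    # its gradient magnitude comparable to the hard-label loss (Hinton et al.).
    kd_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

    # Standard cross-entropy on the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    return alpha * kd_loss + (1 - alpha) * ce_loss
```

In practice the teacher's logits are either precomputed over the prompt set or generated on the fly, and `temperature` and `alpha` are tuned per task.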
Distillation is why practical small versions of powerful models exist today — e.g. DistilBERT (40% smaller than BERT at 97% of the performance), Llama 3.2 1B/3B, Gemma 2B, and numerous local distillations of GPT-4 and Claude. Apple Intelligence and on-device phone models lean heavily on distillation to fit large-model capability into a few gigabytes of RAM.
The limitation is that students rarely match teachers on edge cases and complex reasoning, and quality depends heavily on the diversity of prompts used during transfer.