Speculative Decoding

Speculative decoding is an inference-speedup technique in which a small, fast “draft” model proposes several future tokens at once, and a large target model verifies them in a single forward pass.

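Large language models generate text autoregressively, one token at a time, and every token requires a full forward pass through the network, so latency grows with output length. Speculative decoding relaxes this serial bottleneck: a cheap draft model (often a distilled or smaller variant) guesses, say, 3–8 upcoming tokens, and the target model computes the probabilities of all of them in one parallel pass. A modified rejection-sampling rule then checks the guesses in order: each drafted token is accepted with probability min(1, p/q), where p and q are the probabilities that the target and draft models assign to it. At the first rejection, a replacement token is sampled from the normalized positive part of the difference between the target and draft distributions, and the remaining guesses are discarded; if every guess is accepted, the same target pass yields one extra token. Crucially, the output follows exactly the same distribution as standard decoding from the target model, and under greedy decoding the tokens themselves match.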
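
The accept-or-correct rule can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the function name `speculative_step` and the array layout are assumptions made for the example, and the probability tables stand in for real model outputs.

```python
import numpy as np

def speculative_step(p_target, q_draft, draft_tokens, rng):
    """Verify drafted tokens against the target model (hypothetical helper).

    p_target:     array of shape (k + 1, V), target probabilities at each of
                  the k drafted positions plus one position after them.
    q_draft:      array of shape (k, V), draft probabilities per position.
    draft_tokens: k token ids, each sampled from the matching row of q_draft.
    Returns the tokens emitted by this step (between 1 and k + 1 of them).
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(int(tok))
            continue
        # First rejection: resample from the normalized residual max(0, p - q)
        # and discard all later drafted tokens.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        emitted.append(int(rng.choice(len(p), p=residual)))
        return emitted
    # Every draft was accepted: the same target pass gives one bonus token.
    bonus = rng.choice(p_target.shape[1], p=p_target[len(draft_tokens)])
    emitted.append(int(bonus))
    return emitted
```

When the draft and target distributions agree exactly, every guess is accepted and the step emits k + 1 tokens; the more they diverge, the earlier the first rejection tends to occur.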

The method was introduced in 2022 by Leviathan and colleagues at Google Research, and since 2024 it has become a standard part of production model serving, supported by vLLM, NVIDIA TensorRT-LLM, SGLang and others. Reported speedups are typically 2–3×, depending on how often the draft's guesses are accepted and how cheap the draft model is relative to the target; variants such as Medusa and EAGLE report larger gains. Because quality is unchanged, it is one of the most widely used latency optimizations for LLM inference.
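
To see where a figure in the 2–3× range can come from, the sketch below uses a common simplification: each drafted token is assumed to be accepted independently with the same rate `alpha`, and one draft-model pass is assumed to cost a fixed fraction `cost_ratio` of a target pass. The function names and the example numbers are illustrative assumptions, not measurements.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target pass when gamma tokens are drafted.

    Assumes each drafted token is accepted independently with probability
    alpha (a simplification of real acceptance behaviour).
    """
    if alpha == 1.0:
        return gamma + 1.0  # every draft accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha, gamma, cost_ratio):
    """Estimated wall-clock speedup over plain one-token-at-a-time decoding.

    cost_ratio is the assumed cost of one draft pass relative to one target
    pass; each step costs gamma draft passes plus one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)

# Illustrative values only: 80% acceptance, 4 drafted tokens, draft model
# at 5% of the target's cost gives an estimate of roughly 2.8x.
print(round(expected_speedup(0.8, 4, 0.05), 2))
```

Under this model the gain falls off quickly as the acceptance rate drops, which is why the quality of the draft model matters more than its raw speed.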
