Speculative Decoding

An inference speedup where a small draft model proposes several tokens ahead and the large model verifies them in parallel, without changing the output distribution.

Speculative decoding is an inference-speedup technique in which a small, fast “draft” model proposes several future tokens, and a large target model verifies all of them in a single forward pass.

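Large language models generate one token at a time, which is slow because every step requires a full forward pass through the network. Speculative decoding relaxes this serial bottleneck: a cheap draft model (often a distilled or smaller variant) guesses, say, 3–8 upcoming tokens, and the target model computes the probabilities for all of those positions in one parallel pass. A modified rejection-sampling rule then checks the guesses in order. Each drafted token is accepted with probability min(1, p/q), where p and q are the probabilities that the target and the draft model assign to it. At the first rejection, the remaining guesses are discarded and a replacement token is sampled from the normalized residual max(0, p - q). If every guess is accepted, the same target pass also yields one additional token. Crucially, the output distribution is mathematically identical to that of standard decoding from the target model alone.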
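The accept-or-resample rule can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the function name `verify_draft` and its arguments are hypothetical, distributions are plain lists of probabilities over a toy vocabulary, and the target distributions are assumed to have been computed already in a single parallel pass.

```python
import random

def verify_draft(draft_tokens, q_dists, p_dists, rng=random):
    """Run one verification round and return the tokens to emit.

    draft_tokens: k token ids proposed by the draft model.
    q_dists[i]:   draft distribution that draft_tokens[i] was sampled from.
    p_dists[i]:   target distribution at the same position; p_dists has
                  k + 1 entries, the last one is used for the extra token.
    """
    vocab = range(len(p_dists[0]))
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the guess with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q)
        # (rng.choices normalizes the weights) and drop later guesses.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(vocab, weights=residual)[0])
        return emitted
    # Every guess was accepted: take one extra token from the target.
    emitted.append(rng.choices(vocab, weights=p_dists[-1])[0])
    return emitted
```

Each round therefore emits between 1 and k + 1 tokens. When the draft and target distributions agree, every guess is accepted; when the target assigns zero probability to a guess, that guess is always rejected and replaced.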

The method was introduced in 2022 by Leviathan and colleagues at Google Research, and since 2024 it has become a standard part of production model serving, supported by vLLM, NVIDIA TensorRT-LLM, SGLang and others. Reported speedups are typically 2–3× (variants such as Medusa and EAGLE report larger gains) with no loss of output quality. The actual gain depends on how often the target model accepts the draft's guesses and on how cheap the draft model is relative to the target. This makes it one of the most important latency optimizations for LLM inference.
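A rough feel for where a 2–3× figure comes from can be had from the simplified analysis in the original paper, which assumes each drafted token is accepted independently with the same probability. The sketch below encodes that estimate; the function and parameter names (`alpha` for the acceptance rate, `gamma` for the number of drafted tokens, `cost_ratio` for draft-step time divided by target-pass time) are chosen here for illustration, and real workloads will deviate from the independence assumption.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target pass.

    alpha: probability that a drafted token is accepted (assumed constant).
    gamma: number of tokens the draft model proposes per round.
    """
    if alpha >= 1.0:
        return gamma + 1.0  # every guess accepted, plus the extra token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha, gamma, cost_ratio):
    """Estimated wall-clock speedup over plain decoding.

    cost_ratio: time of one draft step divided by time of one target pass.
    One round costs gamma draft steps plus one target pass.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)
```

For example, `expected_speedup(0.8, 4, 0.05)` evaluates to roughly 2.8: with an 80% acceptance rate and four drafted tokens, each target pass yields about 3.4 tokens at about 1.2 times the cost of a plain decoding step.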
