Google: Gemini Nano on Pixel is 50%+ faster with frozen multi-token prediction
Google accelerated Gemini Nano inference on Pixel 9 and 10 by more than 50% using frozen multi-token prediction, a technique that generates an average of roughly 2 tokens per model pass and saves 130 MB of memory per instance without changing the model's output.
This article was generated using artificial intelligence from primary sources.
How does the frozen MTP head accelerate Gemini Nano?
Multi-token prediction (MTP) is a technique in which a model proposes several tokens ahead in a single pass, instead of the standard approach that produces only one token per call. Google applied a frozen variant: the MTP head cross-attends to the frozen KV cache (the temporary key-value memory) of the main model, so no separate drafter computation is needed. The head yields an average of roughly 2 additional tokens per pass, and because the main model accepts or rejects every proposal, the final output is bit-for-bit identical to that of the original model.
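The propose-then-verify loop described above can be illustrated with a minimal sketch. It is generic speculative decoding with greedy verification, not Google's implementation: `main_model_next`, `draft_head`, and the toy token rule are hypothetical stand-ins, and a real system would verify all proposals in one batched pass of the main model rather than one call per position.

```python
from typing import List, Tuple

def main_model_next(context: List[int]) -> int:
    """Hypothetical stand-in for the main model: a deterministic toy rule."""
    return (context[-1] * 3 + len(context)) % 11

def draft_head(context: List[int], k: int) -> List[int]:
    """Hypothetical drafter: proposes k tokens and is deliberately wrong
    whenever the context length is a multiple of 3."""
    ctx = list(context)
    proposals = []
    for _ in range(k):
        guess = main_model_next(ctx)
        if len(ctx) % 3 == 0:
            guess = (guess + 1) % 11  # inject a wrong proposal
        proposals.append(guess)
        ctx.append(guess)
    return proposals

def greedy_decode(prompt: List[int], n: int) -> List[int]:
    """Baseline: one token per call to the main model."""
    out = list(prompt)
    for _ in range(n):
        out.append(main_model_next(out))
    return out

def speculative_decode(prompt: List[int], n: int, k: int) -> Tuple[List[int], int]:
    """Accept the matching prefix of each draft; at the first mismatch keep
    the main model's token instead. Returns (tokens, number of passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        proposals = draft_head(out, k)
        passes += 1
        for proposed in proposals:
            target = main_model_next(out)
            out.append(target)  # only main-model tokens are ever kept
            if proposed != target:
                break  # reject the rest of the draft
        else:
            out.append(main_model_next(out))  # all accepted: one bonus token
    return out[: len(prompt) + n], passes

baseline = greedy_decode([1, 2], 12)
tokens, passes = speculative_decode([1, 2], 12, k=2)
print(tokens == baseline, passes)
```

Because every token that is kept comes from the main model's own rule, the speculative output matches the baseline exactly; the drafter only affects how many passes are needed.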
How much faster and cheaper on-device?
Inference on Pixel 9 is more than 50% faster than with standalone drafter models, the separate, smaller networks that previously generated the draft proposals. Alongside speed, the architecture saves 130 MB of memory per instance, which is critical on mobile devices with limited RAM. For predictable structures such as smart replies, the acceptance rate of proposed tokens is 55% higher than with the standard approach.
Zero-copy architecture and deployment on Pixel
Google described the approach as a zero-copy architecture: the MTP head shares the KV cache with the main model without copying intermediate results, eliminating one of the main sources of memory and computational overhead in speculative decoding. The technique is already deployed on Pixel 9 and Pixel 10 for two features: AI Notification Summaries and Proofread (text proofreading). Both use a local, on-device model without sending data to the cloud.
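The difference between sharing a buffer and copying it can be shown with a small sketch. This is only an analogy in plain Python, assuming a hypothetical `KVCache` class backed by a `bytearray`; it says nothing about how Gemini Nano actually lays out its cache.

```python
class KVCache:
    """Hypothetical stand-in for the main model's key-value cache."""

    def __init__(self, size: int) -> None:
        self.buffer = bytearray(size)

    def write(self, offset: int, data: bytes) -> None:
        # Write in place, byte by byte, so the buffer is never reallocated.
        for i, value in enumerate(data):
            self.buffer[offset + i] = value

cache = KVCache(8)

# Zero-copy: the head reads the same memory the main model writes.
shared_view = memoryview(cache.buffer)

# Copying: a separate snapshot that costs extra memory and goes stale.
copied = bytes(cache.buffer)

cache.write(0, b"\x01\x02")

print(shared_view[0], shared_view[1])  # the view sees the new entries
print(copied[0], copied[1])            # the copy still holds the old values
```

A reader of the shared view sees new cache entries as soon as they are written and needs no second buffer, which is the property the zero-copy description relies on.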
Broader context: on-device AI without compromise
Until now, inference speedups on mobile devices often required separate, smaller drafter models, which add to the memory footprint and can sometimes change the output. Google's approach shows that a frozen MTP head can be integrated into the existing Gemini Nano model without retraining from scratch and without loss of accuracy, a step toward on-device AI that is both fast and faithful to the original model's behavior.
Frequently Asked Questions
- What is multi-token prediction and how does it differ from standard generation?
- Standard language models generate one token per call. Multi-token prediction (MTP) uses additional heads that propose several tokens ahead in a single pass, which the main model then accepts or rejects. The output is identical, but inference is faster.
- Why is the MTP head frozen and what does that mean in practice?
- Freezing means the MTP head weights are not trained together with the main model but learned once and kept fixed; this allows sharing the KV cache with the main model without recomputing, delivering both speedup and memory savings.