Google: DiffusionGemma 26B — 4× faster text generation via diffusion approach
DiffusionGemma is Google's 26B MoE model that generates text using a diffusion approach — in parallel rather than sequentially. It achieves more than 1,000 tokens per second on a single H100 GPU, up to 4× faster than standard autoregressive models, with a quality trade-off versus Gemma 4.
This article was generated using artificial intelligence from primary sources.
Google has released DiffusionGemma, a 26B-parameter model that generates text by diffusion rather than by the token-by-token decoding used in most widely deployed language models.
What does diffusion text generation mean?
Diffusion text generation takes a different approach from classic autoregressive models such as GPT or standard Gemma 4. Instead of producing one token at a time in a sequential loop, DiffusionGemma generates an entire block of 256 tokens in parallel in each forward pass. The result is substantially higher throughput on modern GPU hardware.
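The difference in the number of forward passes can be illustrated with a small counting sketch. It assumes only the figures given above (one token per pass for autoregressive decoding, 256 tokens per pass for DiffusionGemma). The function names and the `steps_per_block` parameter are hypothetical: diffusion samplers in general may refine a block over several passes, and the article does not say how many DiffusionGemma uses.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens


def block_parallel_passes(num_tokens: int, block_size: int = 256,
                          steps_per_block: int = 1) -> int:
    """Block-parallel decoding: passes needed when each pass covers a block.

    steps_per_block is a hypothetical knob for samplers that refine each
    block several times; the default of 1 mirrors the article's description.
    """
    if num_tokens < 0 or block_size <= 0 or steps_per_block <= 0:
        raise ValueError("invalid arguments")
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    n = 1024
    print("autoregressive passes:", autoregressive_passes(n))
    print("block-parallel passes:", block_parallel_passes(n))
```

Pass counts are not wall-clock time: a pass over 256 positions costs more than a pass over a single position, so the ratio of pass counts is not a direct measure of speedup.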
How much faster is it really?
On a single NVIDIA H100 GPU the model achieves more than 1,000 tokens per second. On a consumer RTX 5090 the speed is 700+ tokens per second. According to Google’s measurements, that is up to 4× faster than comparable autoregressive models on the same GPU — a difference that is especially visible during long generations or under high throughput demands.
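To put these figures in perspective, the following sketch converts throughput into generation time. It treats the reported numbers as round values (the article gives them as lower bounds), and the 250 tokens-per-second baseline is only what "up to 4×" would imply at best, not a measured figure. All names are illustrative.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Figures from the article, rounded down to their stated lower bounds.
H100_TPS = 1000.0
RTX_5090_TPS = 700.0
MAX_SPEEDUP = 4.0

# Assumption: baseline implied by the best-case 4x claim on an H100.
IMPLIED_BASELINE_TPS = H100_TPS / MAX_SPEEDUP  # 250.0

if __name__ == "__main__":
    tokens = 10_000
    print("H100, DiffusionGemma:", generation_seconds(tokens, H100_TPS), "s")
    print("RTX 5090, DiffusionGemma:", generation_seconds(tokens, RTX_5090_TPS), "s")
    print("H100, implied baseline:", generation_seconds(tokens, IMPLIED_BASELINE_TPS), "s")
```

Under these assumptions, a 10,000-token generation takes about 10 seconds on an H100 instead of about 40.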
Availability and hardware requirements
DiffusionGemma is available as an open-source model under the Apache 2.0 license. The quantized version fits in 18 GB of VRAM, which puts it within reach of consumer cards with 24 GB of memory. The model is published on Hugging Face, the Google Cloud Model Garden, and the NVIDIA NIM platform.
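As a rough sanity check on the 18 GB figure, the memory taken by the weights alone can be estimated from the parameter count and the bit width. The article does not state which quantization is used, so the 4-bit value below is an assumption, and the helper name is hypothetical. Activations, caches, and runtime overhead are not modeled.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    if total_params_billion < 0 or bits_per_weight <= 0:
        raise ValueError("invalid arguments")
    return total_params_billion * bits_per_weight / 8


TOTAL_PARAMS_B = 26.0      # total parameters, from the article
REPORTED_VRAM_GB = 18.0    # quantized footprint, from the article

if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed bit widths, for comparison only
        print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS_B, bits)} GB")
    headroom = REPORTED_VRAM_GB - weight_memory_gb(TOTAL_PARAMS_B, 4)
    print("headroom at 4 bits:", headroom, "GB")
```

If the weights were stored at 4 bits, they would take about 13 GB, leaving roughly 5 GB of the reported 18 GB for everything else.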
Architecture: MoE with 3.8B active parameters
Although DiffusionGemma has 26B parameters in total, its Mixture-of-Experts (MoE) architecture activates only 3.8B of them, roughly 15%, at each inference step. This lowers the compute cost of each call and eases deployment on constrained hardware.
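The idea behind sparse activation can be sketched with generic top-k expert routing. This is a textbook illustration, not DiffusionGemma's actual router: the article does not give the number of experts, the value of k, or the gating scheme, so those are assumptions here, and both function names are hypothetical.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k experts with the largest router scores.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the chosen experts only (shifted for numerical stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters used per step."""
    return active_params_b / total_params_b


if __name__ == "__main__":
    # Four toy experts, two active per token (assumed values).
    print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
    print(f"active share: {active_fraction(3.8, 26.0):.1%}")
```

Only the selected experts run for a given input, which is why compute scales with the active parameter count rather than the total.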
The cost of speed
Google does not hide the trade-off: text quality is somewhat lower than that of standard Gemma 4. DiffusionGemma is designed for scenarios where throughput is critical, such as bulk generation, streaming applications, and edge inference, not for tasks that demand maximum precision. For production use cases where quality takes priority, Gemma 4 remains the better choice.
Frequently Asked Questions
- What is diffusion text generation?
- Diffusion text generation is an approach in which the model generates entire blocks of tokens in parallel in a single pass, unlike autoregressive models that generate one token at a time sequentially.
- Is DiffusionGemma 26B equal in quality to Gemma 4?
- No. Google explicitly states that quality is somewhat lower than that of standard Gemma 4. DiffusionGemma is optimized for speed, and the quality trade-off is a deliberate design choice.