Google: DiffusionGemma 26B, up to 4× faster text generation via a diffusion approach

DiffusionGemma is Google's 26B MoE model that generates text using a diffusion approach — in parallel rather than sequentially. It achieves more than 1,000 tokens per second on a single H100 GPU, up to 4× faster than standard autoregressive models, with a quality trade-off versus Gemma 4.

This article was generated using artificial intelligence from primary sources.

Google has released DiffusionGemma, a 26B-parameter model that generates text by diffusion: it produces blocks of tokens in parallel instead of decoding token by token, as most popular language models do.

What does diffusion text generation mean?

Diffusion text generation takes a different route from classic autoregressive models such as GPT or standard Gemma 4. Instead of generating one token at a time in a sequential loop, DiffusionGemma generates an entire block of 256 tokens in parallel in each forward pass. Because GPUs are built for parallel work, this yields much higher throughput on modern hardware.
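
To make the difference concrete, the sketch below counts forward passes for the two decoding styles. The function names and the `steps_per_block` parameter are illustrative assumptions, not part of any Google API. A value of 1 matches the description above, while larger values model a decoder that refines each block over several passes.

```python
import math

BLOCK_SIZE = 256  # block size quoted in the article


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def block_parallel_passes(num_tokens: int,
                          block_size: int = BLOCK_SIZE,
                          steps_per_block: int = 1) -> int:
    """Block-parallel decoding: each pass works on a whole block of tokens.

    steps_per_block is a hypothetical knob, not a documented setting:
    1 mirrors the article's description, larger values model extra
    refinement passes over the same block.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    n = 2048
    print(autoregressive_passes(n))                       # 2048
    print(block_parallel_passes(n))                       # 8
    print(block_parallel_passes(n, steps_per_block=16))   # 128
```

The pass count drops far more than the reported speedup, which suggests that each block-wide pass costs more than a single-token pass. The counts illustrate the structure of the two loops, not a prediction of wall-clock time.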

How much faster is it really?

On a single NVIDIA H100 GPU the model achieves more than 1,000 tokens per second. On a consumer RTX 5090 the speed is 700+ tokens per second. According to Google’s measurements, that is up to 4× faster than comparable autoregressive models on the same GPU — a difference that is especially visible during long generations or under high throughput demands.
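
As a rough illustration of what these figures mean in practice, the following sketch converts throughput into generation time. The 250 tokens-per-second baseline is not a measured number. It is derived here by dividing the H100 figure by the maximum quoted speedup, so treat it as an assumption.

```python
H100_TOKENS_PER_SECOND = 1000.0      # "more than 1,000", used as a lower bound
RTX_5090_TOKENS_PER_SECOND = 700.0   # "700+", used as a lower bound
MAX_SPEEDUP = 4.0                    # "up to 4x" per Google's measurements

# Assumption: baseline implied by the best-case speedup, not a benchmark.
IMPLIED_BASELINE_TOKENS_PER_SECOND = H100_TOKENS_PER_SECOND / MAX_SPEEDUP


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    n = 10_000
    print(round(generation_seconds(n, H100_TOKENS_PER_SECOND), 1))              # 10.0
    print(round(generation_seconds(n, RTX_5090_TOKENS_PER_SECOND), 1))          # 14.3
    print(round(generation_seconds(n, IMPLIED_BASELINE_TOKENS_PER_SECOND), 1))  # 40.0
```

Under these assumptions, a 10,000-token generation takes about 10 seconds instead of about 40 on an H100.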

Availability and hardware requirements

DiffusionGemma is available as an open-source model under the Apache 2.0 license. The quantized version fits in just 18 GB of VRAM, which puts it within reach of consumer cards with 24 GB of memory or more. The model is published on Hugging Face, the Google Cloud Model Garden, and the NVIDIA NIM platform.
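
The 18 GB figure can be put in context with some back-of-the-envelope arithmetic. The sketch below assumes 1 GB = 10^9 bytes and treats the whole 18 GB as weight storage, which overstates the bits per parameter if that figure also covers activations or caches. The article does not say which quantization scheme is used.

```python
TOTAL_PARAMS = 26e9        # total parameter count from the article
QUANTIZED_VRAM_GB = 18.0   # quoted footprint of the quantized version


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def implied_bits_per_param(vram_gb: float, num_params: float) -> float:
    """Upper bound on bits per parameter if all of vram_gb held weights."""
    return vram_gb * 1e9 * 8 / num_params


if __name__ == "__main__":
    print(weight_memory_gb(TOTAL_PARAMS, 16))   # 52.0 GB at 16-bit precision
    print(weight_memory_gb(TOTAL_PARAMS, 4))    # 13.0 GB at 4-bit precision
    print(round(implied_bits_per_param(QUANTIZED_VRAM_GB, TOTAL_PARAMS), 2))  # 5.54
```

At 16 bits per parameter the weights alone would need about 52 GB, so the quoted footprint works out to at most roughly 5.5 bits per parameter.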

Architecture: MoE with 3.8B active parameters

DiffusionGemma has 26B parameters in total, but its Mixture-of-Experts (MoE) architecture activates only 3.8B of them at each inference step, roughly 15% of the model. This lowers the compute cost per call and eases deployment on constrained resources.
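
For readers unfamiliar with MoE, here is a minimal sketch of top-k expert routing, the general technique behind sparse activation. The article does not give DiffusionGemma's expert count, its k, or its routing details, so the eight router scores and k = 2 below are purely hypothetical.

```python
import math

TOTAL_PARAMS_B = 26.0    # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 3.8    # parameters active per step, in billions (from the article)


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used in a single step."""
    return active / total


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax their scores.

    Returns a mapping of expert index to mixing weight. All other
    experts are skipped entirely, which is where the savings come from.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


if __name__ == "__main__":
    # Hypothetical router scores for eight experts.
    logits = [0.1, 2.0, -1.0, 1.5, 0.3, -0.2, 0.9, 0.0]
    print(sorted(top_k_route(logits, k=2)))   # [1, 3]
    print(round(active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B), 3))  # 0.146
```

Only the selected experts run, so compute scales with the active parameters. In this sketch all experts still exist in memory, so the saving is in compute, not in weight storage.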

The cost of speed

Google does not hide the trade-off: text quality is somewhat lower than standard Gemma 4. DiffusionGemma is designed for scenarios where throughput is critical — mass generation, streaming applications, edge inference — not for tasks that demand maximum precision. For production use cases where quality takes priority, Gemma 4 remains the better choice.

Frequently Asked Questions

What is diffusion text generation?
Diffusion text generation is an approach in which the model generates entire blocks of tokens in parallel in a single pass, unlike autoregressive models that generate one token at a time sequentially.

Is DiffusionGemma 26B equal in quality to Gemma 4?
No — Google explicitly states that quality is somewhat lower than standard Gemma 4. DiffusionGemma is optimized for speed, and the quality trade-off is a deliberate design choice.