vLLM TurboQuant study: FP8 superior for KV-cache

TurboQuant is an aggressive KV-cache quantization method at 3-4 bits that the Red Hat AI team systematically compared against the FP8 standard. Results show FP8 retains throughput and accuracy, while 3bit-nc variants lose approximately 20 percentage points on demanding reasoning benchmarks like AIME25.

Red Hat AI engineers — Eldar Kurtić, Michael Goin and Alexandre Marques — published on May 11, 2026 the first comprehensive evaluation of the TurboQuant method for KV-cache quantization in the context of the vLLM inference engine. The study compares the FP8 standard with more aggressive 3-4 bit variants on production-relevant models and benchmarks.

What is TurboQuant and how does it differ from FP8?

TurboQuant is a method that compresses only KV-cache storage to 3-4 bits and then dequantizes values back to BF16 for executing the attention computation. In contrast, FP8 quantizes both parts — storage and computation — maintaining throughput throughout the entire pipeline. Tested variants include k8v4 (8-bit keys, 4-bit values), 4bit-nc (with norm correction), and the most aggressive 3bit-nc.

What memory capacity and throughput results does the study show?

On Llama-3.3-70B, Qwen3-30B and MiniMax-M2.7 models, measured savings were: FP8 delivers 2× capacity, k8v4 2.4×, and 4bit-nc 3.4×. But throughput drops — TurboQuant variants retain 66-80% of baseline speed, with latency slowdowns of 10-68% depending on batch size. FP8 retains full baseline throughput.

How large is the accuracy loss on reasoning tasks?

On AIME25, GPQA:Diamond, MATH500 and LiveCodeBench-v6 benchmarks, the most aggressive variants (3bit-nc, k3v4-nc) lose about 20 percentage points. The less aggressive 4bit-nc loses only 1-4 points. Long-context evaluation on openai/mrcr (up to 256k tokens) showed a similar pattern.

The conclusion is unambiguous: “FP8 remains the best default for KV-cache quantization.” It provides 2× capacity without any throughput or accuracy loss. More aggressive TurboQuant variants only make sense in scenarios of extremely limited memory where 4bit-nc offers 3.4× savings at minimal accuracy cost.

Frequently Asked Questions

What is KV-cache and why is it quantized?

KV-cache (key-value cache) is a memory structure in transformer models that stores intermediate attention layer results for previous tokens. Quantizing the KV-cache (reducing precision from FP16/BF16 to 8 or fewer bits) significantly reduces VRAM usage and enables longer context windows, but can affect generation quality.

Why does FP8 outperform more aggressive TurboQuant variants?

FP8 quantizes both storage and the attention computation itself, while TurboQuant variants compress only storage and dequantize to BF16 for computation. The dequantization cost scales with batch size, causing 10-68% slowdown, while aggressive 3-bit variants lose the precision needed for mathematical reasoning.

When is 4bit-nc worth using despite the accuracy loss?

The 4bit-nc variant loses only 1-4 points on benchmarks with 3.4× memory savings — an acceptable tradeoff for scenarios with extreme memory constraints, such as serving very long contexts (256k tokens) on smaller GPUs where FP8 doesn't fit.

vLLM: TurboQuant study shows FP8 remains superior for KV-cache — 3bit-nc drops ~20 pp

What is TurboQuant and how does it differ from FP8?

What memory capacity and throughput results does the study show?

How large is the accuracy loss on reasoning tasks?

Frequently Asked Questions

Sources

Related news

What is TurboQuant and how does it differ from FP8?

What memory capacity and throughput results does the study show?

How large is the accuracy loss on reasoning tasks?

What does Red Hat AI recommend for production deployments?

Frequently Asked Questions

Sources

Related news