vLLM: TurboQuant study shows FP8 remains superior for KV-cache — 3bit-nc drops ~20 pp
TurboQuant is an aggressive KV-cache quantization method at 3-4 bits that the Red Hat AI team systematically compared against the FP8 standard. Results show FP8 retains throughput and accuracy, while 3bit-nc variants lose approximately 20 percentage points on demanding reasoning benchmarks like AIME25.
This article was generated using artificial intelligence from primary sources.
On May 11, 2026, Red Hat AI engineers Eldar Kurtić, Michael Goin and Alexandre Marques published the first comprehensive evaluation of the TurboQuant method for KV-cache quantization in the vLLM inference engine. The study compares the FP8 standard with more aggressive 3-4 bit variants on production-relevant models and benchmarks.
What is TurboQuant and how does it differ from FP8?
TurboQuant compresses only the stored KV-cache to 3-4 bits; the values are dequantized back to BF16 before the attention computation runs. FP8, in contrast, quantizes both storage and the attention computation, which preserves throughput across the whole pipeline. The tested variants include k8v4 (8-bit keys, 4-bit values), 4bit-nc (4 bits with norm correction), and the most aggressive, 3bit-nc.
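The store-compressed, compute-in-full-precision pattern described above can be illustrated with a toy example. This is a minimal sketch, not the TurboQuant algorithm: it substitutes a plain symmetric uniform quantizer with one scale per key vector, and the helper names (`quantize`, `dequantize`, `attention_scores`, `max_error`) are hypothetical. It only shows that keys are stored as low-bit codes, restored on every read, and that the attention weights drift further from the reference as the bit width shrinks.

```python
import math
import random

def quantize(vec, bits):
    """Toy symmetric uniform quantizer (NOT TurboQuant): signed codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits, 7 for 4, 3 for 3
    scale = max(abs(x) for x in vec) / qmax or 1.0  # fall back to 1.0 for an all-zero vector
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in vec]
    return codes, scale

def dequantize(codes, scale):
    """Restore approximate full-precision values from the stored codes."""
    return [c * scale for c in codes]

def attention_scores(query, keys):
    """Softmax over scaled dot products, computed in full precision."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
dim, n_tokens = 64, 16                              # arbitrary toy sizes
query = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
reference = attention_scores(query, keys)

def max_error(bits):
    """Largest change in any attention weight when keys are cached at `bits` bits."""
    stored = [quantize(k, bits) for k in keys]            # what sits in the cache
    restored = [dequantize(c, s) for c, s in stored]      # extra work on every read
    approx = attention_scores(query, restored)
    return max(abs(a - b) for a, b in zip(reference, approx))

for bits in (8, 4, 3):
    print(f"{bits}-bit keys: max attention-weight error = {max_error(bits):.4f}")
```

The `restored` step is the part that a storage-only scheme has to repeat whenever the cache is read, which is where the throughput cost discussed in the study comes from.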
What memory capacity and throughput results does the study show?
On the Llama-3.3-70B, Qwen3-30B and MiniMax-M2.7 models, the measured KV-cache capacity gains over the unquantized baseline were 2× for FP8, 2.4× for k8v4, and 3.4× for 4bit-nc. The extra capacity costs speed: the TurboQuant variants retain 66-80% of baseline throughput, with latency slowdowns of 10-68% depending on batch size. FP8 retains full baseline throughput.
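To put the capacity figures in context, the back-of-the-envelope calculation below compares the ideal gain implied by the bit widths alone with the gains reported in the study. It is a sketch under assumptions: the model shape (80 layers, 8 KV heads, head dimension 128) is the commonly published Llama-3.3-70B configuration rather than a figure from the article, the 40 GiB budget is purely illustrative, and the function names are hypothetical. The reported gains for k8v4 and 4bit-nc come out below the ideal ratios, plausibly because of per-vector metadata, though the article does not break this down.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, key_bits, value_bits):
    """Bytes of KV-cache one token occupies, ignoring any scales or other metadata."""
    elements = layers * kv_heads * head_dim         # per token, for keys and values each
    return elements * (key_bits + value_bits) / 8

# Assumed Llama-3.3-70B shape (not taken from the article).
SHAPE = dict(layers=80, kv_heads=8, head_dim=128)

baseline = kv_bytes_per_token(**SHAPE, key_bits=16, value_bits=16)  # BF16 cache

def ideal_gain(key_bits, value_bits):
    """Capacity gain over BF16 if only the payload bits counted."""
    return baseline / kv_bytes_per_token(**SHAPE, key_bits=key_bits, value_bits=value_bits)

def tokens_in_budget(budget_bytes, key_bits, value_bits):
    """How many tokens of cache fit in a given memory budget."""
    per_token = kv_bytes_per_token(**SHAPE, key_bits=key_bits, value_bits=value_bits)
    return int(budget_bytes // per_token)

# (key bits, value bits, gain reported in the study)
FORMATS = {
    "FP8": (8, 8, 2.0),
    "k8v4": (8, 4, 2.4),
    "4bit-nc": (4, 4, 3.4),
}

budget = 40 * 2**30                                 # hypothetical 40 GiB left for the cache
print(f"BF16: {baseline / 1024:.0f} KiB per token, "
      f"{tokens_in_budget(budget, 16, 16)} tokens in budget")
for name, (kb, vb, reported) in FORMATS.items():
    print(f"{name}: ideal {ideal_gain(kb, vb):.2f}x, reported {reported}x, "
          f"{tokens_in_budget(budget, kb, vb)} tokens in budget (ideal)")
```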
How large is the accuracy loss on reasoning tasks?
On AIME25, GPQA:Diamond, MATH500 and LiveCodeBench-v6 benchmarks, the most aggressive variants (3bit-nc, k3v4-nc) lose about 20 percentage points. The less aggressive 4bit-nc loses only 1-4 points. Long-context evaluation on openai/mrcr (up to 256k tokens) showed a similar pattern.
What does Red Hat AI recommend for production deployments?
The conclusion is unambiguous: “FP8 remains the best default for KV-cache quantization.” It provides 2× capacity without any throughput or accuracy loss. The more aggressive TurboQuant variants make sense only when memory is extremely limited; in that case 4bit-nc offers 3.4× savings at a minimal accuracy cost.
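The recommendation can be read as a simple decision rule: pick the least aggressive format that meets the required capacity gain. The sketch below encodes that reading. The function `pick_kv_cache_format` and its threshold logic are hypothetical, the capacity gains are the ones reported in the article, and the returned strings are the article's variant labels rather than verified vLLM option values. The 3-bit variants are left out because of their reported loss of about 20 percentage points.

```python
# Formats in order of preference, with the capacity gain reported in the study.
# 3bit-nc and k3v4-nc are deliberately excluded (about 20 points of accuracy loss).
PREFERENCE = [
    ("fp8", 2.0),       # recommended default: no throughput or accuracy loss reported
    ("k8v4", 2.4),      # TurboQuant: more capacity, reduced throughput
    ("4bit-nc", 3.4),   # TurboQuant: 1-4 points of accuracy loss reported
]

def pick_kv_cache_format(required_gain):
    """Return the first format whose reported capacity gain covers `required_gain`.

    `required_gain` is the needed KV-cache capacity relative to an unquantized cache,
    e.g. 2.2 means "2.2 times as many tokens as BF16 would hold".
    """
    for label, gain in PREFERENCE:
        if gain >= required_gain:
            return label
    raise ValueError(
        f"no evaluated format reaches {required_gain}x without a large accuracy loss"
    )

for need in (1.5, 2.2, 3.0):
    print(f"need {need}x -> {pick_kv_cache_format(need)}")
```

In vLLM, the FP8 choice corresponds to the existing `kv_cache_dtype="fp8"` engine argument (`--kv-cache-dtype fp8` on the command line); the article does not give the option names for the TurboQuant variants, so they are not mapped here.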
Frequently Asked Questions
- What is KV-cache and why is it quantized?
- KV-cache (key-value cache) is a memory structure in transformer models that stores intermediate attention layer results for previous tokens. Quantizing the KV-cache (reducing precision from FP16/BF16 to 8 or fewer bits) significantly reduces VRAM usage and enables longer context windows, but can affect generation quality.
- Why does FP8 outperform more aggressive TurboQuant variants?
- FP8 quantizes both storage and the attention computation itself, while TurboQuant variants compress only storage and dequantize to BF16 for computation. The dequantization cost scales with batch size, causing 10-68% slowdown, while aggressive 3-bit variants lose the precision needed for mathematical reasoning.
- When is 4bit-nc worth using despite the accuracy loss?
- The 4bit-nc variant loses only 1-4 points on benchmarks while delivering 3.4× memory savings. That is an acceptable tradeoff under extreme memory constraints, such as serving very long contexts (256k tokens) on smaller GPUs where even an FP8 cache does not fit.