🟢 🏥 In Practice Published: · 2 min read ·

arXiv:2606.20474: UltraQuant Reduces KV Cache Latency by 3.47× with 4-Bit Precision

arXiv:2606.20474 ↗

Editorial illustration: UltraQuant reduces KV cache latency by 3.47× with 4-bit precision

UltraQuant is a KV cache compression technique targeting 4-bit precision for multi-turn LLM agents. Developed at AMD, UCLA, and Purdue, it achieves 3.47× faster P50 TTFT in late rounds under high context pressure and 1.63× higher output throughput compared to the FP8 baseline.

🤖

This article was generated using artificial intelligence from primary sources.

UltraQuant: 4-Bit KV Cache Compression for Agentic LLM Workflows

Inesh Chakrabarti and colleagues from AMD, UCLA, and Purdue University have published UltraQuant — a system that compresses the KV cache (key-value cache) of multi-turn LLM agents from FP16/FP8 down to 4-bit precision (FP4), significantly reducing memory bandwidth pressure.

Asymmetric Approach: FP8 Queries, FP4 Keys and Values

The KV cache is a bottleneck in long agentic conversations because it grows linearly with context length. UltraQuant introduces asymmetric treatment: queries remain in FP8, while keys and values are quantized to FP4 via Walsh-Hadamard rotation, which redistributes outliers and reduces quantization error. AMD GPUs with native scaled-MFMA support execute FP4 matrix multiplications in hardware, without software emulation.

Results: 3.47× Faster TTFT in Late Rounds

On AMD hardware with scaled-MFMA enabled:

  • P50 TTFT (time-to-first-token) in late rounds under high context pressure: 3.47× faster vs. FP8 baseline
  • Average TTFT improvement across all rounds: 2.3×
  • Output throughput: 1.63× more tokens per second

By comparison, standard FP8 quantization typically delivers a 1.3–1.5× TTFT speedup while requiring higher memory capacity. UltraQuant is especially effective in high-turn-count agents (multi-turn), where the KV cache of late rounds becomes the dominant bottleneck.

Applications: Multi-Turn Agentic Scenarios

The work targets production scenarios such as chatbots, RAG pipelines, and coding agents where context length grows with each exchange. The authors emphasize that UltraQuant is complementary to techniques such as GQA (grouped-query attention) and PagedAttention, with which it can be combined.

The paper was submitted on June 18, 2026, and published on June 19 on arXiv (arXiv:2606.20474).

Frequently Asked Questions

What is a KV cache and why does it matter for LLM agents?
The KV (key-value) cache stores intermediate attention results in transformers to avoid recomputing them at every new token — critical for multi-turn agentic conversations with long context.
Which GPUs does UltraQuant support, and does it require special hardware?
UltraQuant uses AMD GPUs with native support for scaled-MFMA operations (natively FP4), enabling full hardware acceleration without software emulation.
How does UltraQuant treat keys and values differently?
It applies an asymmetric approach: queries remain in FP8 precision, while keys and values are compressed to FP4 using Walsh-Hadamard rotation, which redistributes outliers and reduces quantization error.