arXiv:2606.20474: UltraQuant Reduces KV Cache Latency by 3.47× with 4-Bit Precision
UltraQuant is a KV cache compression technique targeting 4-bit precision for multi-turn LLM agents. Developed at AMD, UCLA, and Purdue, it achieves 3.47× faster P50 TTFT in late rounds under high context pressure and 1.63× higher output throughput compared to the FP8 baseline.
This article was generated using artificial intelligence from primary sources.
UltraQuant: 4-Bit KV Cache Compression for Agentic LLM Workflows
Inesh Chakrabarti and colleagues from AMD, UCLA, and Purdue University have published UltraQuant — a system that compresses the KV cache (key-value cache) of multi-turn LLM agents from FP16/FP8 down to 4-bit precision (FP4), significantly reducing memory bandwidth pressure.
Asymmetric Approach: FP8 Queries, FP4 Keys and Values
The KV cache is a bottleneck in long agentic conversations because it grows linearly with context length. UltraQuant introduces asymmetric treatment: queries remain in FP8, while keys and values are quantized to FP4 via Walsh-Hadamard rotation, which redistributes outliers and reduces quantization error. AMD GPUs with native scaled-MFMA support execute FP4 matrix multiplications in hardware, without software emulation.
Results: 3.47× Faster TTFT in Late Rounds
On AMD hardware with scaled-MFMA enabled:
- P50 TTFT (time-to-first-token) in late rounds under high context pressure: 3.47× faster vs. FP8 baseline
- Average TTFT improvement across all rounds: 2.3×
- Output throughput: 1.63× more tokens per second
By comparison, standard FP8 quantization typically delivers a 1.3–1.5× TTFT speedup while requiring higher memory capacity. UltraQuant is especially effective in high-turn-count agents (multi-turn), where the KV cache of late rounds becomes the dominant bottleneck.
Applications: Multi-Turn Agentic Scenarios
The work targets production scenarios such as chatbots, RAG pipelines, and coding agents where context length grows with each exchange. The authors emphasize that UltraQuant is complementary to techniques such as GQA (grouped-query attention) and PagedAttention, with which it can be combined.
The paper was submitted on June 18, 2026, and published on June 19 on arXiv (arXiv:2606.20474).
Frequently Asked Questions
- What is a KV cache and why does it matter for LLM agents?
- The KV (key-value) cache stores intermediate attention results in transformers to avoid recomputing them at every new token — critical for multi-turn agentic conversations with long context.
- Which GPUs does UltraQuant support, and does it require special hardware?
- UltraQuant uses AMD GPUs with native support for scaled-MFMA operations (natively FP4), enabling full hardware acceleration without software emulation.
- How does UltraQuant treat keys and values differently?
- It applies an asymmetric approach: queries remain in FP8 precision, while keys and values are compressed to FP4 using Walsh-Hadamard rotation, which redistributes outliers and reduces quantization error.
Sources
Related news
Anthropic: Claude Code v2.1.183 Blocks Destructive Git and Infrastructure Commands in Auto Mode
AWS: SageMaker Gets Over 100 Detailed Inference Metrics and an Insights Dashboard on CloudWatch
GitHub: Copilot Retires Opus 4.6 (fast) on June 29, Adds AGENTS.md to Code Review and ai_credits_used Field to API