
arXiv:2605.19660: OScaR — INT2 KV Cache Quantization Delivers 3× Faster Decoding

arXiv:2605.19660 ↗


Researchers have published OScaR, a KV cache quantization method for large language models that stores each cached value in INT2 precision, just 2 bits. According to the authors, it keeps accuracy near-lossless while delivering 3× faster decoding, 5.3× lower memory use, and 4.1× higher throughput than BF16 FlashDecoding-v2.


This article was generated using artificial intelligence from primary sources.

Memory is one of the biggest obstacles to running large language models in production. Each time a model generates a new token, attention needs the keys and values of every earlier token in the context. These are held in a temporary buffer known as the KV cache (Key-Value cache), which grows linearly with context length and can occupy tens of gigabytes of GPU memory.
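The linear growth is easy to check with a back-of-the-envelope calculation. The sketch below assumes a standard transformer layout in which every layer caches one key vector and one value vector per KV head per token. The model shape is hypothetical and chosen only for illustration; it is not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bits_per_value=16):
    """Bytes needed to hold keys and values for every layer and cached token."""
    # The factor 2 counts keys plus values.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8

# Hypothetical model shape, for illustration only.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens: {gib:.1f} GiB in BF16")
```

For this assumed shape a single request needs 1 GiB at 8,192 tokens and 16 GiB at 131,072 tokens, before multiplying by the number of concurrent requests.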

Why was extreme KV cache compression an unsolved problem?

The standard approach is quantization: instead of 16-bit floating-point numbers, values are stored in a lower-precision format. INT4 (4 bits) costs an acceptable amount of accuracy, but INT2 (2 bits), which has only four representable levels, has until now caused a dramatic drop in accuracy. The researchers attribute this to what they call token norm imbalance: certain dimensions carry outlier values that a 2-bit representation cannot store accurately.
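To see why four levels struggle with outliers, here is a minimal sketch of plain min-max 2-bit quantization applied to a small vector, with and without one large value. This is a generic textbook quantizer, not the scheme from the paper, and the numbers are synthetic.

```python
import numpy as np

def quantize_int2(x):
    """Min-max 2-bit quantization: one scale and offset shared by the whole vector."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 3 if hi > lo else 1.0   # 2 bits -> 4 levels -> 3 steps
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_int2(codes, scale, lo):
    return codes * scale + lo

def mean_abs_error(x, keep):
    """Round-trip error measured only on the entries selected by `keep`."""
    x_hat = dequantize_int2(*quantize_int2(x))
    return float(np.abs(x - x_hat)[keep].mean())

x = np.array([0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.05, -0.05])
x_outlier = x.copy()
x_outlier[0] = 12.0                      # synthetic outlier, illustration only

small = np.arange(len(x)) != 0           # ignore the outlier slot itself
err_clean = mean_abs_error(x, small)
err_outlier = mean_abs_error(x_outlier, small)
print(f"no outlier: {err_clean:.3f}   with outlier: {err_outlier:.3f}")
```

Once the outlier stretches the range, every small entry falls into the same quantization bin, so the information they carried is lost.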

OScaR (Occam’s Razor) addresses this in two steps: channel rotation first evens out the value distribution, and Omni-Token Scaling then rescales the remaining variation per token. The result is INT2 quantization that, according to the authors, achieves “near-lossless” accuracy on benchmarks.
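The general idea of rotating before quantizing can be illustrated in a few lines of NumPy. The sketch below is not the authors' implementation: the random orthogonal rotation, the per-token absmax scale, and the symmetric four-level grid are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR; a stand-in for the paper's rotation."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def quantize_rotated_int2(k, rot):
    """k: (tokens, channels). Rotate channels, scale each token, round to 4 levels."""
    k_rot = k @ rot                                    # step 1: spread channel outliers
    scale = np.abs(k_rot).max(axis=1, keepdims=True)   # step 2: one scale per token (assumed absmax)
    scale = np.where(scale == 0, 1.0, scale)
    # Codes 0..3 index the symmetric grid {-1, -1/3, 1/3, 1} times the token scale.
    codes = np.clip(np.round((k_rot / scale + 1) * 1.5), 0, 3).astype(np.uint8)
    return codes, scale

def dequantize_rotated_int2(codes, scale, rot):
    k_rot = (codes / 1.5 - 1) * scale
    return k_rot @ rot.T                               # rot is orthogonal, so rot.T undoes it

def relative_error(k, rot):
    codes, scale = quantize_rotated_int2(k, rot)
    k_hat = dequantize_rotated_int2(codes, scale, rot)
    return float(np.linalg.norm(k - k_hat) / np.linalg.norm(k))

dim = 64
rng = np.random.default_rng(1)
k = rng.standard_normal((16, dim))
k[:, 5] += 30.0                                        # synthetic outlier channel

err_plain = relative_error(k, np.eye(dim))             # identity = no rotation
err_rot = relative_error(k, random_rotation(dim))
print(f"relative error without rotation: {err_plain:.2f}, with rotation: {err_rot:.2f}")
```

On this synthetic data the rotation spreads the outlier channel's energy across all channels, so the four levels cover the values far better than they do without it. How OScaR actually constructs its rotation and scaling is described in the paper.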

What do the numbers actually mean?

Compared to BF16 FlashDecoding-v2 (the de facto standard for efficient inference):

  • 3.0× faster decoding: tokens are generated about three times faster during the decode phase
  • 5.3× less memory: the same GPU can hold significantly longer contexts or more parallel requests
  • 4.1× higher throughput: more requests served on the same hardware

According to the authors, the method works on text, multimodal, and omni-modal models, and the code is publicly available on GitHub. The paper was submitted for review on May 19, 2026.

Practical significance for AI infrastructure

For companies running LLM inference in the cloud, these numbers translate directly into cost. If the same GPU can serve 4× as many requests at the same latency, the cost per query drops by about 75%. If OScaR survives peer review and reproduces these results across a broader range of models, it could become a standard part of the inference stack alongside FlashAttention and speculative decoding.
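The cost figure follows from simple arithmetic. The sketch below assumes a fixed GPU price per hour and full utilization, so that cost per query is inversely proportional to throughput. Both are simplifying assumptions rather than claims from the paper.

```python
def cost_reduction(throughput_gain):
    """Fractional drop in cost per query when one GPU serves `throughput_gain` times as many requests."""
    return 1 - 1 / throughput_gain

print(f"{cost_reduction(4.0):.1%}")   # 75.0%
print(f"{cost_reduction(4.1):.1%}")   # 75.6%
```

In practice the saving depends on how well the extra capacity is actually used.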

Frequently Asked Questions

What is the KV cache and why is it difficult to compress?
The KV cache (Key-Value cache) is the temporary memory in which a language model stores the attention key and value vectors of already-processed tokens. Without it, every new token would require recomputing the entire context. According to the researchers, the challenge in compressing it is "token norm imbalance": certain dimensions have extremely high values that standard quantization algorithms cannot capture accurately in a small number of bits.
What exactly does INT2 quantization mean?
INT2 quantization means each value in the KV cache is stored in just 2 bits instead of the standard 16 or 32 bits. This is "extreme" compression: 8× smaller than the usual BF16 format and 16× smaller than FP32. That is the raw storage ratio and does not count any scale metadata stored alongside the codes. With OScaR's rotation and token scaling, the authors report near-lossless model accuracy at this precision.
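As a concrete picture of 2-bit storage, four codes fit in one byte. The following is a minimal NumPy sketch of such packing. The bit order, lowest bits first, is an arbitrary choice for illustration and says nothing about the layout OScaR's kernels use.

```python
import numpy as np

def pack_int2(codes):
    """Pack 2-bit codes (values 0..3) four to a byte; length must be a multiple of 4."""
    c = np.asarray(codes, dtype=np.uint8).reshape(-1, 4)
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

def unpack_int2(packed):
    """Inverse of pack_int2: recover the original sequence of 2-bit codes."""
    p = np.asarray(packed, dtype=np.uint8)
    return np.stack([(p >> s) & 3 for s in (0, 2, 4, 6)], axis=1).reshape(-1)

codes = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)
packed = pack_int2(codes)
print(packed.nbytes, "bytes packed vs", codes.size * 2, "bytes in BF16")
```

Eight values occupy 2 bytes here instead of 16 bytes in BF16, which is the 8× raw ratio.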
Does OScaR work only for text models?
No. According to the authors, OScaR is designed for text, multimodal, and omni-modal language models, which makes it applicable across the broader ecosystem of modern AI systems that combine text, images, and audio.