Infrastructure
KV Cache
Cached key/value attention tensors that are reused across decoding steps to speed up inference in large language models.
The KV cache is an inference optimization that stores the key (K) and value (V) tensors computed in each attention layer so they can be reused across successive text-generation steps.
Large language models emit one token at a time, and each new token attends to every previous one. Without a cache, the model would recompute the K and V projections for the entire sequence at every step, so the redundant projection work alone grows quadratically with the length of the generated text. Because those tensors do not change once a token has been processed, the KV cache stores them. Each step then computes the query, key, and value only for the new token, appends the new key and value to the cache, and lets the new query attend over all cached keys and values. This cuts the attention cost of each decoding step from roughly quadratic to roughly linear in the current sequence length.
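The caching idea can be illustrated with a minimal single-head sketch in NumPy. It is only an illustration under simplifying assumptions: one attention head, no batching, random weights, and hypothetical function names (`decode_without_cache`, `decode_with_cache`) that do not come from any particular library. It checks that decoding with a cache gives the same outputs as recomputing K and V for the whole prefix at every step.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d,), K and V: (t, d). Attention output for a single query.
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def decode_without_cache(X, Wq, Wk, Wv):
    # Baseline: re-project the whole prefix into K and V at every step.
    outputs = []
    for t in range(1, len(X) + 1):
        K = X[:t] @ Wk
        V = X[:t] @ Wv
        q = X[t - 1] @ Wq
        outputs.append(attend(q, K, V))
    return np.stack(outputs)

def decode_with_cache(X, Wq, Wk, Wv):
    # Cached: project only the new token and append its K and V rows.
    # np.vstack copies the cache each step for simplicity; a real
    # implementation would typically write into preallocated memory.
    K_cache = np.empty((0, Wk.shape[1]))
    V_cache = np.empty((0, Wv.shape[1]))
    outputs = []
    for x in X:
        K_cache = np.vstack([K_cache, x @ Wk])
        V_cache = np.vstack([V_cache, x @ Wv])
        outputs.append(attend(x @ Wq, K_cache, V_cache))
    return np.stack(outputs)

# Illustrative sizes only: 5 tokens, model width 8.
rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(5, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

baseline = decode_without_cache(X, Wq, Wk, Wv)
cached = decode_with_cache(X, Wq, Wk, Wv)
same = np.allclose(baseline, cached)
print(same)
```

In the cached version, each step performs one K projection and one V projection regardless of how long the sequence already is. The only work that still grows with sequence length is the dot product of the new query against the cached keys.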
As of 2025–2026, the KV cache is often the dominant memory bottleneck for long context windows and high-throughput serving. Its footprint grows linearly with sequence length and with the number of concurrent requests, which motivates techniques such as multi-query and grouped-query attention (which share K/V heads across query heads), cache quantization, and paged allocation (PagedAttention) to keep memory in check.
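A rough back-of-the-envelope estimate shows why the footprint matters. The sketch below assumes the cache holds one K and one V vector per layer, per K/V head, per token, per request. The function name `kv_cache_bytes` and the model shape (32 layers, head dimension 128, 16-bit elements) are illustrative assumptions, not figures for any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    # The leading 2 accounts for storing both K and V.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

GiB = 1024 ** 3

# Hypothetical model shape: 32 layers, head_dim 128, 4096-token context.
mha = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)   # 32 K/V heads
gqa = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1)    # 8 shared K/V heads
gqa_int8 = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1,
                          bytes_per_elem=1)                     # 8-bit cache

print(mha / GiB, gqa / GiB, gqa_int8 / GiB)  # 2.0 0.5 0.25
```

Under these assumptions a single 4096-token request needs 2 GiB with full multi-head K/V, 0.5 GiB when groups of four query heads share one K/V head, and 0.25 GiB if that cache is also stored in 8 bits. The total scales linearly with both `seq_len` and `batch_size`.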