vLLM introduces DeepSeek V4 with 8.7× smaller KV cache: one million token context on standard GPU hardware
Why it matters
vLLM published full integration of V4-Pro and V4-Flash models on the same day as DeepSeek's release, with an 8.7× smaller KV cache compared to V3.2 at one-million-token context. The combination of sparse attention and aggressive compression enables serving on standard GPU hardware.
vLLM, one of the most widely used open-source serving frameworks for large language models, published full support for DeepSeek V4-Pro and V4-Flash on April 24, 2026. The key claim: a KV cache 8.7× smaller than what V3.2-style models would require at the same context length of one million tokens.
This is not merely a theoretical claim — the vLLM implementation in a production environment consumes approximately 9.62 GiB per sequence in bf16 at full one-million-token context. That is the difference between “we need an H100 cluster” and “fits on a standard production card.”
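The headline numbers can be sanity-checked with a little arithmetic. A sketch follows; the derived per-token figure and the baseline footprint are computed from the article's 9.62 GiB and 8.7× claims, not reported by vLLM directly:

```python
# Back-of-the-envelope check of the reported KV cache footprint.
# Only the 9.62 GiB and 8.7x figures come from the source; the rest
# is derived here.

CONTEXT_TOKENS = 1_000_000   # one-million-token context
V4_CACHE_GIB = 9.62          # reported per-sequence cache in bf16
COMPRESSION = 8.7            # reported saving vs. a V3.2-style cache

# Equivalent uncompressed (V3.2-style) footprint at the same context.
baseline_gib = V4_CACHE_GIB * COMPRESSION            # ~83.7 GiB per sequence

# Effective bytes of KV cache per token after compression.
bytes_per_token = V4_CACHE_GIB * 2**30 / CONTEXT_TOKENS   # ~10.3 KB

print(f"baseline: {baseline_gib:.1f} GiB, per token: {bytes_per_token:.0f} B")
```

An 80 GB H100 could not hold even one uncompressed sequence at this length, which is the substance of the "cluster versus single card" comparison.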
How does the KV cache optimization work?
DeepSeek V4 uses a four-layer strategy that vLLM had to support at the serving layer. First, shared KV vectors with inverse RoPE application cut memory requirements in half. Second, KV cache compression through weighted token aggregation yields a further 4× to 128× saving, depending on the method.
The third layer is sparse attention that limits computation to the top-k compressed tokens, while the fourth — a local sliding window — preserves full vectors for the recent context so that precision in the model’s immediate focus is not lost.
In practical terms, this means the model simultaneously holds an aggressively compressed global context and precise local attention — a meaningful departure from classical GQA architectures that scale memory linearly with context length.
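As an illustration of the last two layers, the combination of sparse global attention and a full-precision local window can be sketched in NumPy. This is a toy, not the actual V4 kernels: the mean-pooling compressor stands in for V4's weighted token aggregation, and all names and parameters are assumptions.

```python
import numpy as np

def hybrid_attention(q, k, v, window=4, ratio=4, top_k=2):
    """Toy sketch: top-k sparse attention over a compressed global
    context, plus full attention over the recent local window.

    q: (d,) single query; k, v: (T, d) full-precision KV history.
    """
    T, d = k.shape
    k_local, v_local = k[-window:], v[-window:]   # recent tokens, uncompressed
    k_glob, v_glob = k[:-window], v[:-window]

    # Compress the global context: mean-pool groups of `ratio` tokens
    # (a stand-in for weighted token aggregation; a trailing partial
    # group is dropped for simplicity).
    g = len(k_glob) // ratio
    kc = k_glob[:g * ratio].reshape(g, ratio, d).mean(axis=1)
    vc = v_glob[:g * ratio].reshape(g, ratio, d).mean(axis=1)

    # Sparse step: keep only the top-k compressed tokens by score.
    scores = kc @ q / np.sqrt(d)
    keep = np.argsort(scores)[-top_k:]

    # Attend jointly over selected compressed tokens + local window.
    keys = np.concatenate([kc[keep], k_local])
    vals = np.concatenate([vc[keep], v_local])
    logits = keys @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ vals   # (d,) attention output
```

The point of the sketch is the memory shape: the softmax runs over `top_k + window` entries regardless of how long the history grows, while the local window keeps the most recent tokens at full precision.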
What did vLLM have to solve in the integration?
Integrating heterogeneous compression ratios into a single serving engine is non-trivial. The vLLM team highlights three main technical challenges they had to address.
The first is memory management: different attention layers have different compression ratios (4× for CSA, 128× for HCA), but vLLM uses fixed logical blocks of 256 token positions to preserve compatibility with the PagedAttention mechanism. This means the internal mapping of logical to physical blocks changes depending on the layer.
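The layer-dependent mapping can be pictured with a toy block table. The 256-position logical blocks and the CSA/HCA ratios come from the article; everything else below is illustrative, not vLLM internals:

```python
# Toy sketch of per-layer logical->physical block accounting.
# Logical blocks always cover 256 token positions; the physical
# slots needed per block shrink with the layer's compression ratio.

LOGICAL_BLOCK_TOKENS = 256

LAYER_RATIOS = {"csa": 4, "hca": 128}   # ratios quoted in the article

def physical_slots(num_tokens: int, layer: str) -> int:
    """Physical KV slots a layer needs to cover `num_tokens` positions."""
    ratio = LAYER_RATIOS[layer]
    logical_blocks = -(-num_tokens // LOGICAL_BLOCK_TOKENS)  # ceiling division
    slots_per_block = LOGICAL_BLOCK_TOKENS // ratio
    return logical_blocks * slots_per_block

# The same 1M-token sequence occupies very different physical space per layer:
print(physical_slots(1_000_000, "csa"))   # 250048 slots (64 per logical block)
print(physical_slots(1_000_000, "hca"))   # 7814 slots   (2 per logical block)
```

Keeping the logical block size constant is what preserves PagedAttention compatibility: the scheduler reasons in uniform 256-token blocks while each layer translates them to its own physical footprint.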
The second challenge is state: the compressor remainder is treated as a sliding-window KV, enabling integration with the existing prefix cache mechanism and disaggregated serving infrastructure. Without this trick, prefix caching — essential for production LLM serving — would not function across compressed sequences.
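The remainder trick can be sketched as follows. This is illustrative only: the group size, function names, and hashing scheme are assumptions, not vLLM's actual data structures.

```python
# Sketch: the compressor consumes tokens in fixed-size groups; whatever
# does not yet fill a group is held as ordinary sliding-window KV.
# Only fully compressed groups participate in prefix-cache lookup, so
# two requests sharing a prefix resolve to the same cached blocks.

import hashlib

GROUP = 128  # tokens aggregated into one compressed entry (assumed value)

def split_state(token_ids):
    """Partition a sequence into compressed groups + uncompressed remainder."""
    n_full = len(token_ids) // GROUP * GROUP
    return token_ids[:n_full], token_ids[n_full:]

def prefix_cache_key(token_ids):
    """Cache key derived only from the fully compressed prefix."""
    compressed, _remainder = split_state(token_ids)
    return hashlib.sha256(str(list(compressed)).encode()).hexdigest()

seq_a = list(range(300))        # 2 full groups + a 44-token remainder
seq_b = list(range(300)) + [7]  # same compressed prefix, longer remainder
assert prefix_cache_key(seq_a) == prefix_cache_key(seq_b)
```

Because the remainder behaves like any sliding-window KV, the existing prefix cache and disaggregated-serving paths never have to know that compression is happening upstream.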
The third challenge is kernel efficiency: vLLM introduced three targeted fusions and multi-stream parallelization of GPU operations, together delivering 5 to 6 percent lower per-token latency compared to a naive implementation.
Why does this matter for production?
Until now, serving models with one-million-token context was confined to large cloud providers with custom hardware. KV cache memory scaled linearly with context, and 128K tokens already required multiple GPUs per sequence.
With DeepSeek V4 and vLLM’s integration, standard H100 or H200 configurations become sufficient for serving long contexts. Operational cost, according to vLLM’s claims, is reduced by an order of magnitude for long-context agentic workloads.
For development teams considering self-hosting instead of relying on Anthropic or OpenAI APIs — typically for GDPR compliance or data control reasons — this combination is a concrete argument. The V4-Flash model with 13 billion active parameters combined with the vLLM serving layer becomes a viable production option.
Full integration is available in the latest vLLM release via pip install vllm and supports both FP4 and FP8 quantization depending on hardware.
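A minimal deployment sketch, treated here as a config fragment: the Hugging Face model id below is an assumption based on DeepSeek's naming conventions, and the quantization flag should be checked against the release notes for the hardware in use.

```shell
# Install the latest vLLM release
pip install -U vllm

# Launch an OpenAI-compatible server; the model id is assumed,
# not confirmed by the article.
vllm serve deepseek-ai/DeepSeek-V4-Flash \
    --max-model-len 1000000 \
    --quantization fp8
```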
This article was generated using artificial intelligence from primary sources.