🟡 🤝 Agents · Thursday, May 7, 2026 · 2 min read

vLLM: Mooncake distributed KV cache store integration delivers 3.8× higher throughput and 46× lower P50 TTFT for multi-turn agentic workloads

Editorial illustration: network of GPU nodes connected by RDMA links with a central distributed KV cache pool

vLLM integrates Mooncake, an open-source distributed KV cache store that eliminates repeated prefix computation between agentic turns — on realistic Codex traces with 12 GB200 GPUs, throughput increases 3.8×, P50 TTFT drops 46×, end-to-end latency drops 8.6×, and cache hit rate jumps from 1.7% to 92.2%.

🤖

This article was generated using artificial intelligence from primary sources.

The vLLM team has announced the integration of Mooncake, an open-source library for distributed KV cache storage, in response to a specific problem with agentic workloads: long multi-turn interactions where each turn adds only a few thousand new tokens but reuses 80K+ tokens of cached prefix. Without distributed caching, busy instances quickly exhaust local memory, and a load balancer that routes the next turn to a different machine forces a full recompute of the prefix.
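A rough back-of-the-envelope estimate shows why that recompute hurts; the token counts below are illustrative assumptions in the spirit of the numbers above, not measurements from the source.

```python
# Illustrative estimate of prefill work wasted on a cache miss.
# Token counts are assumptions for the sketch, not measured values.
cached_prefix_tokens = 80_000  # history reused from earlier turns
new_tokens_per_turn = 4_000    # fresh tokens added by the current turn

total_prefill = cached_prefix_tokens + new_tokens_per_turn
waste_factor = total_prefill / new_tokens_per_turn
print(f"A cache miss prefills {waste_factor:.0f}x more tokens "
      "than a hit that only processes the new turn")
# -> roughly 21x more prefill tokens on every mis-routed turn
```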

What are the concrete gains on Codex traces?

On realistic Codex/GPT-5.4 traces from the SWE-bench Pro benchmark, using 12 GB200 GPUs, the integration achieves 3.8× higher throughput, 46× lower P50 TTFT (time to first token) and 8.6× lower end-to-end latency. Cache hit rate jumps from 1.7% to 92.2%, confirming that the main source of slowness was recomputation of identical prefixes.

Scaling to 60 GPUs maintains a cache hit rate above 95% with near-linear throughput scaling under round-robin routing. The KV cache (Key-Value cache) is a structure that stores the attention key and value vectors of previous tokens to avoid recomputation; prefix sharing is the reuse of that cache across instances for a common conversation start.
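As a minimal single-instance illustration of the same idea, vLLM's local prefix caching can be enabled via the `enable_prefix_caching` flag; the model name and prompts below are placeholders, and Mooncake extends this reuse from one instance to the whole cluster.

```python
from vllm import LLM, SamplingParams

# Local prefix caching: KV entries for a shared conversation prefix are
# computed once and reused by later requests on the same instance.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)

shared_prefix = "<long multi-turn agent history...>\n"
params = SamplingParams(max_tokens=64)

# The first call pays full prefill; the second reuses the cached prefix
# and only computes KV entries for the new suffix tokens.
llm.generate(shared_prefix + "User: run the tests\nAssistant:", params)
llm.generate(shared_prefix + "User: fix the failure\nAssistant:", params)
```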

How is Mooncake architecturally integrated?

The system uses a master-worker design: the master server manages metadata and health monitoring, clients on GPU nodes form a distributed pool over GPUDirect RDMA, and vLLM connects via the existing KVConnector interface already used for prefill-decode disaggregation. The MultiConnector chain allows a request to recover its prefix from either a prefill instance or the shared pool.
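A hedged sketch of what wiring this up through the KVConnector interface might look like: `KVTransferConfig` and the `MultiConnector` name exist in vLLM today, but the Mooncake-specific connector name and the `PrefillConnector` entry below are illustrative assumptions rather than the confirmed API of the PR.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Hypothetical configuration sketch: the connector names inside the
# chain are assumptions for illustration, not the PR's actual interface.
kv_config = KVTransferConfig(
    kv_connector="MultiConnector",  # chains several KV connectors
    kv_role="kv_both",              # this node both saves and loads KV
    kv_connector_extra_config={
        "connectors": [
            # Try the prefill instance first, then the shared pool.
            {"kv_connector": "PrefillConnector", "kv_role": "kv_both"},
            {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"},
        ]
    },
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", kv_transfer_config=kv_config)
```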

GPUDirect RDMA means the network card reads and writes GPU HBM directly, so KV blocks move between nodes without passing through GPU SMs or CPU staging buffers, keeping GPU kernels free from interference. Asynchronous background threads prepare RDMA descriptors off the critical path.
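The off-critical-path pattern can be sketched generically: a background thread pre-builds transfer descriptors while GPU work continues. Every name and field below is a simplified illustration, not Mooncake's actual code.

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class RdmaDescriptor:
    """Simplified stand-in for an RDMA work request."""
    remote_addr: int   # where the KV block lives on the remote node
    rkey: int          # remote memory registration key
    length: int        # bytes to transfer

pending: "queue.Queue[tuple[int, int]]" = queue.Queue()  # (block_id, length)
ready: "queue.Queue[RdmaDescriptor]" = queue.Queue()

def descriptor_worker() -> None:
    # Builds descriptors off the critical path so GPU kernels never
    # stall on metadata lookups; a real system would resolve block
    # placement from the master's metadata service here.
    while True:
        block_id, length = pending.get()
        ready.put(RdmaDescriptor(remote_addr=0x1000 * block_id,
                                 rkey=42, length=length))

threading.Thread(target=descriptor_worker, daemon=True).start()

# Producer: enqueue KV blocks to fetch; the transfer engine later drains
# `ready` and posts the work requests to the NIC.
pending.put((7, 4096))
print(ready.get())
```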

What does this change for production agentic systems?

Analysis of 610 traces from Codex/GPT-5.4 SWE-bench Pro showed a 94.2% potential cache hit rate, a 131:1 input-to-output ratio, a median of 33 turns per trace, and P99 inter-turn delays ranging from 5.2s to 81.4s. This means agentic workloads are dramatically skewed toward reuse, and systems that cannot share cache pay a real penalty in production.
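The kind of per-trace analysis behind those numbers can be reproduced in a few lines; the trace format and toy values below are assumptions for illustration, not the benchmark's actual data.

```python
# Hypothetical trace format: each turn records
# (prefix_tokens_reused, new_input_tokens, output_tokens).
traces = [
    [(0, 3000, 40), (3040, 2500, 30), (5570, 4000, 25)],  # toy 3-turn trace
]

for turns in traces:
    total_in = sum(reused + new for reused, new, _ in turns)
    total_out = sum(out for _, _, out in turns)
    reusable = sum(reused for reused, _, _ in turns)
    print(f"turns={len(turns)}  "
          f"input:output ratio ~ {total_in / total_out:.0f}:1  "
          f"potential cache hit rate ~ {reusable / total_in:.1%}")
```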

The implementation is available as GitHub PR #40900. Planned next steps include NVMe SSD offloading, support for hybrid architectures, and cache-aware routing. Contributors come from Inferact, Ant Group, Approaching.AI, Huawei and Alibaba Cloud.

Frequently Asked Questions

What is KV cache and why does it matter for agents?
KV cache (Key-Value cache) stores already-computed attention vectors for previous tokens so they don't need to be recomputed with each new token. For agents with long multi-turn histories this is critical — without caching, every turn reprocesses the entire context.
What does prefix sharing mean in a distributed setting?
Prefix sharing is the sharing of KV cache for a common conversation prefix across vLLM instances. Without it, if a load balancer sends the next turn to a different machine, everything must be recomputed. Mooncake allows the entire vLLM cluster to share a cache pool over RDMA.
How does Mooncake achieve such large gains?
GPUDirect RDMA lets the network card move KV blocks directly in and out of GPU HBM, bypassing GPU SMs and CPU staging buffers; asynchronous background threads prepare RDMA descriptors off the critical path; and the MultiConnector chain enables prefix recovery from either a prefill instance or the shared pool.