Why do agents share a KV cache?

In enterprise multi-agent systems, several agents often work on related tasks over the same context. Instead of each agent recomputing the KV cache for the same document, the system computes it once and shares it. The authors report that this can reduce inference costs by 3–5×.

What is the risk of a shared KV cache?

The KV cache holds the key and value vectors that the LLM computed for every token it processed, and those vectors encode the semantics of the text. If agent A processes a confidential document and leaves its KV cache behind, agent B with access to the same cache can reconstruct parts of that confidential content through attention probing.

arXiv LCGuard: KV cache security in multi-agent systems

How does LCGuard close that channel?

The framework adds cryptographic isolation between KV cache regions of different security levels (security domains). The cache can be shared within the same domain but not across boundaries. It also adds a runtime detector that recognizes attention probing attempts and blocks them before they produce output.

LCGuard is a new framework for protecting against data leakage in multi-agent systems that share a KV cache for efficiency. The paper by IBM Research and MIT researchers led by Sadie Asif presents the first formal model for a 'latent communication guard' approach, applicable to production agentic RAG systems where multiple agents share context through a common memory.

The arXiv preprint LCGuard, published on 22 May 2026, introduces the first formal framework for protecting shared KV cache in multi-agent LLM systems. The paper is the result of collaboration between IBM Research and MIT, led by Sadie Asif. The authors document a vulnerability that has emerged in production enterprise agentic RAG systems over recent months and propose a concrete solution applicable without disrupting existing infrastructure.

What is a shared KV cache and why do we use it?

In a classical LLM workflow, every API call builds a new KV cache for its prompt, so inference is stateless: nothing survives the call. In enterprise multi-agent systems this model becomes expensive. If five agents independently process the same confidential PDF, each rebuilds the identical KV cache from scratch, so the system spends five times the GPU memory and five times the compute that a single pass over the document would need.
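To make the redundancy concrete, here is a minimal sketch that counts prefilled prompt tokens. It assumes prefill cost grows linearly with token count, which is a simplification, and the token counts are hypothetical rather than taken from the paper.

```python
def prefill_tokens(num_agents: int, doc_tokens: int, suffix_tokens: int) -> dict:
    """Count prompt tokens prefilled when every agent recomputes the document."""
    # Each agent processes the full document plus its own prompt suffix.
    independent = num_agents * (doc_tokens + suffix_tokens)
    # Work that has to happen at least once: one pass over the document
    # plus every agent's own suffix.
    necessary = doc_tokens + num_agents * suffix_tokens
    return {
        "independent": independent,
        "necessary": necessary,
        "redundant": independent - necessary,
        "ratio": independent / necessary,
    }


# Hypothetical workload: five agents, a 40,000-token PDF, 500-token suffixes.
stats = prefill_tokens(num_agents=5, doc_tokens=40_000, suffix_tokens=500)
print(stats["redundant"])        # 160000 tokens of repeated work
print(round(stats["ratio"], 2))  # 4.76
```

With these made-up numbers, four of the five document passes are repeated work. The ratio approaches the number of agents as the suffixes get shorter relative to the document.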

The optimization that vendors are increasingly implementing is a shared KV cache. The PDF is processed once, the resulting KV cache occupies roughly 200 MB of GPU memory, and all agents receive a pointer to it. Each agent’s inference starts from that pre-populated state and only appends its own prompt suffix. The authors report a cost reduction of 3–5×, which for high-volume workloads can be the difference between a viable and an uneconomical deployment.
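The "compute once, hand out a pointer" pattern can be sketched as a small registry keyed by a hash of the document. This is an illustration only: `SharedPrefixCache`, `CacheEntry` and `fake_prefill` are hypothetical names, and the byte chunks stand in for GPU-resident key and value tensors. It does not reflect the API of any particular inference engine.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CacheEntry:
    doc_hash: str
    kv_blocks: List[bytes]  # stand-in for GPU-resident key/value tensors


class SharedPrefixCache:
    """Builds the KV cache for a document once and hands out references to it."""

    def __init__(self, build_kv: Callable[[bytes], List[bytes]]):
        self._build_kv = build_kv
        self._entries: Dict[str, CacheEntry] = {}
        self.builds = 0  # how many times the expensive prefill actually ran

    def get_or_build(self, document: bytes) -> CacheEntry:
        key = hashlib.sha256(document).hexdigest()
        entry = self._entries.get(key)
        if entry is None:
            entry = CacheEntry(key, self._build_kv(document))
            self._entries[key] = entry
            self.builds += 1
        return entry


def fake_prefill(document: bytes) -> List[bytes]:
    # Hypothetical placeholder: a real engine would run the model here.
    return [document[i:i + 16] for i in range(0, len(document), 16)]


cache = SharedPrefixCache(fake_prefill)
pdf = b"confidential contract text " * 10
handles = [cache.get_or_build(pdf) for _ in range(5)]
print(cache.builds)                           # 1
print(all(h is handles[0] for h in handles))  # True
```

All five agents end up holding the same entry. That is the source of the savings, and it also means every agent can read the whole document's cache.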

How does data leakage work?

The KV cache is not plain text. It consists of key and value vectors that encode semantic information from the original document. But those vectors are not an irreversible transformation of the text: the attention mechanism can extract significant information from them through a strategy the authors call attention probing.

A concrete attack: agent B has legitimate access to the shared KV cache (for example, because it is processing a related document). Agent B crafts prompts that target particular KV cache regions through attention layers 5–15 and thereby reconstructs individual sensitive entities from the original PDF, such as client names, contract numbers, and monetary amounts. The authors show that reconstruction is imperfect but still reaches 60–80 percent recall for named entities.
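Why cached vectors are readable at all can be seen in a toy example of single-head scaled dot-product attention. This is not the authors' attack. It assumes the reader can choose the query vector directly, whereas a real attacker only controls prompts and still has to decode the recovered vectors back into text. All vectors here are made up.

```python
import math


def attention(query, keys, values):
    """Single-head scaled dot-product attention over cached keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy cache with three slots; slot 1 stands for a sensitive entity.
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
values = [[0.1, 0.1], [0.9, -0.7], [0.2, 0.3]]

# A query aligned with slot 1's key, scaled up, concentrates attention there.
probe = [0.0, 20.0, 0.0, 0.0]
weights, output = attention(probe, keys, values)
print(weights[1] > 0.999)  # True: almost all attention lands on slot 1
print(output)              # close to slot 1's value vector [0.9, -0.7]
```

The output is a weighted average of cached values, so a query that isolates one slot returns that slot's value almost unchanged.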

The attack vector matters because enterprise users typically trust that “an agent has access only to its own prompt.” In reality, an agent has access to the KV cache of the entire shared document, a fact that no production API documentation mentioned.

How does LCGuard close that channel?

LCGuard adds two lines of defense.

First line, cryptographic isolation: each KV cache is encrypted with a key that depends on the security domain of the source document. An agent outside that domain can see the cache hint (whether the cache exists and how large it is) but cannot use it, because decryption only occurs when the agent presents the appropriate domain credential. A PDF from the “finance/confidential” domain therefore has a KV cache that marketing agents cannot decrypt, even though it physically occupies the same GPU memory.
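The hint-versus-decrypt split can be sketched with the Python standard library. This is a toy under loud assumptions, not LCGuard's scheme: the credential is modeled as the per-domain key itself, keys are derived with HMAC-SHA256 from a hypothetical `MASTER_KEY`, and the XOR keystream is a stand-in for a vetted authenticated cipher such as AES-GCM. `SealedCache` and `domain_key` are illustrative names.

```python
import hashlib
import hmac
import os

MASTER_KEY = os.urandom(32)  # hypothetical root secret held by the cache service


def domain_key(domain: str) -> bytes:
    """Derive a per-domain key from the master key (HMAC-SHA256 as a simple KDF)."""
    return hmac.new(MASTER_KEY, domain.encode(), hashlib.sha256).digest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream for illustration only; not a substitute for a real AEAD.
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


class SealedCache:
    def __init__(self, domain: str, kv_bytes: bytes):
        key = domain_key(domain)
        self._nonce = os.urandom(16)
        stream = _keystream(key, self._nonce, len(kv_bytes))
        self._ciphertext = bytes(a ^ b for a, b in zip(kv_bytes, stream))
        # Toy shortcut: one key both encrypts and authenticates.
        self._tag = hmac.new(key, self._nonce + self._ciphertext, hashlib.sha256).digest()

    def hint(self) -> dict:
        """Visible to any agent: existence and size only."""
        return {"exists": True, "size": len(self._ciphertext)}

    def open(self, credential: bytes) -> bytes:
        """Decrypt only if the credential is the key for this cache's domain."""
        data = self._nonce + self._ciphertext
        expected = hmac.new(credential, data, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, self._tag):
            raise PermissionError("credential does not match cache domain")
        stream = _keystream(credential, self._nonce, len(self._ciphertext))
        return bytes(a ^ b for a, b in zip(self._ciphertext, stream))


sealed = SealedCache("finance/confidential", b"kv-tensor-bytes")
finance_cred = domain_key("finance/confidential")
marketing_cred = domain_key("marketing/public")
print(sealed.hint())              # {'exists': True, 'size': 15}
print(sealed.open(finance_cred))  # b'kv-tensor-bytes'
```

Calling `sealed.open(marketing_cred)` raises `PermissionError`, while `hint()` still works for every agent. That matches the behavior described above for agents outside the domain.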

Second line, runtime attention probe detector: the backend monitors attention patterns in real time and flags those characteristic of probing. A typical probe uses a pseudo-random prompt structure that maximizes attention variation on targeted KV slots. LCGuard detects this pattern with 95+ percent precision, and the authors document a low false positive rate on 50,000 legitimate queries.
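One simple way to picture a detector keyed on "attention variation on targeted KV slots" is to track how much attention mass falls on protected slots at each decoding step and flag requests where that mass swings widely. The sketch below is a hypothetical heuristic, not the detector from the paper. The threshold, the function names and the sample attention rows are all invented for illustration.

```python
from statistics import pvariance
from typing import List, Sequence


def protected_mass(weights: Sequence[float], protected: Sequence[int]) -> float:
    """Share of one decoding step's attention that lands on protected KV slots."""
    return sum(weights[i] for i in protected)


def looks_like_probe(steps: List[Sequence[float]], protected: Sequence[int],
                     threshold: float = 0.05) -> bool:
    """Flag a request whose attention on protected slots swings widely across steps."""
    masses = [protected_mass(w, protected) for w in steps]
    return pvariance(masses) > threshold


protected_slots = [2, 3]

# Each row is one decoding step's attention distribution over four KV slots.
benign = [[0.40, 0.40, 0.10, 0.10],
          [0.45, 0.35, 0.10, 0.10],
          [0.40, 0.40, 0.12, 0.08]]
probing = [[0.05, 0.05, 0.80, 0.10],
           [0.50, 0.45, 0.03, 0.02],
           [0.02, 0.03, 0.05, 0.90]]

print(looks_like_probe(benign, protected_slots))   # False
print(looks_like_probe(probing, protected_slots))  # True
```

A production detector would need calibration against real traffic to reach the precision the authors report. A fixed variance threshold is only the simplest possible stand-in.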

Implementation overhead and compatibility

LCGuard requires modification of the attention layer in the inference engine (vLLM, TGI, SGLang). The authors have released a reference implementation for vLLM. Throughput overhead is 8–12 percent in the worst-case scenario (all cache encrypted) or 3–5 percent in a typical scenario (a mix of encrypted and plain cache regions). This is an acceptable cost for enterprise tenants that must meet compliance requirements.
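A back-of-the-envelope sketch shows how much of the sharing benefit survives the guard. It assumes cost per request is inversely proportional to throughput, and that the 3–5× reduction and the overhead figures can be multiplied together. The paper may not have measured them on the same workload.

```python
def net_cost_reduction(sharing_reduction: float, throughput_overhead: float) -> float:
    """Cost reduction left after paying the guard's throughput overhead.

    Assumes cost per request is inversely proportional to throughput.
    """
    return sharing_reduction * (1.0 - throughput_overhead)


scenarios = [("typical, low", 0.03), ("typical, high", 0.05),
             ("worst case, low", 0.08), ("worst case, high", 0.12)]
for label, overhead in scenarios:
    low = net_cost_reduction(3.0, overhead)
    high = net_cost_reduction(5.0, overhead)
    print(f"{label}: {low:.2f}x to {high:.2f}x")
```

Under these assumptions, even the worst case of 12 percent overhead applied to a 3× reduction leaves about 2.64×, so the guard gives back only a small part of what sharing saves.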

The paper concludes with a recommendation: LCGuard should become the default for enterprise deployments that use shared KV cache across security domains. Without this defense, organizations are unknowingly violating their own data classification policies.

arXiv:2605.22786: LCGuard protects shared KV cache between agents in multi-agent systems from data leakage
