🤖 Models · Wednesday, May 6, 2026 · 2 min read

Microsoft Research: DroidSpeak shares KV cache across fine-tuned LLM variants for 4× higher throughput

Editorial illustration: diagram of KV cache sharing across multiple fine-tuned variants of the same base LLM in a data center.

Microsoft Research presented DroidSpeak at NSDI 2026 — a system that shares KV cache across architecturally identical fine-tuned LLM variants, achieving up to 4× higher throughput with minimal quality loss in enterprise scenarios running dozens of domain-specific models.


This article was generated using artificial intelligence from primary sources.

Microsoft Research presented eleven papers on networked systems at USENIX NSDI 2026, with DroidSpeak standing out — a system for sharing KV cache across fine-tuned variants of the same base language model.

KV cache (Key-Value cache) is a memory structure that stores the computed attention keys and values for each prompt token, so decoding steps do not recompute attention projections for tokens already processed; reusing it across requests also avoids repeating the prefill phase from scratch.
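To make the idea concrete, here is a minimal sketch of single-head attention decoding with a KV cache. This is a toy illustration of the general mechanism, not DroidSpeak or any production implementation; all names (`attention`, `decode_with_cache`, the weight matrices) are hypothetical.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector
    # over all cached keys K (t, d) and values V (t, d).
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_with_cache(tokens, Wq, Wk, Wv):
    # Keys and values for each processed token are projected once,
    # appended to the cache, and reused by every later step --
    # without the cache, each step would redo all earlier projections.
    K_cache, V_cache, outputs = [], [], []
    for x in tokens:
        K_cache.append(Wk @ x)  # computed once per token, never redone
        V_cache.append(Wv @ x)
        q = Wq @ x
        outputs.append(attention(q, np.array(K_cache), np.array(V_cache)))
    return outputs
```

The cached path produces exactly the same outputs as recomputing every key and value at each step; the saving is purely in skipped recomputation, which is why cache reuse does not by itself change model quality.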

What problem does DroidSpeak solve?

The problem is a familiar one in enterprise environments: organizations often deploy dozens of fine-tuned variants of the same LLM for different business domains (legal, support, marketing, internal procedures). Traditionally, each variant computes its own KV cache from scratch on every request, consuming GPU time and limiting overall system throughput.

DroidSpeak recognizes that architecturally identical models share significant overlap in prefill phases and introduces a sharing mechanism that recycles computations across variants.
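The sharing idea can be sketched as a cache registry keyed by base model and prompt, so that any variant of the same base can reuse a prefill computed by a sibling. This is an assumption-laden toy, not DroidSpeak's actual mechanism: the paper's system decides which layers are safe to reuse and manages quality, whereas this sketch (class `SharedKVCacheRegistry` and all its names are hypothetical) only shows the reuse pattern.

```python
import hashlib

class SharedKVCacheRegistry:
    """Toy registry: fine-tuned variants of the same base model share
    prefill KV caches for identical prompts, instead of each variant
    recomputing its own cache from scratch."""

    def __init__(self):
        # (base_model, prompt_digest) -> cached KV structure
        self._store = {}

    def _key(self, base_model, prompt):
        digest = hashlib.sha256(prompt.encode()).hexdigest()
        return (base_model, digest)

    def get_or_compute(self, base_model, variant, prompt, prefill_fn):
        key = self._key(base_model, prompt)
        if key in self._store:
            # Cache hit: a sibling variant already paid for this prefill.
            return self._store[key], True
        # Cache miss: run prefill once and publish it for all variants.
        kv = prefill_fn(variant, prompt)
        self._store[key] = kv
        return kv, False
```

Note that the key deliberately uses the base model, not the variant: architecturally identical fine-tunes map to the same entry, which is the overlap DroidSpeak exploits.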

How large is the performance gain?

The system achieves up to 4× higher throughput with minimal quality loss compared to running each variant with a fully isolated cache. This means the same hardware infrastructure can serve significantly more users or more parallel domain models without additional GPUs.

The technique is especially relevant for organizations maintaining a portfolio of specialized models on shared infrastructure.

What else did Microsoft present at NSDI 2026?

Alongside DroidSpeak, the same post describes the AVA system, which combines event knowledge graphs with agentic retrieval for long-video analytics, achieving 75.8% accuracy on the AVA-100 benchmark with videos over 10 hours long.

Eywa was also presented — a tool using LLMs to automatically test network protocols, finding 33 bugs, 16 of which were previously unknown. All three papers demonstrate how Microsoft Research is extending LLM techniques beyond pure NLP into engineering disciplines.

Frequently Asked Questions

What is KV cache in the context of LLMs?
KV cache (Key-Value cache) is a memory structure that stores computed attention keys and values for each prompt token, avoiding recomputation of the prefill phase on every request.
Why is KV cache sharing possible between fine-tuned variants?
Fine-tuned variants of the same base model have identical architecture and significant overlap in prefill computations, so DroidSpeak identifies these overlaps and reuses the cache instead of recomputing it from scratch.
What is NSDI 2026?
The USENIX Symposium on Networked Systems Design and Implementation 2026, the leading academic conference for networked systems research, where Microsoft presented eleven papers.