arXiv MedMemoryBench: memory in medical AI agents

MedMemoryBench is the first benchmark for memory mechanisms in personalized healthcare agents, published on arXiv on 12 May 2026. A team from Zhejiang University built a dataset of approximately 2,000 sessions and 16,000 turns through a human-agent collaborative pipeline. The main finding: mainstream AI architectures show "memory saturation", in which a continuous influx of information degrades performance in medical reasoning.

On 12 May 2026, Yihao Wang, Haoran Xu, Renjie Gu, Yixuan Ye, Xinyi Chen, Xinyu Mu and collaborators published MedMemoryBench, the first systematic benchmark for memory mechanisms in personalized healthcare AI agents. The paper reveals that mainstream architectures have serious bottlenecks in high-stakes medical scenarios.

What gap does MedMemoryBench fill?

Existing agent memory benchmarks focus on everyday conversations and do not capture the complexity of real-world medical applications. Healthcare has specific requirements: retaining therapy protocols over weeks, integrating lab results, tracking contraindications, and maintaining the context of a patient’s medical history. MedMemoryBench builds its dataset around these challenges: about 2,000 sessions and 16,000 interaction turns, generated through a human-agent collaborative pipeline from clinically grounded synthetic patient profiles.

What is memory saturation?

The main finding of the paper is the “memory saturation” phenomenon: beyond a certain point, a continuous influx of information degrades performance instead of enhancing it. The agent fails to extract signal from the accumulated history, which in medical reasoning directly reduces precision. Saturation indicates that classical memory architectures (RAG, vector stores, sliding windows) lack a mechanism for prioritization or compaction in high-stakes domains.
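One simple way to make "beyond a certain point" concrete is to look for the peak in a series of per-step scores. The sketch below is illustrative only: the function name `find_saturation_point`, the `tolerance` parameter, and the example scores are assumptions made here, not the paper's method or its results.

```python
from typing import List, Optional


def find_saturation_point(scores: List[float], tolerance: float = 0.0) -> Optional[int]:
    """Return the index of the peak score if performance has since declined.

    `scores` holds one accuracy value per evaluation step, in the order
    memory was accumulated. Returns None when the final score is within
    `tolerance` of the peak, i.e. no decline is visible.
    """
    if not scores:
        return None
    # Index of the first occurrence of the highest score.
    peak_index = max(range(len(scores)), key=lambda i: scores[i])
    if scores[peak_index] - scores[-1] > tolerance:
        return peak_index
    return None


# Invented scores for illustration: accuracy rises, then falls as history grows.
example_scores = [0.62, 0.71, 0.78, 0.74, 0.66]
print(find_saturation_point(example_scores))        # 2
print(find_saturation_point([0.50, 0.60, 0.70]))    # None
```

A real analysis would likely smooth the curve or use a nonzero `tolerance` so that small fluctuations are not mistaken for saturation.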

How does the evaluate-while-constructing protocol differ?

The new evaluate-while-constructing “streaming assessment protocol” mimics production systems, where memory grows during use. Classical static evaluation instead fixes the full memory before testing begins. Because the agent is tested while its memory accumulates, the protocol can track degradation over time and identify the saturation point.
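The difference between the two evaluation styles can be sketched with a toy example. Everything here is hypothetical: the `Session` type, the `SlidingWindowMemory` class, the `key=value` fact format, and the patient data are assumptions for illustration and are not taken from the benchmark. The toy memory simply forgets old facts, which is a much cruder failure than the saturation the paper describes, but it shows what a per-session curve reveals that a single static score does not.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Session:
    facts: List[str]               # "key=value" strings written to memory
    probes: List[Tuple[str, str]]  # (key asked about, expected fact)


class SlidingWindowMemory:
    """Toy memory that keeps only the most recent `capacity` facts."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: List[str] = []

    def write(self, fact: str) -> None:
        self.items.append(fact)
        self.items = self.items[-self.capacity:]

    def answer(self, key: str) -> str:
        # Most recent stored fact for this key, or "" if it was evicted.
        for fact in reversed(self.items):
            if fact.startswith(key + "="):
                return fact
        return ""


def accuracy(memory: SlidingWindowMemory, probes: List[Tuple[str, str]]) -> float:
    if not probes:
        return 0.0
    correct = sum(1 for key, expected in probes if memory.answer(key) == expected)
    return correct / len(probes)


def evaluate_streaming(sessions: List[Session], capacity: int) -> List[float]:
    """Probe after each session, while memory is still being built."""
    memory = SlidingWindowMemory(capacity)
    curve = []
    for session in sessions:
        for fact in session.facts:
            memory.write(fact)
        curve.append(accuracy(memory, session.probes))
    return curve


def evaluate_static(sessions: List[Session], capacity: int) -> float:
    """Load every fact first, then run all probes once."""
    memory = SlidingWindowMemory(capacity)
    for session in sessions:
        for fact in session.facts:
            memory.write(fact)
    all_probes = [probe for session in sessions for probe in session.probes]
    return accuracy(memory, all_probes)


# Invented patient history, for illustration only.
SESSIONS = [
    Session(["allergy=penicillin"], [("allergy", "allergy=penicillin")]),
    Session(["drug=metformin", "hba1c=7.2"],
            [("allergy", "allergy=penicillin"), ("drug", "drug=metformin")]),
    Session(["bp=130/85", "weight=82kg"],
            [("allergy", "allergy=penicillin"), ("hba1c", "hba1c=7.2")]),
]

print(evaluate_streaming(SESSIONS, capacity=3))  # [1.0, 1.0, 0.5]
print(evaluate_static(SESSIONS, capacity=3))     # 0.2
```

The streaming curve shows that the allergy fact was still answerable after the second session and lost during the third, while the static run collapses everything into one number.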

Comprehensive benchmarking shows that mainstream architectures have significant bottlenecks in handling complex medical reasoning and in robustness to noisy data. This suggests the need for domain-specific memory design if healthcare AI agents are to become production-ready.

Frequently Asked Questions

What is memory saturation in medical agents?

Memory saturation is a phenomenon observed in the MedMemoryBench evaluation: beyond a certain point, a continuous influx of new medical information degrades agent performance. The system fails to extract signal from the accumulated history, and reasoning precision drops.

How does MedMemoryBench differ from existing benchmarks?

Existing benchmarks measure everyday conversations rather than high-stakes medical applications; MedMemoryBench uses an 'evaluate-while-constructing streaming assessment' that mimics dynamic memory accumulation in production systems instead of static evaluation.

arXiv:2605.11814 MedMemoryBench reveals memory saturation in medical agents — 2,000 sessions, 16,000 turns
