arXiv:2605.07313: agent memory does not scale — HippoRAG loses 16–20 pp reliability as irrelevant sessions accumulate
arXiv:2605.07313 presents a scale-conditioned evaluation protocol that tests whether agent memory systems remain functional as irrelevant data accumulates. HippoRAG loses 16–20 percentage points of budget-compliant reliability, while LiCoMemory's behavior depends on model size. The authors (Shao, Lu, Zhang, Luo) conclude that reliability loss is not an isolated phenomenon.
This article was generated using artificial intelligence from primary sources.
A new arXiv paper (arXiv:2605.07313) asks a sharp question: do agent memory systems perform well when irrelevant data accumulates? Authors Jiaqi Shao, Yiyi Lu, Yunzhen Zhang, and Bing Luo present a scale-conditioned evaluation protocol that measures not just static accuracy but “whether the evidence is usable as irrelevant sessions accumulate.”
What the benchmark measures
The protocol evaluates three types of memory interfaces — flat, planar, and hierarchical — across multiple systems. It measures four diagnostic quantities: budget-compliant reliability, memory call load at scale extremes, error mode classification, and the boundary of usable scale.
Main findings: HippoRAG and LiCoMemory
HippoRAG stays within the call budget but loses 16–20 percentage points of budget-compliant reliability as irrelevant sessions accumulate. In other words, it keeps operating within its limits but returns progressively fewer correct answers under the same call constraints.
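To make the metric concrete, here is a minimal sketch of how budget-compliant reliability and a percentage-point drop could be computed. The `Episode` record layout, the function names, and the idea of a single per-question call budget are assumptions made for illustration; the paper's own implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One question answered by the agent (hypothetical record layout)."""
    correct: bool      # did the agent give the right answer?
    memory_calls: int  # memory calls the agent issued for this question


def budget_compliant_reliability(episodes: list[Episode], call_budget: int) -> float:
    """Fraction of episodes answered correctly without exceeding the call budget."""
    if not episodes:
        raise ValueError("need at least one episode")
    # An episode counts only if it is both correct and within budget.
    ok = sum(1 for e in episodes if e.correct and e.memory_calls <= call_budget)
    return ok / len(episodes)


def drop_in_percentage_points(before: float, after: float) -> float:
    """Absolute difference between two rates, expressed in percentage points."""
    return (before - after) * 100.0
```

Under this definition a correct answer that needed more calls than the budget allows counts as a failure, which is why a system can look accurate in a static test and still lose reliability here.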
LiCoMemory's behavior depends on model size: smaller models (Qwen3-8B) exceed the budget, while larger ones remain reliable within the tested range. Put differently, the smaller model compensates for weaker memory use by issuing more calls, which pushes it past practical limits.
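The two failure patterns above (reliability falling while calls stay in budget, and calls exceeding the budget) can both be folded into a single "usable scale" check. The sketch below is one possible reading of that idea, not the paper's definition: the reliability floor, the use of mean calls per question, and the function name are all assumptions.

```python
from typing import Optional


def usable_scale_boundary(
    results: dict[int, tuple[float, float]],
    min_reliability: float,
    call_budget: float,
) -> Optional[int]:
    """Largest tested scale up to which every smaller tested scale is still usable.

    `results` maps a scale (number of irrelevant sessions) to a pair of
    (budget-compliant reliability, mean memory calls per question).
    Returns None if even the smallest tested scale fails.
    """
    boundary = None
    for scale in sorted(results):
        reliability, mean_calls = results[scale]
        # Stop at the first scale that violates either condition.
        if reliability < min_reliability or mean_calls > call_budget:
            break
        boundary = scale
    return boundary
```

With this formulation, a system that degrades in reliability and a system that overspends its call budget both produce a finite boundary, even though they fail for different reasons.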
Conditional scalability
The team concludes that “reliability loss is not an isolated phenomenon” and advocates for conditional scalability claims — scalability statements that specify the agent configuration, interface design, scale ranges, and interaction constraints they apply to. For production systems, this means generic claims like “our memory scales” are no longer sufficient — the conditions and system context must be stated.
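One way to picture a conditional scalability claim is as a small record that carries its own conditions. The sketch below is purely illustrative: the class name, field names, and the `covers` check are assumptions, not something the paper specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalabilityClaim:
    """A scalability statement together with the conditions it was tested under."""
    agent_config: str             # e.g. the backbone model used by the agent
    interface: str                # "flat", "planar", or "hierarchical"
    scale_range: tuple[int, int]  # (min, max) irrelevant sessions tested
    call_budget: int              # memory calls per question the claim assumes

    def covers(self, agent_config: str, interface: str,
               scale: int, call_budget: int) -> bool:
        """True if a deployment falls inside the tested conditions."""
        low, high = self.scale_range
        return (
            agent_config == self.agent_config
            and interface == self.interface
            and low <= scale <= high
            # The deployment must allow at least the budget the claim relies on.
            and call_budget >= self.call_budget
        )
```

A deployment that uses a different model, a larger scale, or a tighter call budget than the one tested would fall outside the claim, which is the point of stating the conditions explicitly.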
Frequently Asked Questions
- What is HippoRAG and how does it behave?
- HippoRAG is an agent memory system inspired by human hippocampal recall. In the new benchmark it stays within the allowed call budget but loses 16–20 percentage points of budget-compliant reliability as irrelevant sessions accumulate — making it brittle in long-running agent deployments.
- What distinguishes flat, planar, and hierarchical memory interfaces?
- Flat memory stores records in a single list (retrieval scales linearly). Planar adds grouping or indexes at one level. Hierarchical organizes memory into a tree or multiple levels of summarization. The paper evaluates all three approaches under the same scale-conditioned protocol.
- Why budget-compliant reliability?
- Agents operate under call constraints — a memory query has a cost. Budget-compliant reliability measures how often an agent gets the correct answer within the permitted number of memory calls. If a system 'cheats' by making 100 memory calls, it may achieve formal accuracy but is not viable in production.
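The flat versus planar distinction described in the FAQ above can be illustrated with a toy example. This is a sketch under loud assumptions: the class names, the topic-based grouping, and the keyword match are invented for illustration and do not reproduce any system evaluated in the paper.

```python
from collections import defaultdict


class FlatMemory:
    """All records in one list; every query scans the whole list."""

    def __init__(self) -> None:
        self.records: list[tuple[str, str]] = []  # (topic, text)

    def add(self, topic: str, text: str) -> None:
        self.records.append((topic, text))

    def query(self, topic: str, keyword: str) -> tuple[list[str], int]:
        # The topic is ignored here: a flat store has no grouping to exploit.
        hits = [text for _, text in self.records if keyword in text]
        return hits, len(self.records)  # (matches, records examined)


class PlanarMemory:
    """Records grouped by topic at one level; a query scans one group only."""

    def __init__(self) -> None:
        self.groups: dict[str, list[str]] = defaultdict(list)

    def add(self, topic: str, text: str) -> None:
        self.groups[topic].append(text)

    def query(self, topic: str, keyword: str) -> tuple[list[str], int]:
        group = self.groups.get(topic, [])
        hits = [text for text in group if keyword in text]
        return hits, len(group)  # (matches, records examined)
```

In this toy setup, adding irrelevant sessions under other topics increases the number of records the flat store examines per query, while the grouped store's work depends only on the size of the matching group. A hierarchical store would add further levels on top of the same idea.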