arXiv:2605.12061 SAGE: self-evolving graph-memory engine reaches 91.6% Recall@5 on Natural Questions
SAGE is a self-evolving graph-memory engine for LLM agents, published on arXiv on 12 May 2026 by Juntong Wang and collaborators. It couples a memory writer with a memory reader (a Graph Foundation Model) in a feedback loop that autonomously expands and reorganizes the memory graph. In zero-shot open-domain retrieval it achieves 82.5/91.6 Recall@2/5 on Natural Questions, with further improvements on the long-term memory and hallucination metrics of LongMemEval and HaluMem.
This article was generated using artificial intelligence from primary sources.
On 12 May 2026, Juntong Wang, Haoyue Zhao, Guanghui Pan, Xiyuan Wang, Yanbo Wang, Qiyan Deng and Muhan Zhang published SAGE, a self-evolving graph-memory engine that addresses the long-term memory limits of language agents and the interplay between structured retrieval and agent feedback.
Why is classical GraphRAG not enough?
Classical RAG and GraphRAG systems treat memory graphs as a static retrieval index — once built, the graph does not change, so the agent cannot introduce new connections or reorganize knowledge. SAGE starts from the premise that graph-structural roles (e.g., a node as an entity, an edge as a relation, a neighborhood as context) are a reusable signal that allows memory to mature through interaction.
How do the memory writer and memory reader work?
SAGE connects two components in a feedback loop. The memory writer incrementally builds a structured graph memory from the agent’s interaction history by adding nodes, edges and structural annotations. The memory reader uses a Graph Foundation Model for retrieval and, crucially, returns feedback to the writer: which nodes and edges were useful for an answer, and where the structure broke down. This reader-writer communication is what lets the memory evolve autonomously.
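The loop described above can be illustrated with a deliberately small Python sketch. It is not the authors' implementation: the names `GraphMemory`, `write`, `read` and `apply_feedback` are hypothetical, the reader is a plain edge-weight ranking standing in for the Graph Foundation Model, and the feedback rule (reinforce useful edges, decay and prune the rest) is an assumption chosen only to show how reader feedback could reshape the graph.

```python
from dataclasses import dataclass, field


@dataclass
class GraphMemory:
    """Toy graph memory: directed edges between entity names, each with a weight."""
    edges: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str, weight: float = 1.0) -> None:
        # Keep the existing weight if the edge is already present.
        self.edges.setdefault((src, dst), weight)

    def neighbors(self, node: str) -> list[tuple[str, float]]:
        return [(dst, w) for (src, dst), w in self.edges.items() if src == node]


def write(memory: GraphMemory, interaction: list[str]) -> None:
    """Hypothetical writer: link consecutive entities mentioned in one interaction."""
    for src, dst in zip(interaction, interaction[1:]):
        memory.add_edge(src, dst)


def read(memory: GraphMemory, query: str, k: int = 2) -> list[str]:
    """Hypothetical reader: top-k neighbors by edge weight (stand-in for a GFM)."""
    ranked = sorted(memory.neighbors(query), key=lambda pair: pair[1], reverse=True)
    return [dst for dst, _ in ranked[:k]]


def apply_feedback(memory: GraphMemory, query: str, retrieved: list[str],
                   useful: set[str], step: float = 0.5) -> None:
    """Assumed feedback rule: reinforce useful edges, decay the rest, prune at zero."""
    for dst in retrieved:
        key = (query, dst)
        if key not in memory.edges:
            continue
        memory.edges[key] += step if dst in useful else -step
        if memory.edges[key] <= 0:
            del memory.edges[key]


if __name__ == "__main__":
    memory = GraphMemory()
    write(memory, ["SAGE", "graph memory", "retrieval"])
    write(memory, ["SAGE", "hallucination"])
    for round_index in range(2):
        retrieved = read(memory, "SAGE")
        print(round_index, retrieved)
        # Pretend only "hallucination" helped answer the question.
        apply_feedback(memory, "SAGE", retrieved, useful={"hallucination"})
    print("final", read(memory, "SAGE"))
```

In this toy run the edge that never helps is first demoted and then pruned, a small-scale picture of a graph being reorganized by reader feedback instead of staying a static index.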
What do the benchmarks show?
In zero-shot open-domain retrieval on Natural Questions, SAGE achieves 82.5 Recall@2 and 91.6 Recall@5. On multi-hop QA it reaches the best average rank after two rounds of self-evolution, which suggests that iterative feedback improves graph quality. Long-term memory and hallucination metrics also improved on the LongMemEval and HaluMem benchmarks.
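For readers unfamiliar with the metric, here is a minimal Python sketch of Recall@k. It assumes the definition commonly used in open-domain retrieval, where a query counts as a hit if at least one relevant item appears among its top-k results; the article does not state which variant the paper uses. The function name `recall_at_k` and the toy data are illustrative only.

```python
def recall_at_k(ranked_results: list[list[str]],
                relevant_sets: list[set[str]],
                k: int) -> float:
    """Fraction of queries with at least one relevant item in the top-k results."""
    if not ranked_results or len(ranked_results) != len(relevant_sets):
        raise ValueError("need one non-empty, aligned entry per query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_results, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_results)


# Toy data: four queries, each with a ranked list of passage ids.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
relevant = [{"a"}, {"f"}, {"x"}, {"k"}]

print(recall_at_k(ranked, relevant, k=2))  # 0.5
print(recall_at_k(ranked, relevant, k=3))  # 0.75
```

Under this definition a score such as 91.6 Recall@5 would mean that for 91.6% of questions a relevant item appeared somewhere in the top five results.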
Training and reader-writer feedback improved several performance metrics at once, positioning SAGE’s graph memory as a candidate foundation for long-horizon language agents, that is, agents that must situate each interaction within a growing network of prior knowledge.
Frequently Asked Questions
- How does SAGE differ from classical GraphRAG systems?
- Classical RAG and GraphRAG systems treat memory graphs as static retrieval indexes; SAGE treats them as a dynamic long-term memory substrate that self-evolves by expanding and reorganizing through interaction, exploiting structural roles in the graph for better retention.
- What are the concrete benchmark results?
- Zero-shot open-domain retrieval on Natural Questions reached 82.5 Recall@2 and 91.6 Recall@5; multi-hop QA achieved the best average rank after two rounds of self-evolution; long-term memory and hallucination metrics improved on LongMemEval and HaluMem benchmarks.