LongMINT: AI agents average 27.9% on memory benchmark

Researchers at the University of North Carolina have published LongMINT — the first benchmark that systematically measures how poorly AI agents manage memory in long, dynamic scenarios. Average accuracy is just 27.9%, worse than random guessing in many cases.

What is LongMINT and what does it measure?

LongMINT (Memory under Multi-Target Interference in Long-Horizon Agent Systems) is a benchmark with 15,600 question-answer pairs and an average context of 138,800 tokens, with up to 1.8 million tokens per example. Seven categories of systems are tested, including plain language models, RAG systems, and memory-augmented agents.

A long-horizon agent is one that must retain accurate information over a long sequence of steps, for example when tracking state, holding a multi-turn dialogue, or following code through version control. Multi-target interference describes a situation where multiple pieces of information interfere with each other: later data revises earlier data, and the system must know which version of each item is currently valid.
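To make multi-target interference concrete, here is a minimal Python sketch. It is an illustration only, not code from the LongMINT paper. It assumes the interaction history can be modeled as a list of numbered updates; the names `Update` and `current_state` and the example targets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    step: int    # position in the interaction history
    target: str  # which piece of information is being set
    value: str   # the value asserted at that step


def current_state(history: list[Update]) -> dict[str, str]:
    """Replay updates in step order so later revisions overwrite earlier ones."""
    state: dict[str, str] = {}
    for update in sorted(history, key=lambda u: u.step):
        state[update.target] = update.value
    return state


# Two targets whose revisions are interleaved in the same history.
history = [
    Update(1, "meeting_room", "A101"),
    Update(2, "deadline", "Friday"),
    Update(3, "meeting_room", "B202"),
    Update(4, "deadline", "Monday"),
    Update(5, "meeting_room", "C303"),
]

print(current_state(history))
```

Replaying this history leaves `meeting_room` at `C303` and `deadline` at `Monday`. An agent that answers `A101` or `Friday` has recalled a value that really appeared in its context, but one that is no longer valid.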

Why 27.9% accuracy is not surprising

The root problem is not context length but updates. When the same piece of information changes multiple times — which is normal in any real environment — agents consistently “remember” the wrong, outdated value. The more updates there are, the worse the accuracy. The bottleneck is retrieval and memory reconstruction, not just storage.
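The difference between storing and retrieving can be sketched with a toy model in Python. It is not the benchmark's method and it does not reproduce its numbers. It assumes a memory that keeps every version of a fact and never deletes anything; all function names and the `api_key_owner` target are hypothetical. Under the further assumption that a retriever cannot tell versions apart and picks uniformly among them, the chance of returning the valid value is 1/n after n updates.

```python
from fractions import Fraction


def build_memory(target: str, num_updates: int) -> list[tuple[int, str, str]]:
    """Store every version of one fact as (step, target, value); nothing is deleted."""
    return [(step, target, f"v{step}") for step in range(1, num_updates + 1)]


def retrieve_first_match(memory, target):
    """Stale-prone lookup: return the first stored entry for the target."""
    for _step, key, value in memory:
        if key == target:
            return value
    return None


def retrieve_latest(memory, target):
    """Recency-aware lookup: return the value with the highest step for the target."""
    matches = [(step, value) for step, key, value in memory if key == target]
    return max(matches)[1] if matches else None


def chance_of_valid_pick(num_updates: int) -> Fraction:
    """Probability that a uniform pick among all stored versions is the valid one."""
    return Fraction(1, num_updates)


for n in (1, 2, 5, 10):
    memory = build_memory("api_key_owner", n)
    print(
        n,
        retrieve_first_match(memory, "api_key_owner"),
        retrieve_latest(memory, "api_key_owner"),
        float(chance_of_valid_pick(n)),
    )
```

Both lookups read the same stored data, so storage is not what separates them. Only the lookup that accounts for recency keeps returning the valid version as the number of updates grows.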

What this means for agent development

LongMINT exposes a fundamental limitation of the current generation of AI agents: they are not reliable in tasks where information evolves. This directly affects all systems that present themselves as “autonomous assistants” for long-horizon tasks — from coding to business processes. Until the memory layer becomes robust to interference, agents remain tools for short sessions, not for continuous work.

arXiv:2605.18565: LongMINT — why AI agents forget everything you tell them
