MEMTIER: agent memory from 0.05 to 0.38 on LongMemEval

MEMTIER is a five-tier memory architecture for long-running autonomous agents. On the LongMemEval-S benchmark with Qwen2.5-7B, accuracy rises from 0.050 to 0.382, and the tool execution success rate stops declining after 72 hours of operation.

A paper appearing on arXiv is the first to systematically document a problem specific to long-running autonomous agents: the tool execution success rate drops by 14 percentage points over a 72-hour operation window. The root cause is that classical RAG systems do not distinguish between short-term and long-term memory, so old context drowns out relevant signals.
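To make the "percentage points over a window" metric concrete, here is a minimal sketch of how such a drop could be measured from an agent's tool-call log. The log format, the `success_rate` helper, and the event data are all hypothetical and synthetic; they are not taken from the paper.

```python
def success_rate(events, start_hour, end_hour):
    """Share of successful tool calls with start_hour <= hour < end_hour."""
    window = [ok for hour, ok in events if start_hour <= hour < end_hour]
    return sum(window) / len(window) if window else float("nan")


# Synthetic log, NOT data from the paper: (hours since agent start, call succeeded?)
events = [
    (1.0, True), (2.0, True), (3.0, True), (5.0, False),
    (70.0, True), (71.0, False), (71.5, True), (71.9, False),
]

early = success_rate(events, 0, 24)    # first day of operation
late = success_rate(events, 48, 72)    # third day of operation
drop_pp = (early - late) * 100         # difference in percentage points
print(f"early={early:.2f} late={late:.2f} drop={drop_pp:.1f} pp")
```

A percentage-point drop is the plain difference between two rates, so a fall from 0.75 to 0.50 in this toy log is 25 points, not a 25 percent relative change.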

What does the five-tier architecture deliver?

MEMTIER's five-tier design includes an episodic JSONL layer for raw records, cognitively weighted retrieval over five signals (recency, frequency, salience, emotion, task-relevance), a PPO-based policy that adapts the signal weights, and asynchronous consolidation of episodes into semantic memory, which runs outside the agent's main loop.
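The episodic log and the weighted retrieval step can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Episode` fields, the signal formulas (exponential recency decay, word overlap for task relevance), and the fixed weights are not the paper's definitions. In MEMTIER the weights are reportedly adapted by a learned policy rather than set by hand.

```python
import io
import json
from dataclasses import dataclass, asdict

SIGNALS = ("recency", "frequency", "salience", "emotion", "task_relevance")


@dataclass
class Episode:                # hypothetical record layout
    text: str
    timestamp: float          # seconds since an arbitrary start
    access_count: int         # how often this episode was retrieved before
    salience: float           # 0..1, assumed to be supplied upstream
    emotion: float            # 0..1, assumed to be supplied upstream


def append_episode(stream, episode):
    """Append one raw record to the episodic log as a single JSON line."""
    stream.write(json.dumps(asdict(episode)) + "\n")


def load_episodes(stream):
    """Read every non-empty JSON line back into Episode objects."""
    return [Episode(**json.loads(line)) for line in stream if line.strip()]


def signal_values(episode, query, now, half_life=3600.0):
    """Compute the five signals, each scaled to the range 0..1."""
    age = max(0.0, now - episode.timestamp)
    query_words = set(query.lower().split())
    episode_words = set(episode.text.lower().split())
    overlap = len(query_words & episode_words) / max(1, len(query_words))
    return {
        "recency": 0.5 ** (age / half_life),                    # halves every half_life seconds
        "frequency": 1.0 - 1.0 / (1.0 + episode.access_count),  # saturates toward 1
        "salience": episode.salience,
        "emotion": episode.emotion,
        "task_relevance": overlap,                              # crude word overlap
    }


def score(episode, query, now, weights):
    """Weighted sum of the five signals."""
    values = signal_values(episode, query, now)
    return sum(weights[name] * values[name] for name in SIGNALS)


def retrieve(episodes, query, now, weights, k=1):
    """Return the k highest-scoring episodes."""
    ranked = sorted(episodes, key=lambda e: score(e, query, now, weights), reverse=True)
    return ranked[:k]


# An in-memory buffer stands in for a .jsonl file on disk.
log = io.StringIO()
append_episode(log, Episode("user prefers invoices as PDF", 0.0, 3, 0.8, 0.1))
append_episode(log, Episode("weather was rainy today", 7000.0, 0, 0.1, 0.2))
log.seek(0)
episodes = load_episodes(log)

weights = {name: 0.2 for name in SIGNALS}   # equal weights, chosen by hand
top = retrieve(episodes, "send invoices as PDF", now=7200.0, weights=weights)
print(top[0].text)  # prints: user prefers invoices as PDF
```

With equal weights the older but relevant episode wins. Putting all the weight on recency instead would surface the newer, irrelevant one, which is the kind of trade-off an adaptive weighting policy is meant to resolve.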

RAG (Retrieval-Augmented Generation) is an architecture where the model retrieves relevant documents from an external store before generating a response. PPO (Proximal Policy Optimization) is a standard reinforcement-learning algorithm — here it trains the agent to weight retrieval signals.
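For reference, the standard clipped surrogate objective that PPO maximizes is shown below. This is the general form of the algorithm, not a detail from the MEMTIER paper; how the authors define states, actions, and rewards for the weighting policy is not covered in this article.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping range, which keeps each policy update close to the previous policy.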

How large are the accuracy gains?

On the LongMemEval-S benchmark (500 questions), running Qwen2.5-7B on consumer hardware, accuracy rises from a baseline of 0.050 to 0.382, roughly a 7.6-fold improvement. That suggests practical long-running agents may be feasible without enterprise infrastructure.
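A quick back-of-the-envelope calculation puts these figures in terms of questions answered. It assumes that accuracy here is simply the fraction of the 500 questions answered correctly, which the article does not state explicitly.

```python
QUESTIONS = 500
baseline_acc = 0.050
memtier_acc = 0.382

# Assumption: accuracy = correct answers / total questions.
baseline_correct = round(baseline_acc * QUESTIONS)
memtier_correct = round(memtier_acc * QUESTIONS)
gain_points = (memtier_acc - baseline_acc) * 100   # absolute gain in percentage points
ratio = memtier_acc / baseline_acc                 # relative improvement factor

print(baseline_correct, memtier_correct, round(gain_points, 1), round(ratio, 2))
# prints: 25 191 33.2 7.64
```

Under that assumption the baseline answers about 25 questions correctly and MEMTIER about 191, a gain of 33.2 percentage points.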

With DeepSeek-V4-Flash pre-population, single-section retrieval reaches 0.686 to 0.714, outperforming the BM25+GPT-4o RAG baseline. This positions MEMTIER as more than an academic exercise: it is a potential alternative to Pinecone/Weaviate stacks for tasks where an agent runs for days.

Why does this matter to developers?

A team building an autonomous agent for customer support, financial analysis, or research tasks has until now had to rely either on enterprise vector databases or on manual context curation. MEMTIER demonstrates that the combination of proper memory-layer segregation and adaptive weighting can significantly reduce hardware requirements.

How the asynchronous consolidation behaves under production load remains to be seen, but results on the public benchmark suggest the architecture is a serious candidate for the next generation of open-source agent frameworks.
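For readers unfamiliar with the pattern, consolidation outside the main loop can be sketched with a background worker and a queue. This is a generic illustration, not the authors' implementation: the queue, the `Counter` standing in for semantic memory, and the "count topics" consolidation step are all hypothetical.

```python
import queue
import threading
from collections import Counter

episode_queue = queue.Queue()
semantic_memory = Counter()      # hypothetical stand-in for a semantic store
_STOP = object()                 # sentinel that tells the worker to exit


def consolidate_forever():
    """Background worker: fold queued episodes into semantic memory."""
    while True:
        episode = episode_queue.get()
        if episode is _STOP:
            break
        # Toy "consolidation": count how often each topic was seen.
        semantic_memory[episode["topic"]] += 1


worker = threading.Thread(target=consolidate_forever, daemon=True)
worker.start()

# The agent's main loop only enqueues and moves on; it never waits on consolidation.
for topic in ["billing", "billing", "shipping"]:
    episode_queue.put({"topic": topic})

episode_queue.put(_STOP)
worker.join()                    # only for the demo, to read a final state
print(semantic_memory.most_common(1))  # prints: [('billing', 2)]
```

Under production load the open questions are the ones this toy version ignores, such as how fast the queue grows relative to the worker and what the agent reads while consolidation lags behind.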

Frequently Asked Questions

What problem does MEMTIER solve?

MEMTIER targets the 14-percentage-point drop in tool execution success over a 72-hour agent operation window, which classical RAG systems cannot prevent because they do not distinguish between short-term and long-term memory.

Can it run on consumer hardware?

Yes. The authors report results with a Qwen2.5-7B model on a consumer GPU configuration, which is notable compared to enterprise RAG setups.

How does it compare to classical BM25+GPT-4o RAG?

With DeepSeek-V4-Flash pre-population, MEMTIER reaches 0.686 to 0.714 on single-section retrieval and outperforms the BM25+GPT-4o baseline.

arXiv:2605.03675: MEMTIER — tiered memory architecture restores recall for long-running agents
