🔴 🤝 Agents Wednesday, April 22, 2026 · 4 min read

Google ReasoningBank: agents learn from experience without retraining, +8.3% success on WebArena

Editorial illustration: Robot in a maze with illuminated nodes representing learned experience

Why it matters

Google introduced ReasoningBank, a memory framework that lets AI agents learn from their own successes and failures without retraining the underlying language model. On the WebArena benchmark it achieved an 8.3% higher success rate, and on SWE-Bench-Verified a 4.6% gain while taking roughly three fewer steps per task.


Google Research introduced ReasoningBank, a new memory framework that lets AI agents learn from their own past attempts, both successful and unsuccessful, without retraining the language model. The result is a significant increase in success rates on two demanding benchmarks.

What happened?

ReasoningBank is a framework that functions as a “continuous closed loop of retrieval, extraction, and consolidation” — as described by its authors in Google’s research blog. Before an agent takes an action, it retrieves relevant memories from the bank; after completing a task, an LLM-as-a-judge evaluates the outcome and distills the lessons into a new memory entry.
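The closed loop described above can be sketched in a few lines of Python. Everything here is illustrative: the class and method names, the keyword-overlap retrieval (standing in for whatever embedding search the real system uses), and the hard-coded "distillation" that a real deployment would delegate to an LLM-as-a-judge.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    title: str        # concise name of the strategy
    description: str  # brief summary
    content: str      # distilled reasoning steps / operational insights

class ReasoningBankLoop:
    """Toy retrieve -> act -> judge -> consolidate loop (names are assumptions)."""

    def __init__(self):
        self.bank: list[MemoryItem] = []

    def retrieve(self, task: str, k: int = 3) -> list[MemoryItem]:
        # Naive keyword overlap stands in for embedding-based retrieval.
        scored = [
            (sum(w in m.title.lower() or w in m.content.lower()
                 for w in task.lower().split()), m)
            for m in self.bank
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:k] if score > 0]

    def consolidate(self, task: str, trajectory: list[str], success: bool) -> MemoryItem:
        # In the real system an LLM-as-a-judge labels the outcome and
        # distills the lesson; here we fake that step with string templates.
        kind = "strategy" if success else "guardrail"
        item = MemoryItem(
            title=f"{kind}: {task[:40]}",
            description=f"Lesson from a {'successful' if success else 'failed'} attempt",
            content=" -> ".join(trajectory),
        )
        self.bank.append(item)
        return item

loop = ReasoningBankLoop()
loop.consolidate("log in before checkout", ["open login page", "submit credentials"], True)
hits = loop.retrieve("checkout flow login")
print(len(loop.bank), len(hits))  # → 1 1
```

The essential property is that the loop runs entirely outside the model: memories are consolidated after each task and injected at retrieval time, with no weight updates anywhere.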

Each memory entry contains three parts: a concise title identifying the strategy, a brief descriptive summary, and distilled reasoning steps or operational insights drawn from past experience. This structure enables the agent to quickly search for and apply relevant strategies to new tasks.
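The three-part entry above maps naturally onto a small record type. This is a sketch under assumed field names, not the paper's actual schema; the `to_prompt` method shows one plausible way such an entry could be injected into the agent's context.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryEntry:
    title: str        # concise title identifying the strategy
    description: str  # brief descriptive summary
    content: str      # distilled reasoning steps or operational insights

    def to_prompt(self) -> str:
        # One plausible rendering for injection into the agent's context.
        return f"### {self.title}\n{self.description}\n{self.content}"

entry = MemoryEntry(
    title="Confirm authentication before protected actions",
    description="Avoids silent failures on pages that require login",
    content="1. Check for a logged-in indicator\n"
            "2. If absent, log in first\n"
            "3. Retry the original action",
)
print(entry.to_prompt().splitlines()[0])  # → ### Confirm authentication before protected actions
```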

The distinctive feature of ReasoningBank is its emphasis on learning from failure. While competing approaches such as Synapse store exhaustive action trajectories, and Agent Workflow Memory focuses only on successful attempts, ReasoningBank “distills errors into preventive lessons,” building what researchers call “strategic guardrails.”

Why does this matter?

On the WebArena benchmark, the standard for web navigation, ReasoningBank achieved an 8.3% higher success rate than agents without memory. On SWE-Bench-Verified, a demanding benchmark of software engineering tasks drawn from real GitHub repositories, the gain was 4.6%, with roughly three fewer steps per task.

The key practical point is that these gains require no changes to model weights. Teams can layer ReasoningBank on top of existing LLMs (Gemini, GPT, Claude) without expensive fine-tuning and without modifying the provider-hosted model itself.

For enterprise applications, this opens the door to agents that improve during deployment — every incident, every failed action becomes learning material rather than just a log statistic. This is a direct embodiment of what the industry has long called for: agents that accumulate institutional knowledge.

The research team is led by Jun Yan and Chen-Yu Lee from Google Cloud, along with 15 additional researchers including Siru Ouyang, Jiawei Han, and Tomas Pfister.

How does ReasoningBank differ from previous approaches?

Until now there have been two main approaches to agent memory. The first, Synapse, stores exhaustive action trajectories: every click, every input, every tool response. The problem is that these memories quickly become too specific to a single task and transfer poorly to new situations.

The second approach, Agent Workflow Memory, focuses only on successful trajectories — the agent learns what works, but not why something fails. ReasoningBank argues this is limiting because agents fail more often than they succeed, so the greatest room for improvement lies precisely in learning from failures.

The third difference is the level of abstraction. Instead of storing raw actions or results, ReasoningBank distills reasoning patterns — “strategies.” This means that memory from tasks on one website can help on an entirely different website because the strategy (“first confirm authentication, then check rate limit, then execute the action”) transfers across domains.
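A toy illustration of why the abstraction level matters: a raw trajectory is welded to one site's selectors and credentials, while a distilled strategy contains no site-specific tokens at all. The heuristic check below is purely illustrative, not anything from the paper.

```python
# A Synapse-style raw trajectory, tied to one site's CSS selectors.
raw_trajectory = [
    ("click", "#login"),
    ("type", "user@example.com"),
    ("click", "#submit"),
]

# A ReasoningBank-style distilled strategy: no selectors, no credentials.
distilled_strategy = (
    "First confirm authentication, then check the rate limit, "
    "then execute the action"
)

def is_transferable(memory: str) -> bool:
    # Crude proxy: site-specific tokens (CSS selectors, emails) pin a
    # memory to one domain; their absence suggests it can travel.
    return "#" not in memory and "@" not in memory

print(is_transferable(distilled_strategy), is_transferable(str(raw_trajectory)))  # → True False
```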

What’s next?

Alongside the framework itself, Google also introduced MaTTS (Memory-Aware Test-Time Scaling) — a technique that uses memory to scale at inference time through two approaches: parallel exploration (generating multiple trajectories in parallel) and sequential refinement (iteratively improving a single trajectory). This addition is particularly interesting because it shows that memory and compute scaling are not competing mechanisms but synergistic ones.
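The two MaTTS modes can be sketched as a pair of generic search loops. This is a minimal sketch under stated assumptions: `score` stands in for whatever judge or verifier ranks trajectories, and the trajectories themselves are reduced to lists of numbers purely for illustration.

```python
import random

def score(trajectory: list[int]) -> int:
    # Stand-in for an LLM-as-a-judge or task verifier.
    return sum(trajectory)

def parallel_exploration(sample, n: int = 4) -> list[int]:
    # Generate several trajectories in parallel and keep the best one.
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

def sequential_refinement(trajectory: list[int], refine, steps: int = 4) -> list[int]:
    # Iteratively improve a single trajectory, keeping only strict gains.
    for _ in range(steps):
        candidate = refine(trajectory)
        if score(candidate) > score(trajectory):
            trajectory = candidate
    return trajectory

rng = random.Random(0)
sample = lambda: [rng.randint(0, 9) for _ in range(3)]
refine = lambda t: [min(9, x + rng.randint(0, 1)) for x in t]

best = parallel_exploration(sample)
refined = sequential_refinement(best, refine)
print(score(refined) >= score(best))  # → True
```

The synergy claim then has a simple reading: a richer memory bank gives both modes better starting trajectories to explore or refine, so extra inference-time compute is spent more productively.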

The next step will be integrating ReasoningBank into Google's product agents, most likely the Gemini Deep Research agent and Google's coding tools. A paper with the detailed methodology is expected on preprint platforms such as arXiv in the coming weeks, along with an open-source reference implementation.

For anyone building their own agents, the key lesson is that storing "what went well" is not enough; the real value lies in analyzing errors and distilling transferable reasoning patterns rather than raw action trajectories. ReasoningBank is the first publicly described framework to do this systematically, but the pattern will likely be replicated quickly in the ecosystems around Claude, GPT, and open-source models. For teams experimenting with agents, it is a signal that memory architecture is becoming as important as the choice of LLM itself.

🤖

This article was generated using artificial intelligence from primary sources.