arXiv:2605.16233: FORGE — AI agents develop shared memory without fine-tuning

arXiv:2605.16233 presents FORGE, a method by which LLM agents build shared memory through population-based experience sharing, without any model weight updates. On the CybORG CAGE-2 network defense task it achieves a 1.7–7.7× improvement over the zero-shot baseline, with particularly pronounced gains for weaker models.

This article was generated using artificial intelligence from primary sources.

A research team from Carleton University and the Canadian Department of National Defence has published FORGE (Failure-Optimized Reflective Graduation and Evolution), a system in which LLM agents collectively build and share memory without a single model parameter being changed. On the benchmark network defense task, it improves on the zero-shot baseline by a factor of 1.7 to 7.7.

The problem: expensive learning at the cost of flexibility

The standard approach to improving LLM agents is fine-tuning, in which gradient descent updates billions of neural network weights on a task-specific dataset. This requires GPU hours and labeled examples, and it freezes the model's knowledge at training time. Each new domain or task requires a new training round.

FORGE takes a different path: instead of modifying the model itself, it builds shared memory — a common textual base of rules and demonstrations that is inserted into agent prompts in natural language form.
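To make the idea of memory inserted into prompts concrete, here is a minimal Python sketch. The `SharedMemory` class, the `build_prompt` function, the section labels, and the sample rule are all hypothetical illustrations, not the paper's actual prompt format.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Plain-text knowledge base shared by a population (hypothetical structure)."""
    rules: list[str] = field(default_factory=list)     # short textual heuristics
    examples: list[str] = field(default_factory=list)  # demonstrations of successful moves


def build_prompt(task: str, observation: str, memory: SharedMemory) -> str:
    """Insert the shared memory into the prompt as natural language.

    Nothing about the model changes; only the text it is shown does.
    """
    parts = [task]
    if memory.rules:
        parts.append("Rules learned from earlier episodes:")
        parts.extend(f"- {rule}" for rule in memory.rules)
    if memory.examples:
        parts.append("Examples of successful moves:")
        parts.extend(f"- {example}" for example in memory.examples)
    parts.append(f"Current observation: {observation}")
    return "\n".join(parts)


# Illustrative rule text, invented for this sketch.
memory = SharedMemory(rules=["Restore a host once it is flagged as compromised."])
prompt = build_prompt(
    "You defend a small network.",
    "Host A shows suspicious activity.",
    memory,
)
print(prompt)
```

Because the memory is just text, swapping in a different rule set for a new domain would mean changing the `SharedMemory` contents rather than retraining anything.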

How FORGE bypasses fine-tuning

The system operates in two coupled loops. The inner loop reflects on failed episodes and distills them into reusable knowledge artifacts: textual heuristics (Rules) or concrete demonstrations of successful moves (Examples). The outer loop then propagates the memory of the best-performing agent to the entire population between development phases, while agents that have converged are “graduated” and frozen.
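The two loops can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the environment (`run_episode`) and the reflection step (`reflect`) are stubs standing in for the CAGE-2 simulator and an LLM call, and the failure threshold and convergence test are hypothetical choices.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    memory: list[str] = field(default_factory=list)    # Rules or Examples as plain text
    scores: list[float] = field(default_factory=list)  # mean episode return per phase
    graduated: bool = False                            # frozen once converged


def run_episode(agent: Agent, rng: random.Random) -> float:
    """Stub environment (assumption): each memory entry raises the expected return."""
    return -10.0 + 2.0 * len(agent.memory) + rng.uniform(-1.0, 1.0)


def reflect(agent: Agent, episode_return: float) -> str:
    """Stand-in for the LLM call that turns a failed episode into a text artifact."""
    return f"Rule from {agent.name}: avoid the move that led to return {episode_return:.1f}"


def inner_loop(agent: Agent, rng: random.Random, episodes: int = 5,
               failure_threshold: float = -5.0) -> None:
    """Run one phase; every failed episode adds one artifact to the agent's memory."""
    returns = []
    for _ in range(episodes):
        episode_return = run_episode(agent, rng)
        returns.append(episode_return)
        if episode_return < failure_threshold:
            agent.memory.append(reflect(agent, episode_return))
    agent.scores.append(sum(returns) / len(returns))


def outer_loop(population: list[Agent], rng: random.Random, phases: int = 4,
               tolerance: float = 0.5) -> None:
    """Between phases, broadcast the best memory and graduate converged agents."""
    for _ in range(phases):
        active = [agent for agent in population if not agent.graduated]
        if not active:
            break
        for agent in active:
            inner_loop(agent, rng)
        best = max(active, key=lambda agent: agent.scores[-1])
        for agent in active:
            if agent is not best:
                agent.memory = list(best.memory)  # copy text only, no weight updates
            # Hypothetical convergence test: score barely moved since the last phase.
            if len(agent.scores) >= 2 and abs(agent.scores[-1] - agent.scores[-2]) < tolerance:
                agent.graduated = True


population = [Agent(f"agent-{i}") for i in range(3)]
outer_loop(population, random.Random(0))
for agent in population:
    print(agent.name, [round(s, 2) for s in agent.scores], len(agent.memory), agent.graduated)
```

In this toy setup every agent ends up holding the same memory after the first broadcast, which is the point of the outer loop: an artifact discovered by one agent becomes available to all of them in the next phase.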

The key mechanism is population broadcast: knowledge does not remain trapped in a single agent but is shared collectively. Researchers tested Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick and Qwen3-235B in the simulated CybORG CAGE-2 environment, a stochastic POMDP network defense task with a 30-step horizon in which a defender responds to the B-line attacker.

Results: weaker models have the most to gain

FORGE achieves 29–72% better performance than the isolated Reflexion baseline and reduces the catastrophic error rate to around 1%, whereas the zero-shot baseline incurs strongly negative rewards. Notably, the Rules variant uses ~40% fewer tokens while achieving comparable results, and the Examples variant performs best on three of the four tested models.
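One plausible way to read a "catastrophic error rate" is as the share of episodes whose return falls below some threshold. The sketch below assumes that definition; the threshold and the sample returns are made up for illustration and are not taken from the paper.

```python
def catastrophic_rate(episode_returns: list[float], threshold: float) -> float:
    """Fraction of episodes whose return falls below a catastrophic threshold."""
    if not episode_returns:
        raise ValueError("need at least one episode")
    bad = sum(1 for r in episode_returns if r < threshold)
    return bad / len(episode_returns)


# Made-up returns for illustration only; these are not results from the paper.
returns = [-12.0, -9.5, -150.0, -8.0, -11.0, -7.5, -10.0, -9.0, -13.0, -8.5]
rate = catastrophic_rate(returns, threshold=-100.0)
print(f"{rate:.0%} of episodes were catastrophic")
```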

Particularly relevant is the finding that weaker base models benefit disproportionately: FORGE effectively compensates for the limited capabilities of a smaller model through collectively built population experience. This opens the door to applications where deploying a more powerful model is too expensive or too slow, but domain knowledge can be encapsulated in shared memory.

The paper suggests that for specialized domains like cybersecurity defense, population memory may be a more effective alternative to expensive fine-tuning — especially when domain rules change rapidly.

Frequently Asked Questions

What is FORGE?
FORGE (Failure-Optimized Reflective Graduation and Evolution) is a method for building LLM agent memory. Instead of changing model parameters, it builds a textual memory of rules and examples that is inserted into agent prompts and shared across the entire population.

Why don't agents need fine-tuning?
FORGE uses population-based experience sharing: when one agent in the group learns a useful heuristic or demonstration, that knowledge is propagated to all other agents through the shared memory between development phases. There are no gradient updates; knowledge stays in natural language, not in network weights.

On which models was FORGE tested?
Researchers tested Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick and Qwen3-235B. Weaker models showed proportionally greater improvement, suggesting FORGE can compensate for the limited capacity of the base model.