arXiv:2605.10344: TMAS Multi-Agent Reasoning Record

TMAS (Test-time Multi-Agent Scaling) is a new approach to test-time compute scaling that organizes LLM inference as a collaboration between specialized agents backed by a hierarchical memory bank. The authors (UC Berkeley and DeepMind) report that it outperforms the Best-of-N, MCTS, and AutoTTS baselines on MATH-500, AIME 2024, HumanEval, and GPQA Diamond at the same compute budget. It combines reasoning, retrieval, and verification in a single pipeline.

What is TMAS and why does it matter now?

TMAS is an architecture that organizes test-time compute scaling as a collaboration of specialized LLM agents. Traditional approaches (Best-of-N, Tree-of-Thoughts, MCTS) treat a single model as a monolithic reasoner. TMAS instead divides the problem into roles: the reasoner generates the solution step by step, the retriever fetches relevant context from a memory bank, and the verifier checks intermediate steps. All three agents share the same base LLM but receive different system prompts, each focused on its own subtask.
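
The idea of one base model playing several roles can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the prompt texts, the `Agent` class, `build_agents`, and the `echo_llm` stand-in are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical role prompts; the paper's actual prompts are not quoted in this article.
ROLE_PROMPTS = {
    "reasoner": "Generate the next reasoning step for the problem.",
    "retriever": "Fetch context from the memory bank relevant to the current step.",
    "verifier": "Check the given intermediate step and flag errors.",
}

# An LLM is modelled as a function: (system_prompt, user_message) -> reply.
LLM = Callable[[str, str], str]


@dataclass
class Agent:
    role: str
    system_prompt: str
    llm: LLM

    def run(self, message: str) -> str:
        # Every agent calls the same model; only the system prompt differs.
        return self.llm(self.system_prompt, message)


def echo_llm(system_prompt: str, user_message: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[{system_prompt[:20]}...] {user_message}"


def build_agents(llm: LLM) -> Dict[str, Agent]:
    """Create one agent per role, all backed by the same base model."""
    return {role: Agent(role, prompt, llm) for role, prompt in ROLE_PROMPTS.items()}


agents = build_agents(echo_llm)
print(agents["verifier"].run("Step 2: x = 4"))
```

Swapping `echo_llm` for a real client call would leave the rest of the structure unchanged, which is the point of keeping the role definition separate from the model.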

Why this matters now: test-time scaling has become the dominant paradigm for improving reasoning since OpenAI's o1 showed that chain-of-thought with “thinking time” can yield better results than simply using a bigger model. AutoTTS (published today, May 11, as arXiv:2605.08083) showed that agentic discovery can find optimal TTS strategies on a $39.9 compute budget. TMAS now generalizes this approach: rather than discovering the strategy, it explicitly structures inference as multi-agent collaboration.

What are the concrete results and how do they compare with baselines?

The authors test TMAS on four benchmarks. MATH-500: TMAS with GPT-4o-mini as the base model reaches 78.4% accuracy vs. 71.2% for the baseline (Best-of-32). AIME 2024: TMAS 56.7% vs. baseline 43.3%. HumanEval: TMAS 92.1% vs. baseline 88.9%. GPQA Diamond: TMAS 49.8% vs. baseline 40.5%. All results are at the same compute budget (measured in FLOPs), so the authors attribute the gain to the structural reorganization of inference rather than to additional compute.

Particularly interesting is the result on GPQA Diamond, a benchmark of PhD-level science questions. The gain there is 9.3 percentage points, second only to AIME 2024 (13.4 points) and well above MATH-500 (7.2 points) and HumanEval (3.2 points), which suggests that TMAS helps most on harder problems. The likely reason: on easy problems a single agent already does well, and TMAS adds value when the problem requires a combined retrieval, reasoning, and verification approach.
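
The per-benchmark gains can be recomputed directly from the accuracies quoted above. The short script below does only that arithmetic; the `gains` helper is a name introduced here for illustration, and the numbers are the ones reported in this article.

```python
# Accuracies (%) as quoted in this article: (TMAS, baseline).
results = {
    "MATH-500": (78.4, 71.2),
    "AIME 2024": (56.7, 43.3),
    "HumanEval": (92.1, 88.9),
    "GPQA Diamond": (49.8, 40.5),
}


def gains(table):
    """Return (absolute gain in percentage points, relative gain in %) per benchmark."""
    out = {}
    for name, (tmas, base) in table.items():
        absolute = round(tmas - base, 1)
        relative = round(100 * (tmas - base) / base, 1)
        out[name] = (absolute, relative)
    return out


# Print benchmarks ordered by absolute gain, largest first.
for name, (abs_pp, rel) in sorted(gains(results).items(), key=lambda kv: -kv[1][0]):
    print(f"{name:13s} +{abs_pp:.1f} pp  (+{rel:.1f}% relative)")
```

The relative gain is included alongside the absolute one because the baselines differ widely: a few points on top of an 88.9% baseline and a few points on top of a 40.5% baseline are not the same kind of improvement.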

How does TMAS work technically?

The pipeline has three phases. Phase 1 — Decomposition: the main controller agent breaks the problem into subtasks and assigns them to the reasoner agent. Phase 2 — Solve loop: the reasoner generates a step, queries the memory bank for relevant context, receives it from the retriever, and generates the next step. The verifier continuously checks intermediate steps and flags those that fail sanity checks. Phase 3 — Synthesis: the controller assembles the verified steps into a final answer.
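
The three phases can be sketched as a plain control loop. This is a simplified illustration under stated assumptions, not the authors' code: the agents are passed in as ordinary functions, `run_pipeline` and `max_retries` are hypothetical, and the retry-then-drop policy for failed steps is a choice made for this example.

```python
from typing import Callable, List


def run_pipeline(
    problem: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], str],
    reason: Callable[[str, str], str],
    verify: Callable[[str], bool],
    synthesize: Callable[[List[str]], str],
    max_retries: int = 2,
) -> str:
    # Phase 1: the controller splits the problem into subtasks.
    subtasks = decompose(problem)

    # Phase 2: solve loop. For each subtask, fetch context, generate a step,
    # and keep the step only if the verifier accepts it.
    verified: List[str] = []
    for task in subtasks:
        for _ in range(max_retries + 1):
            context = retrieve(task)
            step = reason(task, context)
            if verify(step):
                verified.append(step)
                break
        # A step that never passes is dropped here (assumed policy).

    # Phase 3: the controller assembles the verified steps into an answer.
    return synthesize(verified)


# Toy agents so the sketch runs without a model.
toy_answers = {"3*4": "12", "2+12": "14"}
answer = run_pipeline(
    "2+3*4",
    decompose=lambda p: ["3*4", "2+12"],
    retrieve=lambda t: "multiplication binds tighter than addition",
    reason=lambda t, ctx: f"{t} = {toy_answers[t]}",
    verify=lambda s: "=" in s,
    synthesize=lambda steps: steps[-1].split("= ")[1],
)
print(answer)
```

Passing the agents as functions keeps the control flow testable on its own; in a real system each callable would wrap a model call with the corresponding role prompt.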

The hierarchical memory bank is the key innovation. Standard LLM context is flat — all relevant information must fit in a single prompt. TMAS uses a bank with three levels: episodic (current problem state), semantic (domain knowledge retrieved from a vector DB), and procedural (successful strategies from past problems). The retriever agent autonomously decides which level to consult.
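
A three-level bank of this kind could be modelled as follows. This is a minimal sketch, assuming an in-memory store and word-overlap ranking in place of a real vector DB; `MemoryBank`, `choose_level`, and the routing keywords are hypothetical, and in TMAS the level choice is made by the retriever agent rather than by fixed rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List

LEVELS = ("episodic", "semantic", "procedural")


@dataclass
class MemoryBank:
    store: Dict[str, List[str]] = field(
        default_factory=lambda: {level: [] for level in LEVELS}
    )

    def write(self, level: str, item: str) -> None:
        if level not in self.store:
            raise KeyError(f"unknown level: {level}")
        self.store[level].append(item)

    def read(self, level: str, query: str, k: int = 3) -> List[str]:
        """Return up to k items ranked by word overlap with the query.

        Word overlap is a stand-in for the vector search a real system would use.
        """
        words = set(query.lower().split())
        scored = [
            (len(words & set(item.lower().split())), item)
            for item in self.store[level]
        ]
        scored = [(s, item) for s, item in scored if s > 0]
        scored.sort(key=lambda pair: -pair[0])
        return [item for _, item in scored[:k]]


def choose_level(query: str) -> str:
    """Hypothetical keyword routing; a placeholder for the retriever's decision."""
    q = query.lower()
    if "so far" in q or "current" in q:
        return "episodic"
    if "how to" in q or "strategy" in q:
        return "procedural"
    return "semantic"


bank = MemoryBank()
bank.write("semantic", "the derivative of sin is cos")
bank.write("procedural", "strategy: try induction on n")
query = "derivative of sin"
print(bank.read(choose_level(query), query))
```

Keeping the levels as separate lists makes the contrast with a flat context concrete: only the level the retriever selects is searched and passed on to the reasoner.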

What does this mean for production applications?

For enterprise teams building reasoning agents (legal AI, medical diagnosis assistants, scientific research copilots), the TMAS approach is attractive because it addresses a known problem: it is difficult to coax a single large model into productive self-collaboration. A multi-agent setup with distinct roles maps naturally onto human teamwork, which makes the system easier to debug and interpret.

Open question: latency. TMAS spends more compute per query than a single-pass, single-agent call, and its solve loop runs step by step, both of which increase latency. The authors report response times 3–5× slower than Best-of-N, which is acceptable for batch reasoning but not for interactive chatbots. For real-time agents (e.g., a coding assistant predicting next-line completions), TMAS is not yet practical.

Frequently Asked Questions

What is test-time compute scaling?

Test-time compute scaling is a technique that improves the quality of LLM responses by spending more compute at inference time rather than at training time. Examples: Best-of-N sampling (generating N responses and selecting the best), Tree-of-Thoughts (exploring a tree of possible reasoning steps), MCTS (Monte Carlo Tree Search). TMAS builds on this line of work.
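
Best-of-N, the simplest of these methods, fits in a few lines. The sketch below is illustrative only: `generate` and `score` are placeholders for a sampling model call and a reward model or verifier, and the toy versions exist just so the example runs offline.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Sample n candidate answers and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    best = max(candidates, key=lambda c: score(prompt, c))
    return best, candidates


# Toy stand-ins: the "model" guesses an integer, the "scorer" prefers guesses near 12.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return str(rng.randint(0, 20))


def toy_score(prompt: str, candidate: str) -> float:
    return -abs(int(candidate) - 12)


best, candidates = best_of_n("What is 3 * 4?", toy_generate, toy_score, n=8)
print(candidates, "->", best)
```

Raising `n` spends more inference compute for a better chance that at least one candidate is good, which is the basic trade-off all test-time scaling methods build on.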

How does TMAS improve on existing baselines?

TMAS introduces three key innovations: (1) specialized agents for different roles (reasoner, retriever, verifier), (2) a hierarchical memory bank that retains intermediate results across reasoning steps, (3) emergent coordination, in which agents learn to communicate without an explicit protocol. Result: at the same compute budget, TMAS scores roughly 3 to 13 percentage points higher than the baseline across the four benchmarks (from 3.2 points on HumanEval to 13.4 on AIME 2024).
