arXiv:2605.16217 Argus: evidence assembly architecture for deep research agents achieves +12.7pp with 8 parallel searchers

Editorial illustration: knowledge graph with evidence nodes and parallel searcher agents around a central navigator.

Argus is an arXiv paper published on May 15, 2026 by Zhen Zhang, Liangcai Su, Zhuo Chen, and colleagues that presents an evidence assembly framework for deep research agents. The system pairs two agents: a Searcher that produces ReAct-style traces, and a Navigator that maintains a shared evidence graph and synthesizes answers with a reinforcement learning policy. The paper reports gains of +5.5pp over baseline with a single Searcher and +12.7pp with 8 parallel Searchers, plus a score of 86.2 on BrowseComp with 64 parallel Searchers, all without exceeding context limits.

This article was generated using artificial intelligence from primary sources.

Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, and Xinyu Wang published a paper on arXiv on May 15, 2026 presenting the Argus framework for evidence assembly in deep research agents, an approach aimed at the redundancy problem of parallel search agents.

What is the redundancy problem in parallel search agents?

Current state-of-the-art deep research systems (Perplexity Deep Research, OpenAI Deep Research, GPT-5 Research mode) typically use parallel rollouts — multiple model instances simultaneously exploring the same query.

The problem: rollouts duplicate effort. Three parallel agents often:

  • Search the same sources
  • Cite identical documents
  • Arrive at convergent but not complementary insights

Practical consequences: token cost grows linearly with the number of rollouts, but information gain does not. As an illustration, 8× parallelism might yield only a 2–3× improvement, far from linear scaling.
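The duplicated effort can be made concrete by counting how many citations across rollouts point at a source that some rollout has already cited. The following is a minimal sketch with made-up rollout data and a hypothetical helper name (`redundancy_ratio`); none of it comes from the paper.

```python
def redundancy_ratio(rollouts: list[list[str]]) -> float:
    """Fraction of all citations that repeat an already-cited source."""
    total = sum(len(r) for r in rollouts)
    if total == 0:
        return 0.0
    unique = len({url for r in rollouts for url in r})
    return 1.0 - unique / total


# Hypothetical data: three rollouts answering the same query.
rollouts = [
    ["a.example/1", "b.example/2", "c.example/3"],
    ["a.example/1", "b.example/2", "d.example/4"],
    ["a.example/1", "c.example/3", "d.example/4"],
]

# 9 citations in total, only 4 distinct sources.
print(f"{redundancy_ratio(rollouts):.2f}")
```

In this toy case more than half of the citations are repeats, which is the pattern the list above describes: cost is paid nine times for four distinct sources.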

What does the evidence assembly architecture specifically do?

Argus reframes the problem: deep research as puzzle assembly. Instead of each Searcher trying to solve the entire problem independently, the framework divides responsibility:

Searcher (ReAct-style trace collector)

  • Conducts ReAct-style interactions for sub-queries assigned by the Navigator
  • Collects evidence traces: pieces of information relevant to the sub-query
  • Returns structured evidence to the shared graph

Navigator (shared evidence graph + RL synthesis)

  • Maintains a shared evidence graph across all Searchers
  • Identifies missing pieces: places where the evidence graph has gaps or unreliable connections
  • Dispatches new Searchers for targeted exploration
  • Synthesizes the final answer through a reinforcement learning policy

The key difference: parallelism is designed not to create redundancy, because each Searcher receives a distinct sub-query from the Navigator, which sees the entire evidence state. Each new Searcher is meant to add a new piece, not a duplicate.
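The division of labor can be illustrated with a toy dispatch loop. This is only a sketch under strong assumptions: the class and function names (`EvidenceGraph`, `searcher`, `navigate`) are hypothetical, the Searcher is a stub that returns canned text, Searchers run sequentially, and the reinforcement learning synthesis step is omitted entirely. It is not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceGraph:
    """Shared store mapping each sub-query to the evidence found so far."""
    evidence: dict[str, list[str]] = field(default_factory=dict)

    def add(self, sub_query: str, snippets: list[str]) -> None:
        self.evidence.setdefault(sub_query, []).extend(snippets)

    def gaps(self, required: list[str]) -> list[str]:
        # Simplification: a "gap" is a required sub-query with no evidence yet.
        return [q for q in required if not self.evidence.get(q)]


def searcher(sub_query: str) -> list[str]:
    # Stand-in for a ReAct-style search trace; returns canned evidence.
    return [f"evidence for {sub_query!r}"]


def navigate(required: list[str], max_rounds: int = 3) -> EvidenceGraph:
    graph = EvidenceGraph()
    for _ in range(max_rounds):
        missing = graph.gaps(required)
        if not missing:
            break
        # Each Searcher gets a distinct missing sub-query, so none repeats another.
        for sub_query in missing:
            graph.add(sub_query, searcher(sub_query))
    return graph


graph = navigate(["who founded X", "when was X acquired", "who acquired X"])
print(len(graph.evidence))
```

The point of the sketch is the control flow: the component that holds the whole evidence state decides what is still missing, and Searchers are only ever sent after those missing pieces.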

What benchmark results does the paper report?

The paper reports numbers for three scaling configurations. The first two are gains over the baseline; the third is an absolute BrowseComp score:

Configuration | Reported result
Single Searcher | +5.5 percentage points over baseline
8 parallel Searchers | +12.7 percentage points over baseline
64 parallel Searchers | 86.2 on BrowseComp

According to the paper, the BrowseComp score of 86.2 with 64 parallel Searchers “surpasses every proprietary agent” it was compared against. This is a significant signal because BrowseComp is a widely used benchmark for web research agents. The claim is limited to the agents the paper benchmarked, so it should not be read as a result against systems such as Perplexity Deep Research, GPT-5 Research, Claude Research mode, or Google Gemini Deep Research unless they appear in the paper's comparison.

How does context stay manageable with 64 parallel agents?

A common objection to parallel multi-agent systems is context explosion. If each Searcher generates an evidence trace of 2–5K tokens, 64 parallel Searchers produce 128–320K tokens, which would strain or exceed the context window of many models.
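The estimate above is simple multiplication, shown here for checking. The per-trace range is the article's own assumption, not a figure from the paper.

```python
SEARCHERS = 64
TRACE_TOKENS_LOW = 2_000   # assumed lower bound per Searcher trace
TRACE_TOKENS_HIGH = 5_000  # assumed upper bound per Searcher trace

low = SEARCHERS * TRACE_TOKENS_LOW
high = SEARCHERS * TRACE_TOKENS_HIGH
print(f"{low:,} to {high:,} tokens")  # 128,000 to 320,000 tokens
```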

Argus’s answer: the Navigator's reasoning context stays under 21.5K tokens as the number of Searchers grows. The abstract does not detail the technique; plausible mechanisms include:

  • Selective evidence projection — the Navigator reads not raw Searcher outputs but a structured graph representation
  • Compression at the graph level — nodes and edges are compact, not full text
  • Hierarchical summarization — Searcher outputs are summarized before graph integration
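To show why a graph-level view can be much smaller than the raw traces, here is a toy projection. It is a speculative sketch in the same spirit as the list above: the `EvidenceNode` fields, the `project` function, and the word-count token proxy are all assumptions for illustration, not the mechanism described in the paper.

```python
from dataclasses import dataclass


@dataclass
class EvidenceNode:
    node_id: str
    claim: str      # one-sentence summary of what a Searcher found
    source: str     # short source identifier, not the page text
    raw_trace: str  # full Searcher trace, kept out of the Navigator prompt


def project(nodes: list[EvidenceNode], edges: list[tuple[str, str, str]]) -> str:
    """Render the graph as compact lines; raw traces are deliberately omitted."""
    lines = [f"{n.node_id}: {n.claim} [{n.source}]" for n in nodes]
    lines += [f"{src} -{rel}-> {dst}" for src, rel, dst in edges]
    return "\n".join(lines)


def rough_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words, not a real tokenizer.
    return len(text.split())


# Hypothetical evidence with long raw traces attached.
nodes = [
    EvidenceNode("n1", "X was founded in 2009", "src-a", "search log " * 500),
    EvidenceNode("n2", "X was acquired by Y", "src-b", "search log " * 500),
]
edges = [("n1", "precedes", "n2")]

view = project(nodes, edges)
raw = sum(rough_tokens(n.raw_trace) for n in nodes)
print(rough_tokens(view), "vs", raw)
```

In this setup the Navigator's view grows with the number of nodes and edges, each a short line, rather than with the length of every Searcher trace.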

35B-A3B MoE backbone

Argus is built on a 35B-A3B Mixture-of-Experts (MoE) backbone: 35 billion total parameters, of which about 3 billion are active per token. Concrete implications:

  • Cost-efficient inference: only about 3B parameters are active per token, so per-token compute is roughly a tenth of a dense 35B model's, although all 35B parameters must still be held in memory
  • Specialized expertise: different experts in the MoE may specialize for different research domains
  • Scalable architecture: capacity can grow by adding experts without a proportional increase in per-token compute
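The "roughly 10×" figure follows from the parameter counts in the backbone name. The calculation below assumes per-token compute scales with active parameters and ignores routing overhead, attention cost, and memory footprint, so it is a back-of-the-envelope estimate only.

```python
TOTAL_PARAMS = 35e9   # "35B" in the backbone name
ACTIVE_PARAMS = 3e9   # "A3B": parameters active per token

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
ratio = TOTAL_PARAMS / ACTIVE_PARAMS
print(f"active fraction: {active_fraction:.1%}, ratio: {ratio:.1f}x")
```

The ratio comes out a little above 10, consistent with the rounded figure in the text.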

What does this mean for the deep research industry?

Argus results raise several important questions:

  • Proprietary moat eroded: if an openly published method achieves BrowseComp 86.2 with 64 parallel agents, what is the moat of Perplexity/OpenAI Deep Research?
  • Cost dynamics shift: 64 parallel Searchers sounds expensive, but with only about 3B active parameters per token in a MoE, total cost may be lower than a single frontier model rollout
  • Scaling without retraining: the paper notes that the framework supports scaling “with a single Searcher or many in parallel without retraining”, which matters for production deployment where load varies

The paper fits into the 2026 trend of agentic system architecture papers challenging proprietary leaders: GraphFlow (May 15, formal verification), Dual-Dimensional Consistency (May 14, 10× token reduction), CAST (May 14, +5.85pp tool use). All share the conclusion that architectural design can matter more than raw model scaling for production agentic workloads.

Frequently Asked Questions

What does the evidence assembly architecture specifically do?

Argus treats deep research as puzzle assembly. The Searcher conducts ReAct-style interactions and collects evidence traces for sub-queries. The Navigator maintains a shared evidence graph, identifies missing pieces, dispatches new Searchers, and synthesizes the final answer through reinforcement learning. The system works with 1, 8, or 64 parallel Searchers without retraining.

What benchmark results does the paper report?

The single-Searcher configuration achieves +5.5 percentage points above baseline, and 8 parallel Searchers achieve +12.7 percentage points. With 64 parallel Searchers, Argus scores 86.2 on the BrowseComp benchmark, surpassing every proprietary research agent tested. The Navigator reasoning context stays below 21.5K tokens despite scaling.