arXiv Argus: 86.2 BrowseComp with 64 parallel searchers

Q: What does the evidence assembly architecture specifically do?

Argus treats deep research as puzzle assembly. The Searcher conducts ReAct-style interactions and collects evidence traces for sub-queries. The Navigator maintains a shared evidence graph, identifies missing pieces, dispatches new Searchers, and synthesizes the final answer through a reinforcement learning policy. The system works with 1, 8, or 64 parallel Searchers without retraining.

Q: What benchmark results does the paper report?

A single Searcher gains 5.5 percentage points over the baseline, and 8 parallel Searchers gain 12.7 percentage points. With 64 parallel Searchers, Argus reaches 86.2 on the BrowseComp benchmark, surpassing every proprietary research agent the paper tested. The Navigator's reasoning context stays below 21.5K tokens despite the scaling.

Argus is a new arXiv paper published on May 15, 2026 by Zhen Zhang, Liangcai Su, Zhuo Chen, and colleagues that presents an evidence assembly framework for deep research agents. The system uses a dual-agent architecture: a Searcher (ReAct-style traces) and a Navigator (shared evidence graph plus RL synthesis). It gains 5.5pp with a single Searcher and 12.7pp with 8 parallel Searchers, and scores 86.2 on BrowseComp with 64 parallel Searchers while keeping the Navigator's reasoning context under 21.5K tokens.

The full author list is Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, and Xinyu Wang. Their framework targets the redundancy problem of parallel search agents.

What is the redundancy problem in parallel search agents?

Current state-of-the-art deep research systems (Perplexity Deep Research, OpenAI Deep Research, GPT-5 Research mode) typically use parallel rollouts — multiple model instances simultaneously exploring the same query.

The problem: rollouts duplicate effort. Three parallel agents often:

Search the same sources
Cite identical documents
Arrive at convergent but not complementary insights

Practical consequence: token cost grows linearly with the number of rollouts, but information gain does not. 8× parallelism might yield only a 2–3× improvement, far from linear scaling.
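To make the gap concrete, the short sketch below computes what fraction of ideal linear scaling is realized. The 8× and 2–3× figures are the illustrative numbers from the paragraph above, not measurements from the paper, and `scaling_efficiency` is a hypothetical helper name.

```python
def scaling_efficiency(parallelism: int, gain: float) -> float:
    """Fraction of ideal linear scaling realized (1.0 means gain == parallelism)."""
    return gain / parallelism


# Illustrative figures from the text: 8x the rollouts, 2-3x the improvement.
for gain in (2.0, 3.0):
    eff = scaling_efficiency(8, gain)
    print(f"8x cost, {gain:.0f}x gain -> {eff:.1%} of linear scaling")
# 8x cost, 2x gain -> 25.0% of linear scaling
# 8x cost, 3x gain -> 37.5% of linear scaling
```

In other words, under these assumed numbers between 62.5% and 75% of the extra token spend buys no additional information.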

What does the evidence assembly architecture specifically do?

Argus reframes the problem: deep research as puzzle assembly. Instead of each Searcher trying to solve the entire problem independently, the framework divides responsibility:

Searcher (ReAct-style trace collector)

Conducts ReAct-style interactions for sub-queries assigned by the Navigator
Collects evidence traces — pieces of information relevant to the sub-query
Returns structured evidence to the shared graph

Navigator (graph maintainer + RL synthesizer)

Maintains a shared evidence graph across all Searchers
Identifies missing pieces — where the evidence graph has gaps or unreliable connections
Dispatches new Searchers for targeted exploration
Synthesizes the final answer through a reinforcement learning policy

The key design difference: parallelism is not supposed to create redundancy, because each Searcher receives a distinct sub-query from the Navigator, which sees the entire evidence state. Each new Searcher is meant to add a new piece, not a duplicate.
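The division of labor described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `EvidenceGraph`, `searcher`, and `navigate` are hypothetical names, the "graph" is reduced to a mapping from sub-query to evidence, the Searcher is a stub instead of a ReAct loop, and dispatch runs sequentially instead of in parallel.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvidenceGraph:
    """Shared state: sub-query -> evidence snippets (hypothetical, simplified layout)."""
    required: set[str]
    evidence: dict[str, list[str]] = field(default_factory=dict)

    def missing(self) -> list[str]:
        # A "missing piece" here is simply a sub-query with no evidence yet.
        return sorted(q for q in self.required if not self.evidence.get(q))

    def add(self, sub_query: str, snippets: list[str]) -> None:
        self.evidence.setdefault(sub_query, []).extend(snippets)


def searcher(sub_query: str) -> list[str]:
    """Stand-in for a ReAct-style Searcher; a real one would call search tools."""
    return [f"evidence for {sub_query!r}"]


def navigate(graph: EvidenceGraph, max_parallel: int, max_rounds: int = 10) -> int:
    """Dispatch Searchers on distinct missing sub-queries until no gaps remain."""
    dispatched = 0
    for _ in range(max_rounds):
        gaps = graph.missing()
        if not gaps:
            break
        # Each Searcher in a round gets a different gap, so none duplicates another.
        for sub_query in gaps[:max_parallel]:
            graph.add(sub_query, searcher(sub_query))
            dispatched += 1
    return dispatched


graph = EvidenceGraph(required={"who", "when", "where", "why", "how"})
total = navigate(graph, max_parallel=2)
print(total, graph.missing())  # 5 []
```

Because the Navigator assigns work from the shared state, the number of Searcher calls equals the number of gaps regardless of `max_parallel`; changing the parallelism only changes how many rounds are needed.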

What benchmark results does the paper report?

The paper reports numbers for three scaling configurations:

Configuration	Reported result
Single Searcher	+5.5 percentage points over baseline
8 Parallel Searchers	+12.7 percentage points over baseline
64 Parallel Searchers	86.2 on BrowseComp (absolute score)

A BrowseComp score of 86.2 with 64 parallel Searchers “surpasses every proprietary agent” benchmarked. This is a significant signal because BrowseComp is a widely used benchmark for web research agents. The claim covers only the proprietary agents the authors tested; whether that set includes Perplexity Deep Research, GPT-5 Research, Claude Research mode, or Google Gemini Deep Research should be checked against the paper's comparison table.

How does context stay manageable with 64 parallel agents?

The classic skeptical question about parallel multi-agent systems is context explosion. If each Searcher generates an evidence trace of 2–5K tokens, 64 Searchers produce 128–320K tokens in total, which would fill or exceed the context window of many models if concatenated.
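The arithmetic behind that concern is simple to check. The sketch below uses the per-trace range assumed in the paragraph above and compares the naive concatenated total with the 21.5K-token Navigator context reported for Argus; the constant names are illustrative.

```python
SEARCHERS = 64
TRACE_TOKENS = (2_000, 5_000)   # assumed per-Searcher trace size from the text
NAVIGATOR_CONTEXT = 21_500      # upper bound reported for the Navigator

# Naive approach: concatenate every raw trace into one context.
low, high = (SEARCHERS * t for t in TRACE_TOKENS)
print(low, high)  # 128000 320000

# How much larger the naive context is than what the Navigator actually reads.
print(round(low / NAVIGATOR_CONTEXT, 1), round(high / NAVIGATOR_CONTEXT, 1))  # 6.0 14.9
```

Under these assumptions the Navigator sees roughly 6 to 15 times fewer tokens than raw concatenation would require.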

Argus’s answer: the Navigator's reasoning context remains under 21.5K tokens despite scaling. The abstract does not detail the technique, but plausible mechanisms include:

Selective evidence projection — the Navigator reads not raw Searcher outputs but a structured graph representation
Compression at the graph level — nodes and edges are compact, not full text
Hierarchical summarization — Searcher outputs are summarized before graph integration
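As an illustration of the first two ideas, the sketch below projects a set of graph nodes into a compact text view that fits a token budget. It is a guess at the general shape of such a mechanism, not the method from the paper: `Node`, `estimate_tokens`, `project`, the confidence score, and the word-count token estimate are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    claim: str         # short summary of a Searcher finding
    confidence: float  # hypothetical reliability score in [0, 1]


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def project(nodes: list[Node], budget: int) -> list[str]:
    """Pick compact node lines, highest confidence first, within a token budget."""
    view: list[str] = []
    used = 0
    for node in sorted(nodes, key=lambda n: n.confidence, reverse=True):
        line = f"[{node.confidence:.2f}] {node.claim}"
        cost = estimate_tokens(line)
        if used + cost > budget:
            continue  # skip nodes that do not fit; smaller ones may still fit
        view.append(line)
        used += cost
    return view


nodes = [Node("a b c", 0.9), Node("d e f g h", 0.5), Node("i", 0.7)]
print(project(nodes, budget=6))  # ['[0.90] a b c', '[0.70] i']
```

The point of the sketch is that the Navigator's input size is bounded by `budget`, not by the number of Searchers feeding the graph.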

35B-A3B MoE backbone

Argus uses a 35B-A3B MoE (Mixture of Experts) backbone: 35 billion total parameters, of which about 3 billion are active per token. Concrete implications:

Cost-efficient inference: only about 3B parameters are active per token, so per-token compute is roughly an order of magnitude lower than for a dense 35B model, although all 35B parameters must still be held in memory
Specialized expertise: different experts in the MoE may specialize in different kinds of input, potentially including different research domains
Scalable architecture: adding experts increases total capacity without a proportional increase in per-token compute
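The cost ratio follows directly from the two parameter counts in the model name. The sketch below is back-of-the-envelope arithmetic only; it treats per-token compute as proportional to active parameters, which is a simplifying assumption that ignores routing overhead, attention cost, and memory bandwidth.

```python
TOTAL_PARAMS = 35e9   # "35B": total parameters across all experts
ACTIVE_PARAMS = 3e9   # "A3B": parameters active per token

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
dense_to_active = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"{active_fraction:.1%} of weights active per token")       # 8.6% of weights active per token
print(f"~{dense_to_active:.1f}x fewer active parameters than dense")  # ~11.7x fewer active parameters than dense
```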

What does this mean for the deep research industry?

Argus results raise several important questions:

Proprietary moat eroded: if a publicly described system achieves BrowseComp 86.2 with 64 parallel agents, what is the moat of Perplexity/OpenAI Deep Research?
Cost dynamics shift: 64 parallel Searchers sounds expensive, but with 3B active parameters in a MoE, total cost may be lower than a single frontier model rollout
Scaling without retraining: the paper notes that the framework supports scaling “with a single Searcher or many in parallel without retraining”, which is key for production deployment where load varies

The paper fits into the 2026 trend of agentic system architecture papers challenging proprietary leader positions: GraphFlow (May 15, formal verification), Dual-Dimensional Consistency (May 14, 10× token reduction), CAST (May 14, +5.85pp tool use). All point to the same conclusion: for production agentic workloads, architecturally smart approaches can outperform raw model scaling.

arXiv:2605.16217 Argus: evidence assembly architecture for deep research agents achieves +12.7pp with 8 parallel searchers