MoltBook is a platform on which over two million autonomous AI agents coexist. Researchers used it as a test environment for the first empirical evaluation of whether collective intelligence emerges spontaneously when agents are scaled to the millions.

What does the test measure?

The Superminds Test has three levels: joint reasoning, information synthesis, and basic interaction. Probing agents enter the society from outside, assign controlled tasks, and measure how its responses compare with those of individual models.

Why is the main finding negative?

The authors argue that the dominant constraint is 'extremely sparse and shallow interaction': threads rarely extend beyond a single reply, and most responses are generic or off-topic. Scale alone does not produce coordination among agents.

What does this mean for multi-agent systems in practice?

It shows that increasing the number of agents does not automatically improve collective performance. System designers must explicitly work on interaction architecture, incentives for building on others' outputs, and synthesis mechanisms — otherwise you get many parallel monologues.

Superminds Test: 2M agents without collective intelligence

Researchers from the University of Melbourne and the University of Maryland introduced the Superminds Test, a hierarchical framework for probing the collective intelligence of agent societies. A study on the MoltBook platform with over 2 million agents showed that the society does not outperform individual frontier models and that interactions remain very sparse and shallow.

The paper “Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents” (arXiv:2604.22452) delivers a finding that runs counter to the intuition of many in the multi-agent community. Its authors are Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li, Timothy Baldwin, and Tianyi Zhou.

What question did the authors want to answer?

The question is simple and radical: “Does collective intelligence emerge spontaneously from scale?” In other words, if you place millions of autonomous LLM agents on a single platform and let them communicate freely, will the society as a whole become smarter than any individual agent?

The question matters because many recent multi-agent systems implicitly assume the answer is yes: more agents are expected to bring better reasoning, richer information synthesis, and tighter coordination.

How did they measure it?

The authors introduce the Superminds Test, a hierarchical framework that does not evaluate agents in isolation but instead has probing agents actively test them within their own environment. The test has three levels:

Joint reasoning — can the society collectively solve a complex reasoning task?
Information synthesis — can it synthesize distributed information spread across multiple agents?
Basic interaction — can it even manage elementary coordination among a few participants?

Probing agents are controlled external actors who enter the community, assign tasks, and measure the responses.
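The three-level setup can be pictured as a small test harness. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the names `Level`, `ProbeTask`, `run_probes`, and `post_and_collect` are hypothetical, and the "society" is a toy function that returns canned replies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Level(Enum):
    JOINT_REASONING = "joint reasoning"
    INFORMATION_SYNTHESIS = "information synthesis"
    BASIC_INTERACTION = "basic interaction"


@dataclass
class ProbeTask:
    level: Level
    prompt: str
    # Hypothetical judge: decides from the collected replies whether the task was solved.
    is_solved: Callable[[List[str]], bool]


def run_probes(
    tasks: List[ProbeTask],
    post_and_collect: Callable[[str], List[str]],
) -> Dict[Level, float]:
    """Post each task into the society and return the success rate per level."""
    solved = {level: 0 for level in Level}
    total = {level: 0 for level in Level}
    for task in tasks:
        replies = post_and_collect(task.prompt)
        total[task.level] += 1
        if task.is_solved(replies):
            solved[task.level] += 1
    # Levels without any task are left out rather than reported as 0.
    return {level: solved[level] / total[level] for level in Level if total[level]}


# Toy stand-in for a society that only produces generic replies.
def generic_society(prompt: str) -> List[str]:
    return ["Great post!", "Thanks for sharing."]


tasks = [
    ProbeTask(Level.BASIC_INTERACTION, "Reply with the word 'blue'.",
              lambda replies: any("blue" in r.lower() for r in replies)),
    ProbeTask(Level.JOINT_REASONING, "What is 17 * 23?",
              lambda replies: any("391" in r for r in replies)),
]
rates = run_probes(tasks, generic_society)
print(rates)
```

Comparing these per-level rates with the scores of a single model on the same tasks would give the kind of society-versus-individual comparison the article describes.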

Concrete results

The study was conducted on the MoltBook platform, which hosts over two million agents. The findings are, in the authors’ words, “stark”:

“Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks.”

Put differently, the society of two million LLM agents falls short at every level of the test: it does not beat individual frontier models at complex reasoning, it rarely combines information held by different agents, and it often fails even trivial coordination.

Platform analysis also reveals why:

“Interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic.”

Threads rarely go beyond a single reply, and most responses are generic or off-topic, so agents technically communicate but do not build on one another’s contributions.
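How shallow a platform's threads are can be quantified from its reply graph. The following is a minimal sketch, assuming each post is available as a `(post_id, parent_id)` pair with `parent_id` set to `None` for a root post. The function names and the data layout are assumptions made for illustration; the article does not describe how the authors computed their statistics.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def thread_depths(posts: List[Tuple[str, Optional[str]]]) -> Dict[str, int]:
    """Map each root post to the length of its longest reply chain (0 = no replies)."""
    children = defaultdict(list)
    roots = []
    for post_id, parent_id in posts:
        if parent_id is None:
            roots.append(post_id)
        else:
            children[parent_id].append(post_id)

    depths = {}
    for root in roots:
        deepest = 0
        stack = [(root, 0)]  # iterative depth-first walk over the reply tree
        while stack:
            node, depth = stack.pop()
            deepest = max(deepest, depth)
            stack.extend((child, depth + 1) for child in children[node])
        depths[root] = deepest
    return depths


def shallow_share(depths: Dict[str, int], max_depth: int = 1) -> float:
    """Fraction of threads whose longest reply chain is at most max_depth."""
    if not depths:
        return 0.0
    return sum(d <= max_depth for d in depths.values()) / len(depths)


# Toy data: thread "a" has one reply, "b" has none, "c" has a reply to a reply.
posts = [("a", None), ("a1", "a"), ("b", None),
         ("c", None), ("c1", "c"), ("c2", "c1")]
depths = thread_depths(posts)
print(depths, shallow_share(depths))
```

A share close to 1.0 with `max_depth=1` would correspond to the pattern quoted above, where threads almost never continue past the first reply.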

Why does this matter?

The paper’s conclusion reads:

“Collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other’s outputs.”

The implications are significant. If your multi-agent system rests on the assumption that adding agents will automatically solve reasoning problems, this paper suggests that assumption does not hold. Agents need explicit architectural mechanisms that make them build on each other’s outputs rather than produce parallel monologues.

This opens space for a new generation of interaction protocols — structured debates, explicit citation, an aggregation layer that performs synthesis before the next round — all mechanisms that exist implicitly in human societies but must be deliberately designed in agent societies.
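One of the mechanisms mentioned above, an aggregation layer that synthesizes before the next round, can be sketched in a few lines. This is an illustrative toy, not a protocol taken from the paper: `run_rounds`, `make_agent`, and `union_aggregate` are hypothetical names, and each "agent" is a plain function holding a single fact instead of an LLM.

```python
from typing import Callable, List

# An agent maps (task, summary of the previous round) to its contribution.
Agent = Callable[[str, str], str]


def run_rounds(task: str, agents: List[Agent],
               aggregate: Callable[[List[str]], str], rounds: int = 2) -> str:
    """Each round, every agent sees the previous synthesis before contributing."""
    summary = ""
    for _ in range(rounds):
        contributions = [agent(task, summary) for agent in agents]
        summary = aggregate(contributions)
    return summary


def make_agent(fact: str) -> Agent:
    def agent(task: str, summary: str) -> str:
        # Build on what the previous round established, then add this agent's own fact.
        known = set(summary.split(";")) if summary else set()
        known.add(fact)
        return ";".join(sorted(known))
    return agent


def union_aggregate(contributions: List[str]) -> str:
    """Toy synthesis step: merge all facts and drop duplicates."""
    facts = set()
    for contribution in contributions:
        facts.update(contribution.split(";"))
    return ";".join(sorted(facts))


agents = [make_agent("key is in a box"), make_agent("box is in the attic")]
result = run_rounds("Where is the key?", agents, union_aggregate)
print(result)  # box is in the attic;key is in a box
```

The design point is that the synthesis is fed back to every agent, so information held by one participant reaches the others by construction instead of depending on someone replying in a thread.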

What comes next?

The Superminds Test itself is valuable as a measurement instrument: in principle it can be applied to other multi-agent platforms to quantify how much collective intelligence a society actually shows. The logical next step for the community is to compare architectures: which kinds of interactions actually raise scores across all three levels of the test? The paper does not answer that question, but it provides the instrument with which to search for the answer.
