arXiv:2604.24697: SciCrafter shows GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5 plateau at ~26% on a Minecraft discovery-to-application test
Why it matters
SciCrafter is a new Minecraft-based benchmark that tests whether AI agents can discover causal regularities and then apply them to build functional systems: the complete discovery-to-application loop. GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5 all plateau at ~26% success. The authors decompose the loop into four capabilities and find that the bottleneck has shifted from solving problems to asking the right questions, a finding that bears directly on how the next generation of agents should be designed.
On 27 April 2026, a team of 12 authors (including Yitao Liang, Demetri Terzopoulos, and Ying Nian Wu) published the SciCrafter paper (arXiv:2604.24697). It introduces a Minecraft-based benchmark that tests what MMLU and LMArena (formerly Chatbot Arena) largely do not: the ability of an AI agent to discover a causal regularity and apply it in a functional construction. This is the complete discovery-to-application loop, and frontier models plateau on it.
How is the test constructed?
Agents receive a parametric task: build a redstone circuit (Minecraft's in-game logic wiring) that lights a given configuration of lamps either simultaneously or in a timed sequence. Scaling the target parameters, such as the number of lamps and the required timing, increases the construction complexity and the technical knowledge required, so the agent cannot simply "memorize" solutions from pretraining. The test forces a genuine discovery component rather than pattern matching.
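As a rough illustration of what "parametric" means here, the sketch below generates a family of lamp tasks by varying the lamp count and the timing. The class name, field names, and tick-based timing check are all assumptions made for this example; the article does not describe the benchmark's actual task schema.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LampTask:
    """Hypothetical task spec (not the benchmark's real schema)."""
    num_lamps: int
    mode: str        # "simultaneous" or "sequence"
    step_ticks: int  # delay between consecutive lamps; 0 when simultaneous

    def target_on_ticks(self) -> List[int]:
        """Tick (relative to the trigger) at which each lamp should turn on."""
        if self.mode == "simultaneous":
            return [0] * self.num_lamps
        return [i * self.step_ticks for i in range(self.num_lamps)]


def task_family(lamp_counts: Sequence[int], step_options: Sequence[int]) -> List[LampTask]:
    """Enumerate tasks over the parameter grid; larger values mean harder builds."""
    tasks = []
    for n in lamp_counts:
        tasks.append(LampTask(n, "simultaneous", 0))
        for step in step_options:
            tasks.append(LampTask(n, "sequence", step))
    return tasks


def is_solved(task: LampTask, observed_on_ticks: Sequence[int]) -> bool:
    """A build counts as solved only if every lamp lights at its target tick."""
    return list(observed_on_ticks) == task.target_on_ticks()
```

Because the grid can be extended to lamp counts and delays of any size, a fixed memorized circuit stops being sufficient once the parameters move past anything seen before.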
Which models were tested and what were the results?
The frontier evaluation ran three models under a general-purpose code agent scaffold: GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5. All three plateau at ~26% success. The difference between models is smaller than the reproducibility noise, which suggests the limitation lies not in any individual model but in the entire class of approach.
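One way to read "smaller than reproducibility noise" is to compare the gap between per-model mean success rates with the run-to-run standard deviation of a single model. The sketch below shows that comparison under this assumption; the article does not say how the paper quantifies noise, and the function name and input layout are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple


def spread_vs_noise(runs_by_model: Dict[str, Sequence[float]]) -> Tuple[float, float, bool]:
    """Compare between-model spread with within-model run-to-run noise.

    runs_by_model maps a model name to its success rates over repeated runs
    (at least two runs per model). Returns (between, within, indistinguishable).
    """
    means = {name: mean(runs) for name, runs in runs_by_model.items()}
    # Gap between the best and worst model on average.
    between = max(means.values()) - min(means.values())
    # Largest sample standard deviation across repeated runs of one model.
    within = max(stdev(runs) for runs in runs_by_model.values())
    return between, within, between < within
```

When the third value is true, ranking the models by their mean score says little, because rerunning the same model could move its score by more than the gap separating it from the others.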
Why is this an important signal?
The authors decompose the discovery-to-application loop into four capabilities: knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application. Targeted interventions measure each capability's contribution. The main finding is that, for frontier models, the biggest obstacle is no longer knowledge application (the classic "I don't know this algorithm" failure) but gap identification: the model does not know what it does not know, and so does not know which questions to ask. In the authors' words: "the bottleneck shifts from solving problems correctly to posing the right problems."
This directly affects how the next generation of agentic systems should be designed. Tool-use and ReAct loops assume the agent knows what to look for, and SciCrafter shows that this assumption does not always hold. The benchmark has been released as an open diagnostic probe.
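The idea of measuring each capability's contribution through targeted interventions can be sketched as a simple ablation loop: assist the agent with one capability at a time and record how much the success rate moves. This is an assumed protocol for illustration only; the article does not detail how the paper implements its interventions, and `evaluate` is a hypothetical callable supplied by the caller.

```python
from typing import Callable, Dict, FrozenSet

# The four capabilities named in the article.
CAPABILITIES = (
    "gap_identification",
    "experimental_discovery",
    "knowledge_consolidation",
    "knowledge_application",
)


def intervention_gains(evaluate: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Success-rate gain from assisting one capability at a time.

    evaluate(assisted) is assumed to return the agent's success rate when
    the capabilities in `assisted` are handled for it (for example by
    handing it the correct answer for that stage).
    """
    baseline = evaluate(frozenset())
    return {cap: evaluate(frozenset({cap})) - baseline for cap in CAPABILITIES}


def largest_bottleneck(gains: Dict[str, float]) -> str:
    """The capability whose assistance helps most is the current bottleneck."""
    return max(gains, key=gains.get)
```

Under this reading, the article's headline finding corresponds to `largest_bottleneck` returning `"gap_identification"` for frontier models rather than `"knowledge_application"`.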
This article was generated using artificial intelligence from primary sources.