arXiv:2604.24697: SciCrafter shows GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5 plateau at ~26% on a Minecraft discovery-to-application test
SciCrafter is a new Minecraft-based benchmark that tests AI agents' ability to discover causal regularities and apply them to build functional systems — the complete discovery-to-application loop. GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5 all plateau at ~26% success. The authors decompose the loop into four capabilities and find that the bottleneck has shifted from problem solving to asking the right questions — a key signal for the next generation of agents.
This article was generated using artificial intelligence from primary sources.
A team of 12 authors (including Yitao Liang, Demetri Terzopoulos, and Ying Nian Wu) published the paper SciCrafter (arXiv:2604.24697) on 27 April 2026. It introduces a Minecraft-based benchmark that tests what LMArena, MMLU, and Chatbot Arena barely do: the ability of an AI agent to discover a causal regularity and apply it in a functional construction. This is the complete discovery-to-application loop, and frontier models plateau on it.
How is the test constructed?
Agents receive a parametric task: build a redstone circuit (Minecraft's in-game logic wiring) that lights a given configuration of lamps either simultaneously or in a timed sequence. Scaling the target parameters, such as the number of lamps and the required timing, increases the construction complexity and the technical knowledge required, which prevents the agent from simply “memorizing” solutions from pretraining. The test forces a genuine discovery component rather than pattern matching.
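To make the idea of a parametric task concrete, here is a minimal Python sketch. It is not the benchmark's actual schema: `LampTask`, `interval_ticks`, `satisfies`, and `task_family` are hypothetical names, and the assumption that a timed sequence means evenly spaced light-up times is ours, not the paper's.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LampTask:
    """Hypothetical task spec; the real benchmark format may differ."""
    num_lamps: int
    interval_ticks: int  # 0 means all lamps must light simultaneously

    def target_ticks(self) -> List[int]:
        # Assumption: lamp i should switch on i * interval_ticks after lamp 0.
        return [i * self.interval_ticks for i in range(self.num_lamps)]


def satisfies(task: LampTask, observed_ticks: List[int]) -> bool:
    """Compare observed light-up ticks with the target, relative to the first lamp."""
    if not observed_ticks or len(observed_ticks) != task.num_lamps:
        return False
    start = observed_ticks[0]
    relative = [t - start for t in observed_ticks]
    return relative == task.target_ticks()


def task_family(max_lamps: int, intervals: List[int]) -> List[LampTask]:
    # Sweep both parameters so the tasks grow in required complexity.
    return [LampTask(n, d) for n in range(1, max_lamps + 1) for d in intervals]
```

Sweeping `num_lamps` and `interval_ticks` yields a family of related tasks rather than one fixed puzzle, which is the property the article credits with blocking memorized solutions.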
Which models were tested and what were the results?
The frontier models evaluated, all under a general-purpose code agent scaffold, were GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5. All three plateau at ~26% success. The difference between models is smaller than reproducibility noise, which suggests the issue is not an individual model but the entire class of approach.
Why is this an important signal?
The authors decompose the discovery-to-application loop into four capabilities: knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application. Targeted interventions measure each capability’s contribution. The main finding: for frontier models the biggest obstacle is no longer knowledge application (the classic “I don’t know this algorithm”) but gap identification. The model does not know what it does not know, and so does not know which questions to ask. In the authors’ words: “the bottleneck shifts from solving problems correctly to posing the right problems.” This directly affects how the next generation of agentic systems should be designed: tool-use and ReAct loops assume the agent knows what to look for, and SciCrafter shows that this assumption does not always hold. The benchmark has been released as an open diagnostic probe.
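The article does not describe the intervention protocol, so the following Python sketch only illustrates one simple way such an attribution could work. It assumes that a capability can be supplied by an oracle and that its contribution is the gain in success rate over the unassisted baseline. The `evaluate` callback, `intervention_gains`, and `bottleneck` are hypothetical names, not the authors' code.

```python
from enum import Enum
from typing import Callable, Dict, FrozenSet


class Capability(Enum):
    GAP_IDENTIFICATION = "knowledge gap identification"
    EXPERIMENTAL_DISCOVERY = "experimental discovery"
    KNOWLEDGE_CONSOLIDATION = "knowledge consolidation"
    KNOWLEDGE_APPLICATION = "knowledge application"


# Hypothetical harness hook: given the set of capabilities supplied by an
# oracle, return the agent's success rate in [0, 1].
Evaluate = Callable[[FrozenSet[Capability]], float]


def intervention_gains(evaluate: Evaluate) -> Dict[Capability, float]:
    """Gain in success rate when each capability alone is assisted."""
    baseline = evaluate(frozenset())
    return {cap: evaluate(frozenset({cap})) - baseline for cap in Capability}


def bottleneck(gains: Dict[Capability, float]) -> Capability:
    # Assumption: the capability whose assistance helps most is the bottleneck.
    return max(gains, key=lambda cap: gains[cap])
```

Under this reading, the reported finding corresponds to `bottleneck` returning `Capability.GAP_IDENTIFICATION` for frontier models. Single-capability interventions ignore interactions between stages, so a real study might also assist combinations.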
Frequently Asked Questions
- What does 'discovery-to-application gap' mean?
- It refers to how far agents fall short on the discovery-to-application loop, in which an agent must discover a causal regularity (e.g. the timing logic of a redstone circuit) and apply it to build a functional system (e.g. lighting lamps in a specified pattern). The benchmark measures the joint capacity for discovery and execution, something traditional LLM benchmarks barely test.
- What is the main bottleneck the authors identify?
- For frontier models (GPT-5.2, Gemini 3 Pro, Claude Opus 4.5) the biggest new bottleneck is 'knowledge gap identification' — the ability to recognize what the agent does NOT know and which question to ask in the first place. The shift is from 'solving correctly' to 'asking the right question'.