AI2: AI agents solve 80% of school-level science but only 20% of real scientific problems
Why it matters
The Allen Institute for AI analyzes two benchmarks that reveal a dramatic gap between AI performance on knowledge tests and the ability to make real scientific discoveries. While models reach about 80% on school-level tasks, they drop to roughly 20% on complex scientific discovery tasks.
The Allen Institute for AI (AI2) has published an analysis that exposes one of the most important gaps in the capabilities of today’s AI systems: the difference between “textbook knowledge” and the capacity for genuine scientific discovery.
Two benchmarks, two stories
ScienceWorld tests elementary science experiments in a virtual environment: determining boiling points, performing genetic crosses, and similar tasks. Interestingly, models that achieved excellent results on multiple-choice questions covering the same topics initially scored below 10% on ScienceWorld. By early 2025, top models reached about 80%, which is solid but still short of complete mastery of fourth-grade material.
DiscoveryWorld is significantly more demanding — 120 tasks across eight scientific domains (proteomics, epidemiology, radioisotope dating, etc.) that require hypothesis formation, experiment design, execution, and analysis. Tasks are set in fictional contexts to prevent reliance on memorized knowledge.
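Both benchmarks share the same interactive structure: an agent reads a text observation, issues a text action, and is scored on task progress over many steps, rather than answering a single question. The sketch below illustrates that loop in miniature; the class and function names are hypothetical stand-ins and do not reflect the actual ScienceWorld or DiscoveryWorld APIs.

```python
# Illustrative sketch of a text-based agent-environment loop, the kind of setup
# ScienceWorld and DiscoveryWorld evaluate. All names (TextEnv, run_episode) are
# hypothetical; real environments simulate a full virtual lab.
from typing import Callable

class TextEnv:
    """Toy stand-in for an interactive science environment."""
    def __init__(self, goal: str):
        self.goal = goal
        self.steps = 0

    def observe(self) -> str:
        return f"Step {self.steps}: you are in a lab. Goal: {self.goal}"

    def step(self, action: str) -> tuple[str, float, bool]:
        # A real environment would simulate objects and instruments; here we
        # only reward one specific action and cap the episode length.
        self.steps += 1
        done = action == "measure temperature" or self.steps >= 10
        reward = 1.0 if action == "measure temperature" else 0.0
        return self.observe(), reward, done

def run_episode(env: TextEnv, agent: Callable[[str], str]) -> float:
    """Run one episode: the agent maps each observation to a text action."""
    score, done = 0.0, False
    obs = env.observe()
    while not done:
        action = agent(obs)          # in a benchmark run, an LLM prompted with obs
        obs, reward, done = env.step(action)
        score += reward
    return score

if __name__ == "__main__":
    # Trivial scripted "agent" for demonstration; a benchmarked agent would be an LLM.
    script = iter(["open cupboard", "take thermometer", "measure temperature"])
    print(run_episode(TextEnv("determine the boiling point of water"),
                      lambda obs: next(script, "wait")))
```

The point of the sketch is the gap it makes visible: a model can know the boiling point of water as a fact yet still fail to plan and execute the sequence of actions needed to measure it.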
A sobering comparison
On the more complex DiscoveryWorld tasks, AI agents complete only about 20%, while human scientists with advanced degrees solve about 70%. This 50 percentage-point gap makes clear how long the road is from “knowing facts” to “knowing how to apply them for discovery.”
What this means
These results serve as an important reality check amid the enthusiasm around AI in science. While AI systems are excellent at data processing and pattern recognition, the ability to devise new experiments, adapt when things do not go as planned, and think creatively remains a deeply human skill.