AI2: AI agents solve 80% of school-level science but only 20% of real scientific problems
Why it matters
The Allen Institute for AI analyzes two benchmarks that reveal a dramatic gap between AI performance on knowledge tests and the ability to make real scientific discoveries. While models reach about 80% on school-level tasks, they drop to roughly 20% on complex scientific discovery tasks.
The Allen Institute for AI (AI2) has published an analysis that exposes one of the most important gaps in the capabilities of today’s AI systems: the difference between “textbook knowledge” and the capacity for genuine scientific discovery.
Two benchmarks, two stories
ScienceWorld tests elementary science experiments in a virtual environment: determining boiling points, performing genetic crosses, and similar tasks. Interestingly, models that achieved excellent results on multiple-choice questions covering the same topics initially scored below 10% on ScienceWorld. By early 2025, top models reached about 80%, which is solid but still short of complete mastery of fourth-grade material.
DiscoveryWorld is significantly more demanding — 120 tasks across eight scientific domains (proteomics, epidemiology, radioisotope dating, etc.) that require hypothesis formation, experiment design, execution, and analysis. Tasks are set in fictional contexts to prevent reliance on memorized knowledge.
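Both benchmarks share the same interactive structure: an agent reads a text observation, issues a text action, and is scored on task progress over many steps, rather than answering a single question. The sketch below illustrates that loop in miniature; the class and function names are hypothetical stand-ins and do not reflect the actual ScienceWorld or DiscoveryWorld APIs.

```python
# Illustrative sketch of a text-based agent-environment loop, the kind of setup
# ScienceWorld and DiscoveryWorld evaluate. All names (TextEnv, run_episode) are
# hypothetical; real environments simulate a full virtual lab.
from typing import Callable

class TextEnv:
    """Toy stand-in for an interactive science environment."""
    def __init__(self, goal: str):
        self.goal = goal
        self.steps = 0

    def observe(self) -> str:
        return f"Step {self.steps}: you are in a lab. Goal: {self.goal}"

    def step(self, action: str) -> tuple[str, float, bool]:
        # A real environment would simulate objects and instruments; here we
        # only reward one specific action and cap the episode length.
        self.steps += 1
        done = action == "measure temperature" or self.steps >= 10
        reward = 1.0 if action == "measure temperature" else 0.0
        return self.observe(), reward, done

def run_episode(env: TextEnv, agent: Callable[[str], str]) -> float:
    """Run one episode: the agent maps each observation to a text action."""
    score, done = 0.0, False
    obs = env.observe()
    while not done:
        action = agent(obs)          # in a benchmark run, an LLM prompted with obs
        obs, reward, done = env.step(action)
        score += reward
    return score

if __name__ == "__main__":
    # Trivial scripted "agent" for demonstration; a benchmarked agent would be an LLM.
    script = iter(["open cupboard", "take thermometer", "measure temperature"])
    print(run_episode(TextEnv("determine the boiling point of water"),
                      lambda obs: next(script, "wait")))
```

The point of the sketch is the gap it makes visible: a model can know the boiling point of water as a fact yet still fail to plan and execute the sequence of actions needed to measure it.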
A sobering comparison
On the more complex DiscoveryWorld tasks, AI agents complete only about 20%, while human scientists with advanced degrees solve about 70%. This 50 percentage-point gap makes clear how long the road is from “knowing facts” to “knowing how to apply them for discovery.”
What this means
These results serve as an important reality check amid the enthusiasm around AI in science. While AI systems are excellent at data processing and pattern recognition, the ability to devise new experiments, adapt when things do not go as planned, and think creatively remains a deeply human skill.