🤖 Models · Friday, May 1, 2026 · 2 min read

AstaBench Spring 2026: Claude Opus 4.7 leads with 58% in scientific AI benchmark, GPT-5.5 half the cost

[Editorial illustration: leaderboard table with AI model performance graphs on scientific tasks, neutral laboratory aesthetic]

The Allen Institute for AI has published the updated AstaBench leaderboard, spanning 2,400 problems for AI agents in science. Claude Opus 4.7 leads at 58.0%, while GPT-5.5 scores 52.9% at roughly half the cost per problem. Key finding: strong results on individual tasks do not automatically translate into robust end-to-end scientific work.

The Allen Institute for AI (AI2) published the updated AstaBench leaderboard on April 30, 2026, the most comprehensive public overview to date of AI agent capabilities in scientific research.

How does AstaBench evaluate AI models for science?

AstaBench evaluates AI agents on more than 2,400 problems that simulate real challenges from research practice, ranging from data analysis and coding to literature synthesis and hypothesis generation. The benchmark is designed to go beyond typical accuracy leaderboards built on isolated tasks.

The Spring 2026 update includes an expanded set of models and emphasizes the economic dimension: alongside accuracy, costs per solved problem are also published.

Which models lead and at what cost?

Claude Opus 4.7 (Anthropic) takes first place with a score of 58.0%, making it the leading model for end-to-end solving of scientific agentic tasks under AstaBench methodology.

GPT-5.5 (OpenAI) achieves 52.9%, which is 5.1 percentage points lower, but at a cost of $1.61 per problem compared to $3.54 for Opus 4.7. For research teams trying to scale experiments, that roughly 54% lower cost can be a decisive factor.
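One way to read these figures is to normalize cost by accuracy, i.e., the expected spend to obtain one *correct* solution. A minimal sketch, using only the accuracy and per-problem cost reported above (the helper function is illustrative, not part of AstaBench):

```python
def cost_per_correct(cost_per_problem: float, accuracy: float) -> float:
    """Expected spend to obtain one correctly solved problem."""
    return cost_per_problem / accuracy

# Figures from the Spring 2026 AstaBench leaderboard
opus = cost_per_correct(3.54, 0.580)  # Claude Opus 4.7
gpt = cost_per_correct(1.61, 0.529)   # GPT-5.5

print(f"Opus 4.7: ${opus:.2f} per correct answer")  # ~$6.10
print(f"GPT-5.5:  ${gpt:.2f} per correct answer")   # ~$3.04
```

On this metric the gap widens: each correct answer from Opus 4.7 costs about twice as much as one from GPT-5.5.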

Key finding: why do high task scores not guarantee success?

The Allen Institute specifically emphasizes that strong performance on individual tasks, such as code generation or data analysis, does not automatically translate into robust end-to-end scientific work.

Complex agentic scenarios require coordinating multiple steps, long-term planning, and consistent context tracking. Models that excel at isolated subtasks may struggle when they must integrate those capabilities into a cohesive research workflow.

Broader context and industry application

The AstaBench update comes with notes on industry partnerships, indicating growing commercial interest in structured evaluation of AI in research processes.

The results raise a practical question for research institutions: is the leading model's higher accuracy worth more than twice the cost per problem? The answer depends on the type and scale of the tasks a team solves.
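The trade-off can be framed as a marginal cost: how much does each *additional* correct answer from the more accurate model cost at scale? A hedged back-of-the-envelope sketch, assuming a hypothetical batch of 10,000 benchmark-like problems (the batch size is an assumption, not from the article):

```python
# Hypothetical workload size; the per-problem figures are from the article.
N = 10_000
extra_spend = (3.54 - 1.61) * N          # additional budget for Opus 4.7
extra_correct = (0.580 - 0.529) * N      # additional correct solutions
marginal = extra_spend / extra_correct   # dollars per extra correct answer

print(f"${marginal:.2f} per additional correct answer")  # ~$37.84
```

Whether roughly $38 per extra correct answer is worth paying depends on how costly a wrong or missing answer is in the team's workflow.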

Frequently Asked Questions

What does AstaBench measure?
AstaBench (Allen Institute for AI) measures AI agents' ability to solve problems typical of real scientific research — covering more than 2,400 tasks from various scientific domains.
Why might GPT-5.5 be a better choice than Opus 4.7 despite lower accuracy?
GPT-5.5 costs $1.61 per problem, while Opus 4.7 costs $3.54. A roughly 54% lower cost with only a 5.1-percentage-point gap in accuracy makes GPT-5.5 the cost-efficient choice for larger experiments.

This article was generated using artificial intelligence from primary sources.