arXiv:2605.22681: CUSP benchmark shows frontier models cannot reliably predict scientific breakthroughs

The CUSP benchmark tests AI models' ability to predict scientific breakthroughs from a database of 4,700 events. Frontier models (GPT-5, Claude Opus 4.7, Gemini 3 Pro) identify plausible research directions but are systematically overconfident about outcomes and timing. Additional pre-cutoff context barely helps, which the authors read as a structural limitation rather than an informational one.

This article was generated using artificial intelligence from primary sources.

An arXiv preprint published May 21, 2026, introduces CUSP (Curated Scientific Predictions), a benchmark for evaluating AI models’ ability to predict scientific breakthroughs. The database contains 4,700 scientific events across four domains: biomedicine, physics, climatology, and AI research. Frontier models — GPT-5, Claude Opus 4.7, and Gemini 3 Pro — were tested on their ability to assign an outcome probability to each event using pre-cutoff context (everything publicly known at the moment before the outcome).

How does CUSP formulate questions?

Each of the 4,700 events in the database is formulated as a binary question with a known outcome: “Will the mRNA malaria vaccine achieve >70 percent efficacy in phase 3 by October 2024?” “Will an open-source LLM with ≥1T parameters be released by December 2025?” “Will atmospheric CO₂ concentration exceed 425 ppm by December 2024?”

The model receives access to all publicly available information up to the cutoff date (the date before the outcome was known) and is asked to estimate the probability of a yes-outcome. Performance is measured by the Brier score, the mean squared difference between the predicted probability and the 0/1 outcome, which reflects both calibration and discrimination. A calibration curve additionally shows how well stated probabilities align with actual outcome frequencies.
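The scoring rule can be made concrete with a minimal Python sketch. It uses the standard definition of the Brier score for binary events; the function name `brier_score` and the four example forecasts are invented for illustration and are not taken from the CUSP data.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical outcomes for four resolved questions (1 = yes, 0 = no).
outcomes = [1, 0, 0, 1]

# A forecaster that always answers 50 percent scores 0.25 whatever the outcomes.
baseline = brier_score([0.5] * len(outcomes), outcomes)

# One confident miss (0.90 on a "no") dominates the overconfident forecaster's score.
overconfident = brier_score([0.95, 0.90, 0.10, 0.95], outcomes)
```

Because the error is squared, a single 90 percent forecast that resolves "no" contributes 0.81 to the sum, which is why overconfidence is so costly under this metric.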

What are the results for frontier models?

All three frontier models achieve Brier scores between 0.18 and 0.21 (lower is better; a perfect score is 0). For comparison, a naive “always 50 percent” baseline scores 0.25, while human domain experts average 0.14. The models therefore beat the uninformative baseline but still lag behind human experts.

The main pathology is overconfidence. A model assigns a prediction 90 percent confidence, but the actual success rate of such predictions is 60–70 percent. In the 95–99 percent confidence range (where the model claims near certainty), the actual success rate drops to 65 percent for GPT-5 and 71 percent for Claude Opus 4.7. This means that when a model says “almost certain to happen,” one should actually treat it as roughly 70 percent probability.
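The gap between stated confidence and observed success rate is what a calibration table measures. The sketch below is one common way to compute it, by grouping forecasts into equal-width confidence bins; the function name `calibration_table`, the bin count, and the sample data are assumptions for illustration, not the authors' evaluation code.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def calibration_table(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> Dict[int, Tuple[float, float, int]]:
    """Map bin index -> (mean stated probability, observed yes-rate, count)."""
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins[idx].append((p, o))
    table = {}
    for idx, items in sorted(bins.items()):
        mean_p = sum(p for p, _ in items) / len(items)
        yes_rate = sum(o for _, o in items) / len(items)
        table[idx] = (mean_p, yes_rate, len(items))
    return table


# Hypothetical data: ten forecasts at 0.9, of which only six resolved yes.
probs = [0.9] * 10
outcomes = [1] * 6 + [0] * 4
table = calibration_table(probs, outcomes)
mean_p, yes_rate, count = table[9]
gap = mean_p - yes_rate  # a positive gap means overconfidence in this bin
```

A well-calibrated forecaster would show a gap near zero in every bin; the pattern described above corresponds to large positive gaps in the top bins.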

What does “the limitation is structural” mean?

The authors tested whether additional pre-cutoff context helps. They provided models with additional relevant arXiv papers, news archives, and expert commentary, all pre-cutoff, so nothing leaked the actual outcome. Performance improved only marginally: the Brier score moved from 0.21 to 0.19 (lower is better), still well short of the human expert average of 0.14.

The authors interpret this as meaning the limitation is not a lack of information. The limitation is structural: models do not distinguish between “scientifically plausible” and “will actually happen.” When a model reads 50 papers about a promising mRNA malaria vaccine, it detects plausibility but cannot assess operational barriers such as how long phase 3 will take, how the FDA will respond, or whether sponsors will have the budget to scale. That information is publicly available, but not in a form the model knows how to extract.

What does this change for the use of AI in science?

The practical implications for AI-assisted forecasting are concrete. First, do not rely on AI point probability estimates: use AI to identify relevant signals (papers, data, expert statements) and let a human forecaster integrate them. Second, if AI is used for forecasting, its calibration must be verified separately. Until better calibration is demonstrated, a model that says “90 percent confidence” should be read as roughly “70 percent.”
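One simple way to apply the second point is histogram binning: replace a stated probability with the yes-rate observed for past forecasts in the same confidence bin. The sketch below is an illustration of that general technique, not a method from the paper; the function names and the history data are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_bin_counts(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[int, int]]:
    """Count (yes, total) per confidence bin on already-resolved questions."""
    counts = [[0, 0] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx][0] += o
        counts[idx][1] += 1
    return [(yes, total) for yes, total in counts]


def recalibrate(p: float, counts: Sequence[Tuple[int, int]]) -> float:
    """Replace a stated probability with its bin's observed yes-rate."""
    idx = min(int(p * len(counts)), len(counts) - 1)
    yes, total = counts[idx]
    return yes / total if total else p  # no history for this bin: leave as-is


# Hypothetical history: ten forecasts at 0.9, seven of which resolved yes.
history = fit_bin_counts([0.9] * 10, [1] * 7 + [0] * 3)
adjusted = recalibrate(0.92, history)   # falls in the 0.9-1.0 bin
untouched = recalibrate(0.30, history)  # no history in that bin
```

The mapping is only as good as the resolved history behind each bin, so sparse bins should be treated with caution.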

Third, for scientific forecasting the authors propose structured prompting that explicitly asks the model to enumerate barriers and reasons why the predicted event might not happen. This helps reduce overconfidence even if it does not eliminate the problem.
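As a rough illustration of what such structured prompting might look like, the sketch below assembles a prompt that asks for barriers before asking for a probability. The function `build_barrier_prompt`, its parameters, and the prompt wording are all hypothetical; the article does not reproduce the prompt used by the authors.

```python
def build_barrier_prompt(question: str, context: str, n_barriers: int = 5) -> str:
    """Assemble a forecasting prompt that asks for failure modes first.

    The wording is illustrative only; it is not the prompt used in the paper.
    """
    return (
        "You are forecasting the outcome of a binary question.\n\n"
        f"Question: {question}\n\n"
        f"Context available before the cutoff date:\n{context}\n\n"
        f"Step 1. List up to {n_barriers} concrete barriers (regulatory, funding, "
        "timeline, technical) that could stop this from happening by the deadline.\n"
        "Step 2. List the strongest reasons it could still happen.\n"
        "Step 3. Only then give a single probability between 0 and 1 for a yes-outcome."
    )


prompt = build_barrier_prompt(
    question="Will atmospheric CO2 concentration exceed 425 ppm by December 2024?",
    context="(pre-cutoff documents would go here)",
)
```

Placing the probability request last is a design choice in this sketch: the model has to write out the obstacles before it commits to a number.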

The authors announce that CUSP will be updated quarterly with new events and that results will be published publicly for all frontier models.

Frequently Asked Questions

What is the CUSP benchmark?
CUSP (Curated Scientific Predictions) is a benchmark with 4,700 scientific events from biomedicine, physics, climatology, and AI research. Each event is formulated as a binary question (will X happen by Y) with a known outcome — the model receives pre-cutoff context and estimates probability.
What does overconfidence mean?
A model is overconfident when it rates its predictions with high probability (e.g., 90 percent) but the actual success rate of those predictions is lower (e.g., 60 percent). Frontier models on CUSP show systematic overconfidence — calibration is poor in the 70–95 percent confidence range.
Why does additional context not help?
The authors tested giving models additional pre-cutoff papers, news articles, and data, and performance improved only marginally. Their conclusion: the limitation is not a lack of information but a structural inability of the model to distinguish between “scientifically plausible” and “will actually happen.”