arXiv:2605.22681: CUSP benchmark shows frontier models cannot reliably predict scientific breakthroughs
The CUSP benchmark tests whether AI models can predict scientific breakthroughs, drawing on a database of 4,700 events. Frontier models (GPT-5, Claude Opus 4.7, Gemini 3 Pro) identify plausible research directions but are systematically overconfident, miscalibrating both the outcomes and the timing of breakthroughs. Additional pre-cutoff context does not help: the limitation is structural, not informational.
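
To make "miscalibrated with overconfidence" concrete, here is a minimal sketch of two standard forecasting measures: the Brier score and a simple confidence-minus-accuracy gap. This is an illustration only. The summary above does not say which metrics CUSP uses, and the function names and toy forecasts below are hypothetical, not taken from the paper.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; always guessing 0.5 scores 0.25.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def overconfidence_gap(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean stated confidence minus observed accuracy.

    Confidence is max(p, 1 - p), i.e. how sure the forecaster is of the
    side it picked. A positive gap means the forecaster is more confident
    than its hit rate justifies.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    confidences = [max(p, 1.0 - p) for p in probs]
    # A forecast counts as correct when the side it favours matches the outcome.
    hits = [1 if (p >= 0.5) == (o == 1) else 0 for p, o in zip(probs, outcomes)]
    return sum(confidences) / len(probs) - sum(hits) / len(probs)


# Hypothetical toy data (NOT from CUSP): four forecasts, each stated at 90%
# probability that a breakthrough occurs, of which only two came true.
toy_probs = [0.9, 0.9, 0.9, 0.9]
toy_outcomes = [1, 0, 0, 1]

print("Brier score:", round(brier_score(toy_probs, toy_outcomes), 3))
print("Overconfidence gap:", round(overconfidence_gap(toy_probs, toy_outcomes), 3))
```

In this toy case the forecaster states 90% confidence but is right only half the time, so the gap is large and positive. That is the pattern the summary attributes to the frontier models.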