KellyBench: AI agents managing a betting bankroll through the Premier League season — all leading models lost money
KellyBench is a new benchmark for testing sequential decision-making: AI agents manage a betting bankroll through the entire 2023/24 Premier League season, using statistics, lineups, and market odds. All leading models tested lost money, and Claude Opus 4.6 scored 26.5% on the expert rubric for strategy sophistication.
What is KellyBench and how does it work?
KellyBench is a new research benchmark that tests AI agents’ ability to make long-term financial decisions under volatile conditions. Agents play the role of a bettor across the entire 2023/24 English Premier League season. They receive detailed historical statistics, team lineups, and market betting odds, and their task is to maximize the value of a bankroll through hundreds of consecutive decisions.
Unlike standard benchmarks that measure accuracy on individual answers, KellyBench tests sequential decision-making — every mistake in risk management has cumulative financial consequences.
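To make the compounding concrete, here is a minimal Python sketch of a bankroll updated bet by bet. The function name, the bet tuple layout, and the numbers are illustrative assumptions, not KellyBench's actual interface or data.

```python
def run_bankroll(start, bets):
    """Apply a sequence of bets to a bankroll and return its history.

    Each bet is a hypothetical tuple (stake_fraction, decimal_odds, won):
    the share of the current bankroll staked, the decimal odds offered,
    and whether the bet won.
    """
    bankroll = start
    history = [bankroll]
    for stake_fraction, decimal_odds, won in bets:
        stake = bankroll * stake_fraction
        if won:
            # A winning bet returns the stake plus (odds - 1) times the stake.
            bankroll += stake * (decimal_odds - 1.0)
        else:
            # A losing bet forfeits the stake.
            bankroll -= stake
        history.append(bankroll)
    return history


# One loss followed by one win at even odds, staking half the bankroll each time.
history = run_bankroll(1000.0, [(0.5, 2.0, False), (0.5, 2.0, True)])
print(history)  # [1000.0, 500.0, 750.0]
```

Even with one win and one loss at even odds, the bankroll ends 25% down, because the second stake is sized against an already reduced bankroll. This is the sense in which staking errors accumulate across a season.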
What did the results show?
The results are unambiguous: every leading model tested lost money on average. None broke even, and even the strongest model finished with an average return of -8%. Several models went completely bust in individual trials, losing the entire bankroll.
Claude Opus 4.6 scored 26.5% on a separate expert rubric that evaluates strategy sophistication. This measure is distinct from profit: it reflects how far the agent’s approach goes beyond naive betting, not how much money the agent made.
Why does this matter for AI model development?
Sports betting is not just about predicting winners — it requires understanding probability theory, managing risk through losing streaks, and adapting strategy to changing market conditions. KellyBench reveals that current language models, regardless of their general capabilities, have significant weaknesses in long-term financial reasoning — a capability that is also critical for many real-world business applications.
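The benchmark's name presumably alludes to the Kelly criterion, a standard formula for sizing a stake from an estimated win probability and the offered odds. The article does not say how agents size their bets or whether Kelly staking is part of the scoring, so the sketch below is only an illustration of the kind of probability-based risk management involved. The function name and example numbers are assumptions.

```python
def kelly_fraction(win_prob, decimal_odds):
    """Fraction of the bankroll to stake under the Kelly criterion.

    f* = (b * p - q) / b, where b = decimal_odds - 1 (net odds),
    p = estimated win probability, and q = 1 - p.
    Returns 0.0 when the bet has no positive edge.
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must exceed 1.0")
    b = decimal_odds - 1.0
    q = 1.0 - win_prob
    return max((b * win_prob - q) / b, 0.0)


# A 55% estimated chance at even odds (2.0) suggests staking about 10%.
print(round(kelly_fraction(0.55, 2.0), 4))  # 0.1
# A 45% estimated chance at the same odds has no edge, so the stake is zero.
print(kelly_fraction(0.45, 2.0))  # 0.0
```

The formula only helps if the probability estimate is better than the market's. With an overestimated edge, the suggested stake is too large, which ties back to the losing streaks and ruin discussed above.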
Frequently Asked Questions
- What does KellyBench measure and how does it differ from standard AI benchmarks?
  - KellyBench tests long-term sequential decision-making under volatile market conditions: not one-shot answers, but the ability to manage risk across hundreds of consecutive decisions with financial consequences.
- What was the best model's result?
  - No model was profitable. Even the strongest achieved an average return of -8%, and several models experienced complete financial ruin in individual trials.
- What does Claude's score of 26.5% on the expert rubric mean?
  - The expert rubric evaluates strategy sophistication by comparing the agent's moves to what an experienced bettor would do. A score of 26.5% means Claude Opus 4.6 shows a partial understanding of bankroll management principles but remains far below a competent human level.
This article was generated using artificial intelligence from primary sources.