EvalEval Coalition: AI evaluation is becoming the new compute bottleneck — GAIA single run $2,829, HAL leaderboard $40,000, academic auditors hit a budget wall before a technical one
The EvalEval Coalition (Avijit Ghosh, Yifan Mai, Georgia Channing, Leshem Choshen) published an analysis on the HuggingFace blog on April 29, 2026, showing how AI model evaluation costs have exploded. A single GAIA run costs $2,829, the full HAL leaderboard $40,000 ($320,000 at k=8 reliability), and PaperBench around $9,500 per agent. Static benchmarks compress 100-200×, agentic ones only 2-3.5×, creating an accountability barrier for independent auditors.
The coalition's post shifts the AI compute discussion from training to evaluation and argues that the economics have flipped.
Concrete costs
Figures for individual frontier model evaluations in 2026:
| Benchmark | Cost |
|---|---|
| GAIA (single run) | $2,829 |
| Online Mind2Web (Browser-Use + Claude Sonnet 4) | $1,577 for 40% accuracy |
| HAL (Holistic Agent Leaderboard, full) | $40,000 for 21,730 rollouts |
| HAL with 8-run reliability | ~$320,000 |
| PaperBench (full) | ~$9,500 per agent |
| The Well (full sweep) | ~$9,600 |
| MLE-Bench (1 seed) | ~$5,500 |
For comparison: HELM (2022) cost around $100,000 in total for all models across all scenarios. In 2026, a single benchmark (HAL with k=8 reliability) exceeds that amount more than threefold.
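To make the multiplier concrete, here is a minimal back-of-the-envelope sketch in Python using the HAL figures from the table above; the `eval_cost` helper is illustrative and not part of any EvalEval tooling:

```python
def eval_cost(cost_per_rollout: float, rollouts: int, k: int = 1) -> float:
    """Total cost when every rollout is repeated k times for reliability."""
    return cost_per_rollout * rollouts * k

# HAL full leaderboard: $40,000 for 21,730 rollouts, i.e. roughly $1.84 per rollout
per_rollout = 40_000 / 21_730
print(f"single pass:     ${eval_cost(per_rollout, 21_730, k=1):,.0f}")  # ~$40,000
print(f"k=8 reliability: ${eval_cost(per_rollout, 21_730, k=8):,.0f}")  # ~$320,000
```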
Benchmark compression — what works for static does not work for agentic
| Type | Max compression | Ranking preserved |
|---|---|---|
| Static LLM benchmarks | 100-200× | ✓ |
| Agentic benchmarks | 2-3.5× | Partially |
| Training-in-loop | ~1× (impossible) | ✗ |
Flash-HELM, tinyBenchmarks, and Anchor Points successfully reduce static evals to about 1% of their original size without losing rankings. For agents, only mid-difficulty filtering achieves 2-3.5×, because multi-step interactions cannot simply be sub-sampled.
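A hedged sketch of the mid-difficulty filtering idea, assuming per-task solve rates from earlier runs are available; the thresholds, task names, and helper below are illustrative, not taken from the article:

```python
def mid_difficulty_subset(tasks, solve_rates, low=0.2, high=0.8):
    """Keep tasks that are neither trivially solved nor near-impossible.

    Tasks every agent solves (or none does) carry little ranking signal,
    so dropping them shrinks the benchmark while roughly preserving order.
    """
    return [t for t, p in zip(tasks, solve_rates) if low <= p <= high]

tasks = ["book_flight", "fix_repo", "fill_form", "plan_trip"]
solve_rates = [0.95, 0.10, 0.55, 0.40]  # estimated from earlier evaluation runs
print(mid_difficulty_subset(tasks, solve_rates))  # ['fill_form', 'plan_trip']
```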
Accountability barrier
Perhaps the article’s most important argument:
“Academic groups, AI Safety Institutes, and journalists now hit a budget barrier before a technical one when trying to independently evaluate frontier agents. A single GAIA run can exceed a PhD student’s annual travel budget.”
Specific figures:
- Three-seed comparison of six models: over $150,000
- HAL k=8 reliability: $320,000
- PaperBench with LLM judge: ~$9,500 per agent
The conflict: if only frontier labs can afford statistically reliable evaluation, the social process of evaluating AI systems concentrates within the very labs that build them. External validation becomes partial or absent.
Reliability multiplier and leakage
The study also documents a second problem: single-run accuracy is statistically unreliable.
- τ-bench example: drop from 60% (single) to 25% (8-run consistency)
- Holdout leakage: 12 of 17 agent benchmarks failed the holdout criterion
- τ-bench data poisoning discovered in December 2025, requiring removal
A proper k=8 reliability test multiplies all costs 8×.
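Why single-run accuracy overstates reliability: a minimal sketch in the spirit of a pass^k consistency score, where a task only counts if the agent solves it in all k runs. The mix of per-task success probabilities below is invented to mimic the τ-bench-style drop, not taken from the actual data:

```python
def single_run_accuracy(task_success_probs):
    """Expected accuracy when each task is attempted once."""
    return sum(task_success_probs) / len(task_success_probs)

def pass_k_consistency(task_success_probs, k=8):
    """Expected fraction of tasks solved in ALL k independent runs."""
    return sum(p ** k for p in task_success_probs) / len(task_success_probs)

# Illustrative mix: a quarter of tasks are solved reliably, most only sometimes.
probs = [1.0] * 25 + [0.55] * 64 + [0.0] * 11
print(f"single run: {single_run_accuracy(probs):.0%}")      # ~60%
print(f"pass^8:     {pass_k_consistency(probs, k=8):.0%}")   # ~26%
```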
Proposed solutions
The EvalEval Coalition proposes three directions:
- Standardized data sharing — unified metadata schema with converters for HELM, lm-eval-harness, Inspect AI (evaleval/EEE_datastore)
- Pareto-efficient leaderboards — accuracy reported alongside cost, not accuracy alone (see the sketch after this list)
- Mid-difficulty filtering — best-effort 2-3.5× compression for agents
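A minimal sketch of what a Pareto-efficient leaderboard view could compute, assuming each entry already carries an accuracy and an evaluation cost; the model names and figures are hypothetical:

```python
def pareto_frontier(entries):
    """Return entries not dominated on (higher accuracy, lower cost)."""
    frontier = []
    for name, acc, cost in entries:
        dominated = any(
            other_acc >= acc and other_cost <= cost and (other_acc, other_cost) != (acc, cost)
            for _, other_acc, other_cost in entries
        )
        if not dominated:
            frontier.append((name, acc, cost))
    return frontier

# Hypothetical leaderboard rows: (model, accuracy, eval cost in USD)
rows = [("agent-a", 0.62, 2800), ("agent-b", 0.60, 900), ("agent-c", 0.55, 1500)]
print(pareto_frontier(rows))  # agent-c drops out: agent-b is cheaper and more accurate
```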
Why does this matter?
The article is policy-relevant. The EU AI Act, NIST AI RMF, and the UK AISI evaluation framework all presuppose that independent evaluation is available. If evaluation costs more than a research grant, regulation exists only on paper.
“Whoever can pay for the evaluation gets to write the leaderboard.”
Practical implications for AI governance:
- Budgeting evaluation as a core governance function, not a technical cost
- Funding independent evaluation infrastructure (e.g., AISI, NIST budgets)
- Reliability reporting (pass^k) as a regulatory standard
- Considering eval cost when setting compliance requirements
Frequently Asked Questions
- How much does it actually cost to evaluate a frontier model?
- GAIA single run: $2,829. Online Mind2Web (Browser-Use + Claude Sonnet 4): $1,577 for 40% accuracy. Holistic Agent Leaderboard (HAL) full: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. HAL with 8-run reliability: ~$320,000. PaperBench (full): ~$9,500 per agent.
- Why don't agentic benchmarks compress like static ones?
- Static LLM benchmarks can be compressed 100-200× while preserving rankings (Flash-HELM, tinyBenchmarks, Anchor Points). Agentic benchmarks achieve only 2-3.5× (via mid-difficulty filtering), because multi-step interactions cannot be sub-sampled without losing information.
- What is the 'accountability barrier'?
- Academic groups, AI Safety Institutes, and journalists now hit a **budget** barrier before a technical one when trying to independently evaluate frontier agents. A single GAIA run can exceed a PhD student's annual travel budget. This means only the frontier labs that produce the models can afford credible evaluations, narrowing independent auditing.
This article was generated using artificial intelligence from primary sources.