🟡 🏥 In Practice · Thursday, April 30, 2026 · 3 min read

EvalEval Coalition: AI evaluation is becoming the new compute bottleneck — GAIA single run $2,829, HAL leaderboard $40,000, academic auditors hit a budget wall before a technical one

Editorial illustration: a scale tipping toward evaluation costs over training costs

The EvalEval Coalition (Avijit Ghosh, Yifan Mai, Georgia Channing, Leshem Choshen) published on April 29, 2026, an analysis on the HuggingFace blog showing how AI model evaluation costs have exploded. A single GAIA run costs $2,829, the HAL leaderboard $40,000 (k=8 reliability $320,000), and PaperBench around $9,500 per agent. Static benchmarks compress 100-200×, agentic ones only 2-3.5× — an accountability barrier for independent auditors.

The coalition's analysis shifts the AI compute discussion from training to evaluation and shows that the economics have flipped.

Concrete costs

Figures for individual frontier model evaluations in 2026:

| Benchmark | Cost |
|---|---|
| GAIA (single run) | $2,829 |
| Online Mind2Web (Browser-Use + Claude Sonnet 4) | $1,577 for 40% accuracy |
| HAL (Holistic Agent Leaderboard, full) | $40,000 for 21,730 rollouts |
| HAL with 8-run reliability | ~$320,000 |
| PaperBench (full) | ~$9,500 per agent |
| The Well (full sweep) | ~$9,600 |
| MLE-Bench (1 seed) | ~$5,500 |

For comparison: HELM (2022) cost around $100,000 in total for all models across all scenarios. In 2026, a single benchmark (HAL with reliability) exceeds that amount.
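
For a sense of how these figures compose, here is a back-of-the-envelope sketch using only the HAL numbers from the table above (the per-rollout figure is derived, not quoted from the article):

```python
# Back-of-the-envelope eval budgeting from the HAL figures quoted above.
full_run_cost = 40_000        # USD for one full HAL sweep
rollouts = 21_730             # rollouts in that sweep
k = 8                         # runs needed for a reliability test

per_rollout = full_run_cost / rollouts   # ~ $1.84 per rollout
reliable_cost = full_run_cost * k        # $320,000, matching the table

print(f"${per_rollout:.2f} per rollout; ${reliable_cost:,} at k={k}")
```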

Benchmark compression — what works for static does not work for agentic

| Type | Max compression | Ranking preserved |
|---|---|---|
| Static LLM benchmarks | 100-200× | Yes |
| Agentic benchmarks | 2-3.5× | Partially |
| Training-in-loop | ~1× (impossible) | n/a |

Flash-HELM, tinyBenchmarks, and Anchor Points successfully reduce static eval to 1% of the size without losing rankings. For agents, only mid-difficulty filtering achieves 2-3.5× — multi-step interactions cannot simply be sub-sampled.
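
To make the mid-difficulty idea concrete, here is a minimal sketch, assuming per-task solve rates averaged over reference models are already available; the function name and the 0.2-0.8 band are illustrative, not the coalition's actual implementation:

```python
def mid_difficulty_subset(solve_rates: dict[str, float],
                          low: float = 0.2, high: float = 0.8) -> list[str]:
    """Keep tasks that are neither trivially solved nor hopeless.

    solve_rates maps task_id -> mean solve rate across reference models.
    Tasks that every model passes (or fails) carry little ranking signal,
    so dropping them shrinks the benchmark while mostly preserving order.
    """
    return [task for task, rate in solve_rates.items() if low <= rate <= high]

# Illustrative solve rates for six hypothetical tasks
rates = {"t1": 0.97, "t2": 0.55, "t3": 0.03, "t4": 0.71, "t5": 0.12, "t6": 0.40}
print(mid_difficulty_subset(rates))  # ['t2', 't4', 't6'] -- half the rollouts
```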

Accountability barrier

Perhaps the article’s most important argument:

“Academic groups, AI Safety Institutes, and journalists now hit a budget barrier before a technical one when trying to independently evaluate frontier agents. A single GAIA run can exceed a PhD student’s annual travel budget.”

Specific figures:

  • Three-seed comparison of six models: over $150,000
  • HAL k=8 reliability: $320,000
  • PaperBench with LLM judge: ~$9,500 per agent

The conflict: if only frontier labs can afford statistically reliable evaluation, the social process of evaluating AI systems concentrates within the very labs that build them. External validation becomes partial or absent.

Reliability multiplier and leakage

The study also documents a second problem: single-run accuracy is statistically unreliable.

  • τ-bench example: drop from 60% (single) to 25% (8-run consistency)
  • Holdout leakage: 12 of 17 agent benchmarks failed the holdout criterion
  • τ-bench data poisoning discovered in December 2025, requiring removal

A proper k=8 reliability test multiplies every cost by eight: HAL's $40,000 full run becomes roughly $320,000.
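
The pass^k metric behind that multiplier counts a task as solved only if the agent succeeds in all k runs. A minimal sketch, with a hypothetical task mix chosen to echo the τ-bench-style drop (the numbers are illustrative, not the article's data):

```python
import random

def pass_at_1(outcomes: list[list[bool]]) -> float:
    """Mean single-run success rate over all tasks and runs."""
    return sum(map(sum, outcomes)) / sum(len(runs) for runs in outcomes)

def pass_pow_k(outcomes: list[list[bool]]) -> float:
    """Fraction of tasks solved in ALL k runs (pass^k consistency)."""
    return sum(all(runs) for runs in outcomes) / len(outcomes)

random.seed(0)
k = 8
# Hypothetical mix: 24% of tasks are always solved, the rest succeed
# only 47% of the time per run.
probs = [1.0] * 240 + [0.47] * 760
outcomes = [[random.random() < p for _ in range(k)] for p in probs]

print(f"pass@1  = {pass_at_1(outcomes):.0%}")   # around 60%
print(f"pass^{k} = {pass_pow_k(outcomes):.0%}")  # around 25%: the collapse
```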

Proposed solutions

The EvalEval Coalition proposes three directions:

  1. Standardized data sharing — unified metadata schema with converters for HELM, lm-eval-harness, Inspect AI (evaleval/EEE_datastore)
  2. Pareto-efficient leaderboards — accuracy along with cost, not accuracy alone (see the sketch after this list)
  3. Mid-difficulty filtering — best-effort 2-3.5× compression for agents
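
A minimal sketch of what a Pareto-efficient leaderboard computes, assuming each entry carries an accuracy score and an evaluation cost (the Entry fields and all numbers are hypothetical):

```python
from typing import NamedTuple

class Entry(NamedTuple):
    model: str
    accuracy: float   # benchmark score
    cost: float       # USD to evaluate

def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep entries not dominated by a cheaper, at-least-as-accurate one."""
    frontier, best_acc = [], float("-inf")
    for e in sorted(entries, key=lambda e: (e.cost, -e.accuracy)):
        if e.accuracy > best_acc:   # strictly beats everything cheaper
            frontier.append(e)
            best_acc = e.accuracy
    return frontier

# Hypothetical leaderboard rows
rows = [Entry("A", 0.62, 2800), Entry("B", 0.58, 900),
        Entry("C", 0.61, 3500), Entry("D", 0.40, 400)]
print(pareto_frontier(rows))   # D, B, A survive; C is dominated by A
```

Plotting the frontier makes the trade-off visible: a model that is slightly less accurate but an order of magnitude cheaper to evaluate stays on the board instead of vanishing behind the single top scorer.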

Why does this matter?

The article is policy-relevant. The EU AI Act, the NIST AI RMF, and the UK AISI evaluation framework all presuppose that independent evaluation is available. If evaluation costs more than a research grant, regulation exists only on paper.

“Whoever can pay for the evaluation gets to write the leaderboard.”

Practical takeaways for AI governance:

  • Budgeting evaluation as a core governance function, not a technical cost
  • Funding independent evaluation infrastructure (e.g., AISI, NIST budgets)
  • Reliability reporting (pass^k) as a regulatory standard
  • Considering eval cost when setting compliance requirements

Frequently Asked Questions

How much does it actually cost to evaluate a frontier model?
GAIA single run: $2,829. Online Mind2Web (Browser-Use + Claude Sonnet 4): $1,577 for 40% accuracy. Holistic Agent Leaderboard (HAL) full: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. HAL with 8-run reliability: ~$320,000. PaperBench (full): ~$9,500 per agent.
Why don't agentic benchmarks compress like static ones?
Static-benchmark compression methods (Flash-HELM, tinyBenchmarks, Anchor Points) achieve 100-200× compression while preserving rankings. Agentic benchmarks achieve only 2-3.5× (via mid-difficulty filtering) because multi-step interactions cannot be sub-sampled without losing information.
What is the 'accountability barrier'?
Academic groups, AI Safety Institutes, and journalists now hit a **budget** barrier before a technical one when trying to independently evaluate frontier agents. A single GAIA run can exceed a PhD student's annual travel budget. This means only the frontier labs that produce the models can afford credible evaluations, narrowing independent auditing.
🤖

This article was generated using artificial intelligence from primary sources.