Bayesian Audit Reveals Flaws in AI Leaderboards

The paper introduces a Bayesian audit framework showing that a single final leaderboard snapshot of 1,000 systems can correspond to multiple incompatible historical trajectories, with convergence times ranging from 23 to 75 steps. Drawing on archived data from LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench, the author proposes an archive-and-adjudication protocol for reconstructing scoring history and rejecting unsubstantiated claims about frontier models.

A new preprint presents a Bayesian audit framework for evaluation leaderboards and warns that public AI model rankings may conceal mutually incompatible narratives about progress.

What problem does the paper reveal?

The author shows that a single final leaderboard snapshot of 1,000 systems can be explained by several mutually incompatible historical trajectories. In other words, the same current ranking can arise from very different developmental paths, with convergence times to a given performance threshold ranging from 23 steps in one scenario to 75 in another. This undermines conclusions about the pace of progress that are drawn only from the latest state of a leaderboard.
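A toy illustration of this point, not the paper's model: the two synthetic score histories below end at the same final score, yet they reach a chosen threshold at very different steps. The function names, the 0-100 score scale, the 100-step horizon, and the threshold of 50 are all assumptions made for the sketch.

```python
def first_step_reaching(scores, threshold):
    """Return the 1-based step at which scores first reach threshold, or None."""
    for step, score in enumerate(scores, start=1):
        if score >= threshold:
            return step
    return None


def gradual_history(final_score, steps):
    """Hypothetical history: the score climbs evenly up to final_score."""
    return [final_score * t // steps for t in range(1, steps + 1)]


def late_jump_history(final_score, steps, jump_at):
    """Hypothetical history: the score stays low, then jumps at step jump_at."""
    low = final_score // 10
    return [final_score if t >= jump_at else low for t in range(1, steps + 1)]


# Both histories end at the same score, so a final snapshot cannot tell them apart.
gradual = gradual_history(final_score=90, steps=100)
late_jump = late_jump_history(final_score=90, steps=100, jump_at=80)

same_snapshot = gradual[-1] == late_jump[-1]
gradual_step = first_step_reaching(gradual, threshold=50)
late_jump_step = first_step_reaching(late_jump, threshold=50)

print(same_snapshot, gradual_step, late_jump_step)
```

Only the intermediate states distinguish the two histories, which is why archived snapshots matter for any claim about how fast a threshold was reached.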

What data does the analysis rely on?

The paper draws on archived data from five well-known evaluation sources: LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench. Using these archives, the author identifies risks of selective reporting and retroactive benchmark revisions, both of which can distort perceptions of progress.

What does the author propose?

The proposed solution is an archive-and-adjudication protocol: leaderboard states are archived systematically and later adjudicated to reconstruct scoring history and to reject unsubstantiated claims about frontier models. The proposal is directly relevant to EU AI Act requirements for transparency and third-party auditing of frontier models.
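The article does not describe the protocol's mechanics, so the following is only a minimal sketch of what archiving and adjudicating could look like. Everything in it is an assumption: the `Archive` class, the verdict labels, the model names, the scores, and the use of a SHA-256 digest to make later edits to a snapshot detectable.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Hypothetical append-only store of dated leaderboard snapshots."""
    snapshots: list = field(default_factory=list)  # (date, scores, digest)

    def record(self, date, scores):
        """Store a snapshot and return a digest of its contents."""
        payload = json.dumps(scores, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        self.snapshots.append((date, dict(scores), digest))
        return digest

    def adjudicate(self, model, min_score, by_date):
        """Judge the claim 'model scored at least min_score by by_date'.

        Dates are ISO strings (YYYY-MM-DD), so string comparison orders them.
        """
        relevant = [s for d, s, _ in self.snapshots if d <= by_date and model in s]
        if not relevant:
            return "unsubstantiated"  # no archived evidence either way
        if any(s[model] >= min_score for s in relevant):
            return "supported"
        return "contradicted"


archive = Archive()
archive.record("2025-01-01", {"model-a": 61.0})
archive.record("2025-02-01", {"model-a": 70.5, "model-b": 66.0})

early_claim = archive.adjudicate("model-a", 70, by_date="2025-01-15")
later_claim = archive.adjudicate("model-a", 70, by_date="2025-02-01")
missing_claim = archive.adjudicate("model-b", 60, by_date="2025-01-15")
print(early_claim, later_claim, missing_claim)
```

Separating "unsubstantiated" from "contradicted" is a design choice of this sketch: a claim with no archived evidence is treated differently from one the archive actively disputes.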

Frequently Asked Questions

What does this Bayesian framework show?

That a single leaderboard snapshot can correspond to incompatible developmental histories, with convergence times from 23 to 75 steps.

What is the archive-and-adjudication protocol?

A method for reconstructing scoring history and rejecting unsubstantiated claims about frontier model progress.

arXiv:2606.17005: Bayesian Framework for Auditing Reveals That AI Leaderboards Hide Incompatible Histories
