arXiv: Benchmarks Depend on Inference Compute

A new preprint shows that benchmark results depend heavily on the measurement protocol. The authors tested 12 frontier models on 7 demanding benchmarks spanning software, mathematics, medicine, and cybersecurity. A larger token budget significantly improves results on FrontierMath, Humanity's Last Exam, and TerminalBench, and models rank differently depending on the inference compute budget. The authors recommend reporting capability as a function of inference-time compute, not as a single number.

A new preprint warns that benchmark results depend heavily on the measurement protocol, calling into question standard frontier model leaderboards.

What did the authors test?

The paper tests 12 frontier models on 7 demanding benchmarks from software, mathematics, medicine, and cybersecurity. The key variable is inference compute: the amount of computation, i.e. the token budget, that a model is allowed to spend on solving a task. Standard evaluations typically fix that budget, but the paper shows that this choice strongly affects both scores and rankings.

What changes with a larger budget?

A larger token budget significantly improves results on FrontierMath, Humanity’s Last Exam, TerminalBench, and the cybersecurity tests. More importantly, models rank differently depending on that budget: the model that is best with a small budget is not necessarily best with a large one. Because scores keep rising as the budget grows, evaluations run at a single fixed budget systematically underestimate real capabilities.
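The rank reversal described above can be illustrated with a minimal Python sketch. The model names, token budgets, and scores below are made up for illustration and are not taken from the paper; `ranking_at` and `rankings_by_budget` are hypothetical helper names.

```python
# Hypothetical accuracy scores (made-up numbers, NOT results from the paper),
# keyed by model name and then by token budget.
scores = {
    "model_a": {1_000: 0.30, 10_000: 0.42, 100_000: 0.48},
    "model_b": {1_000: 0.22, 10_000: 0.40, 100_000: 0.61},
}


def ranking_at(scores, budget):
    """Return model names sorted from best to worst at one token budget."""
    return sorted(scores, key=lambda model: scores[model][budget], reverse=True)


def rankings_by_budget(scores):
    """Return a {budget: ranking} mapping for every budget in the data.

    Assumes every model was measured at the same set of budgets.
    """
    budgets = sorted(next(iter(scores.values())))
    return {budget: ranking_at(scores, budget) for budget in budgets}


for budget, order in rankings_by_budget(scores).items():
    print(budget, order)
```

With these illustrative numbers, `model_a` leads at the two smaller budgets and `model_b` leads at the largest one, so a leaderboard fixed at any single budget would tell only part of the story.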

Why does this matter for model assessment?

The authors recommend that model capability be reported as a function of inference-time compute, not as a single number. The finding is also relevant for security and policy assessments: evaluating models without controlling for compute budget can produce unreliable and misleading rankings.
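One way to picture this recommendation is to store a model's result as a curve of (token budget, score) points instead of a single score. The sketch below is an assumption about how such a report could be structured, not the authors' format; `CapabilityCurve`, `build_curve`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CapabilityCurve:
    """Scores for one model, reported as a function of token budget."""
    model: str
    points: list  # (token_budget, score) pairs, sorted by budget

    def score_at(self, budget):
        """Score at the largest measured budget not exceeding `budget`.

        Returns None if no measurement fits within the budget.
        """
        eligible = [score for b, score in self.points if b <= budget]
        return eligible[-1] if eligible else None

    def budget_to_reach(self, target):
        """Smallest measured budget whose score meets `target`, else None."""
        for b, score in self.points:
            if score >= target:
                return b
        return None


def build_curve(model, results):
    """Build a curve from a {token_budget: score} mapping."""
    return CapabilityCurve(model, sorted(results.items()))


# Made-up measurements for a hypothetical model (not from the paper).
curve = build_curve("model_x", {10_000: 0.40, 1_000: 0.22, 100_000: 0.61})
print(curve.points)
print(curve.score_at(50_000))
print(curve.budget_to_reach(0.5))
```

Reporting the whole curve lets a reader ask both what a model scores at a given budget and how much budget it needs to reach a given score, which a single number cannot answer.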

Frequently Asked Questions

What is the main finding of the paper?

The results and rankings of frontier models depend significantly on the inference compute budget, so fixed-budget evaluations underestimate capabilities.

What do the authors recommend?

Report model capability as a function of inference-time compute, not as a single number.

arXiv:2606.17930: Benchmark Results Are Protocol-Dependent — Inference Compute Changes Frontier Model Rankings
