UK AISI: Agent Evaluations Must Account for Compute Budget

Editorial illustration: AISI research on AI model safety and capability evaluation with respect to test-time compute

The UK AI Security Institute shows that AI agent evaluations run at a fixed token budget systematically underestimate frontier capabilities. Scaling the budget from one million to ten million tokens raises performance by roughly 25 percent on software-engineering tasks and roughly 22 percent on mathematical and academic tasks. AISI calls on regulators to move from single benchmark scores to capability curves that account for variable compute budgets.

This article was generated using artificial intelligence from primary sources.

The UK AI Security Institute (AISI) has published research with far-reaching implications for regulators and the safety community: standard AI agent evaluations, which measure performance at a fixed token budget, systematically underestimate the actual capabilities of frontier models. Rather than a single benchmark score, agent capability needs to be understood as a curve — a function that tracks how performance changes with the available compute budget.

Agent capability depends on the token budget

The research makes clear that knowing what an agent achieves at a single token budget is not enough; what matters is how that performance changes as the budget grows. On software-engineering tasks, scaling the budget from 1 million to 10 million tokens delivered an improvement of roughly 25 percent. On mathematical and academic tasks, the improvement was around 22 percent. These differences are not negligible: for an individual task, they can mark the gap between a model that cannot complete it and one that solves it reliably.

A particularly telling finding comes from the cybersecurity domain: roughly 8 percent of all tested cyber tasks were solvable only at a budget of 10 million or more tokens. Within standard evaluation frameworks that use lower budgets, those tasks appear unsolvable — giving regulators and security researchers a distorted picture of the real risk that frontier models represent.
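
To make the idea of a capability curve concrete, here is a minimal sketch. The task names and token thresholds are invented for illustration and are not AISI data; `solve_rate` is a hypothetical helper. It shows how the same task set yields a different score at each budget, and how some tasks only become visible at 10 million tokens or more.

```python
# Hypothetical data: the smallest token budget at which each task was solved
# (None = never solved at any tested budget). These are NOT AISI measurements.
min_tokens_to_solve = {
    "task_a": 400_000,
    "task_b": 900_000,
    "task_c": 3_000_000,
    "task_d": 12_000_000,
    "task_e": None,
}

def solve_rate(thresholds, budget):
    """Fraction of tasks whose solve threshold fits within the given budget."""
    solved = sum(1 for t in thresholds.values() if t is not None and t <= budget)
    return solved / len(thresholds)

# One score per budget point instead of a single benchmark number.
budgets = [1_000_000, 10_000_000, 50_000_000]
curve = {b: solve_rate(min_tokens_to_solve, b) for b in budgets}

# Tasks that look unsolvable to any evaluation capped below 10M tokens.
high_budget_only = sorted(
    name for name, t in min_tokens_to_solve.items()
    if t is not None and t >= 10_000_000
)

for budget, rate in curve.items():
    print(f"{budget:>11,} tokens: {rate:.0%} solved")
print("Solvable only at 10M+ tokens:", high_budget_only)
```

An evaluator who stops at the first budget point would report only the first entry of `curve` and would count every task in `high_budget_only` as out of reach.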

What do existing evaluations actually measure?

Standard benchmark tests choose a fixed token budget and measure how many tasks in a set an agent completes successfully. This approach has a fundamental methodological problem: the resulting scores are not comparable across models and do not reliably reveal the true boundaries of capability.

AISI measured that the capability horizon of one frontier model, defined as the longest task the model can reliably complete, grew from 40 minutes to 4 hours when the budget was increased from 2.5 million to 50 million tokens. The budget also affects estimates of the pace of progress: at a 2.5-million-token budget, frontier cyber capabilities are estimated to double every 4.7 months. At a 50-million-token budget, that pace of doubling is as much as 60 percent faster, meaning that the measured developmental trajectory depends on where the evaluation boundary is set.
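
A short arithmetic sketch shows how much that difference compounds. It rests on one assumption that the source does not spell out: that a 60 percent acceleration means 1.6 times as many doublings per month. Both function names are hypothetical.

```python
def doubling_time_after_speedup(doubling_months, rate_increase):
    """Doubling time once the rate (doublings per month) rises by rate_increase."""
    return doubling_months / (1.0 + rate_increase)

def growth_factor(doubling_months, horizon_months):
    """Capability multiple accumulated over horizon_months of steady doubling."""
    return 2.0 ** (horizon_months / doubling_months)

low_budget_doubling = 4.7  # months, the figure reported at a 2.5M-token budget

# Assumption: "60 percent faster" is read as 1.6x as many doublings per month.
high_budget_doubling = doubling_time_after_speedup(low_budget_doubling, 0.60)

for label, months in [("2.5M tokens", low_budget_doubling),
                      ("50M tokens", high_budget_doubling)]:
    print(f"{label}: doubles every {months:.1f} months, "
          f"x{growth_factor(months, 12):.1f} over a year")
```

Under this reading the doubling time at the larger budget comes out at about 2.9 months. A different reading of the 60 percent figure would give a different number, so the result illustrates sensitivity to the budget and is not a reported measurement.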

The compute requirement scales with the time that skilled humans need for the same task, following a power-law relationship with an exponent between 0.7 and 1.0. Tasks that take an expert an hour require millions of tokens; week-long projects require billions.
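
A power law of this kind can be written as tokens = c × hours^alpha, which becomes a straight line on log-log axes. The sketch below recovers the exponent with an ordinary least-squares fit. The data points are synthetic, generated from an assumed exponent of 0.85 and an assumed constant of 2 million tokens per hour-long task, and `fit_power_law` is a hypothetical helper, not AISI's method.

```python
import math

def fit_power_law(hours, tokens):
    """Least-squares fit of log(tokens) = log(c) + alpha * log(hours)."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                       # slope on log-log axes
    c = math.exp(mean_y - alpha * mean_x)   # tokens for a one-hour task
    return c, alpha

# Synthetic data (an assumption for illustration): tokens = 2e6 * hours**0.85
hours = [0.5, 1, 4, 8, 40]
tokens = [2e6 * h ** 0.85 for h in hours]

c, alpha = fit_power_law(hours, tokens)
print(f"fitted exponent: {alpha:.2f}, tokens for a one-hour task: {c:,.0f}")
```

An exponent below 1.0 means the token requirement grows more slowly than human task time, while an exponent of exactly 1.0 means it grows in direct proportion.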

Newer models benefit disproportionately from more compute

The finding that is most concerning from a security perspective is the asymmetry between older and newer models. Newer frontier models consistently benefit more from an increased compute budget, across three dimensions:

  • Reach: they solve harder tasks at the same compute budget
  • Reliability: they succeed more consistently on edge cases and complex scenarios
  • Efficiency: they solve the same task with fewer tokens than earlier generations

This combination means that standardized tests not only underestimate current capabilities but also distort comparisons between model generations. An older model may look competitive at a low budget, while a newer model far surpasses it at the realistic budgets that users run in production. Evaluation frameworks that do not account for this asymmetry systematically misrepresent relative progress.
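
The distortion is easiest to see when two curves are laid side by side. In the sketch below the model names and success rates are invented for illustration, and `gap_by_budget` is a hypothetical helper.

```python
# Hypothetical success rates per token budget for two model generations.
# These numbers are invented and are NOT AISI results.
curves = {
    "older_model": {1_000_000: 0.42, 10_000_000: 0.47, 50_000_000: 0.49},
    "newer_model": {1_000_000: 0.45, 10_000_000: 0.66, 50_000_000: 0.81},
}

def gap_by_budget(curves, newer, older):
    """Newer-minus-older success rate at every budget both models were run at."""
    shared = sorted(set(curves[newer]) & set(curves[older]))
    return {b: round(curves[newer][b] - curves[older][b], 2) for b in shared}

gaps = gap_by_budget(curves, "newer_model", "older_model")
for budget, gap in gaps.items():
    print(f"{budget:>11,} tokens: newer model ahead by {gap:+.2f}")
```

With these made-up numbers, a comparison run only at the lowest budget would report the two generations as nearly tied, while the same comparison at the highest budget shows a gap roughly ten times larger.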

Regulatory implications of fixed budgets

AISI explicitly warns of a structural problem with direct policy implications. Risk assessments based on a fixed budget do not measure what they claim to measure — they systematically miss high-value, high-risk capabilities that only become accessible at higher compute levels. Single-budget evaluation can lead to unequal comparisons between models, cause decision-makers to underestimate agents, and conceal the true scale of risk.

Organizations that draft regulatory frameworks for AI — from national governments to international bodies — must recognize that a model’s benchmark score is not a fixed quantity. It is a function of the compute budget the evaluator set. Without an explicit specification of that budget, comparisons between models are methodologically unreliable.

AISI proposes a shift to a capability-curve approach: measure performance across a range of budget points, identify reach, reliability, and efficiency profiles for each model, and only then draw conclusions about risk. For security teams, the implication is clear: a model that showed no capability for a particular class of attack during evaluation may simply have that capability beyond the evaluator’s budget boundary.
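
One way such a profile could be computed from raw evaluation runs is sketched below. The metric definitions are assumptions made for this example and are not taken from AISI: reach is the longest task (in human minutes) solved in at least 80 percent of runs, reliability is the mean per-task success rate, and efficiency is the median token cost of a successful run. The run records are invented.

```python
from statistics import mean, median

# Hypothetical run records: (task, human_minutes, tokens_used, succeeded)
runs = [
    ("parse_log", 10, 200_000, True),
    ("parse_log", 10, 250_000, True),
    ("patch_bug", 60, 900_000, True),
    ("patch_bug", 60, 1_000_000, False),
    ("exploit_chain", 240, 8_000_000, True),
    ("exploit_chain", 240, 9_000_000, True),
]

def profile(runs, budget, reliable_at=0.8):
    """Summarise runs under a token budget (metric definitions are assumptions).

    Simplification: a run that used more tokens than the budget is counted
    as a failure at that budget.
    """
    per_task = {}
    for task, minutes, tokens, ok in runs:
        entry = per_task.setdefault(
            task, {"minutes": minutes, "outcomes": [], "costs": []}
        )
        solved = ok and tokens <= budget
        entry["outcomes"].append(1 if solved else 0)
        if solved:
            entry["costs"].append(tokens)
    rates = {t: mean(d["outcomes"]) for t, d in per_task.items()}
    reliable = [per_task[t]["minutes"] for t, r in rates.items() if r >= reliable_at]
    costs = [c for d in per_task.values() for c in d["costs"]]
    return {
        "reach_minutes": max(reliable, default=0),
        "reliability": mean(list(rates.values())),
        "median_tokens_per_success": median(costs) if costs else None,
    }

low = profile(runs, budget=1_000_000)
high = profile(runs, budget=10_000_000)
print("1M budget: ", low)
print("10M budget:", high)
```

In this toy data the four-hour task never appears in the low-budget profile at all, which is the blind spot that a single-budget evaluation leaves.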

Frequently Asked Questions

What is test-time compute and why does it matter for evaluation?
Test-time compute is the amount of computational resources, measured in tokens, that an AI agent uses when solving a task. AISI shows that a larger budget directly raises performance, so capability should be measured as a curve, not as a single benchmark score.

How much improvement does a ten-times larger token budget deliver?
Scaling the budget from 1M to 10M tokens delivers around a 25 percent improvement on software-engineering tasks and around 22 percent on mathematical and academic tasks, according to AISI measurements.

Why does this matter for regulatory bodies?
Risk assessments based on a fixed budget structurally underestimate actual model capabilities. Newer models benefit disproportionately from additional compute, meaning standardized tests can create a false sense of security.