🤖 24 AI
🟡 🏥 In Practice Saturday, April 18, 2026 · 3 min read

Anthropic: infrastructure noise shifts agentic benchmark results by up to 6 percentage points

Why it matters

Researchers at Anthropic have demonstrated that RAM allocation and CPU headroom can shift agentic coding benchmark results by up to 6 percentage points — more than the difference between top models on the leaderboard. The effect was measured on Terminal-Bench 2.0 and cross-checked on SWE-bench. Recommendation: leads below 3 percentage points warrant skepticism until the eval configuration is documented and matched.

A team of Anthropic researchers led by Gian Segato, with contributions from Nicholas Carlini, Jeremy Hadfield, Mike Merrill and Alex Shaw, published a detailed study on April 17, 2026 titled “Quantifying Infrastructure Noise in Agentic Coding Evals”. The findings reveal a serious methodological issue affecting the interpretation of nearly every agentic coding benchmark.

Main finding

Infrastructure configuration — specifically the amount of allocated RAM and CPU headroom — can shift agentic coding benchmark results by up to 6 percentage points. That is more than the current difference between top models on major leaderboards.

The researchers make a direct claim: “The gap between the most- and least-resourced setups on Terminal-Bench 2.0 was 6 percentage points (p < 0.01).”

Benchmarks tested

The study used two standard tests:

  1. Terminal-Bench 2.0 — primary focus, measuring agentic coding ability in a terminal environment
  2. SWE-bench — cross-validation on 227 tasks

The results are asymmetric: Terminal-Bench 2.0 shows a strong effect (6 pp), while SWE-bench is less sensitive (1.54 pp across a 5x RAM variation). This suggests the specific structure of tasks and tools affects how “noisy” a benchmark is.

Strict capping makes the problem worse

The intuition might be: “Just give everyone the same resources and solve it.” But the data show the opposite:

  • Strict capping (exact fixed value for all): infra error rate 5.8%
  • Uncapped resources (unlimited): infra error rate 0.5%

In other words, strict uniformity increases noise rather than reducing it, because edge-case tasks that exceed the hard limit fail for infrastructure reasons rather than model error.

Sweet spot: 3x resource headroom. That design reduces infra errors to 2.1% (p < 0.001) while maintaining result consistency. The idea is that each task gets a “floor” (guaranteed allocation) and a “ceiling” (kill threshold), instead of a single pinned number.
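The floor/ceiling idea can be sketched in a few lines. A hypothetical policy, assuming the 3x multiplier reported in the study; the function names and MB figures here are illustrative, not taken from Anthropic’s code:

```python
# Hypothetical floor/ceiling resource policy (illustrative sketch):
# guarantee each task a minimum allocation, and kill it only above a
# headroom multiple of that floor, instead of pinning one exact value.

HEADROOM = 3.0  # ~3x headroom, the sweet spot reported in the study


def resource_limits(floor_mb: float) -> tuple[float, float]:
    """Return (guaranteed_mb, kill_threshold_mb) for one task."""
    return floor_mb, floor_mb * HEADROOM


def should_kill(usage_mb: float, floor_mb: float) -> bool:
    """Kill only when usage exceeds the ceiling, not the floor."""
    _, ceiling = resource_limits(floor_mb)
    return usage_mb > ceiling


print(resource_limits(2048))    # (2048, 6144.0)
print(should_kill(5000, 2048))  # False: within the 3x headroom
print(should_kill(7000, 2048))  # True: exceeded the kill threshold
```

Under this scheme a memory-hungry edge-case task can briefly spike above its floor without being killed, which is exactly what a single pinned cap forbids.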

Noise floor and leaderboard interpretation

The sharpest message from the authors is aimed at the AI community that comments on small model differences:

“Leaderboard differences below 3 percentage points deserve skepticism until the eval configuration is documented and matched.”

The reason is statistical: binomial confidence intervals already span 1–2 percentage points, independent of any infrastructure effect. Add an infrastructure confounder of up to 6 pp on top, and the total measurement uncertainty approaches 8 pp in the worst cases.
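The scale of the purely statistical part is easy to estimate. A minimal sketch using the normal approximation to the binomial; the function name and suite sizes are illustrative (227 is the SWE-bench task count cited above, the others are hypothetical):

```python
import math


def score_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a confidence interval on a
    benchmark pass rate.

    p: observed pass rate, n: number of tasks, z: z-score (1.96 -> 95%).
    """
    return z * math.sqrt(p * (1 - p) / n)


# The interval tightens as the suite grows: sampling noise alone can
# rival or exceed typical leaderboard gaps on small suites.
for n in (100, 227, 1000):
    print(f"n={n}: +/-{100 * score_ci_half_width(0.5, n):.1f} pp")
```

This is why suite size matters as much as infrastructure control: two runs of the same model on a small suite can differ by several points before any hardware effect enters the picture.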

Five concrete recommendations

The researchers conclude with a concrete list for evaluators:

  1. Specify both a guaranteed allocation and a hard kill threshold per task (not a single pinned value)
  2. Calibrate the gap so that floor and ceiling scores fall within statistical noise
  3. Explicitly report the enforcement methodology
  4. Document resource specifications as first-class experimental variables
  5. Run evaluations across multiple days to average out temporal noise (API latency, cluster health variations)

Why this matters for the industry

The authors’ core conclusion: “A 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware, or even at a luckier time of day.”

For the AI community, this means publishing results will require more structured infrastructure documentation. Benchmark results published without precise RAM, CPU, API, and time-window configuration — which is most of them — carry noise that can completely bury real differences in model quality.

Anthropic’s work arrives at a moment when differences between models are measured in single percentage points, and marketing presents those differences as revolutionary. The study shows why significantly more caution is needed here.

🤖

This article was generated using artificial intelligence from primary sources.