🤖 24 AI
🟡 🏥 In Practice Saturday, April 18, 2026 · 3 min read

Anthropic: infrastructure noise shifts agentic benchmark results by up to 6 percentage points

Why it matters

Researchers at Anthropic have demonstrated that RAM allocation and CPU headroom can shift agentic coding benchmark results by up to 6 percentage points — more than the difference between top models on the leaderboard. The effect was measured on Terminal-Bench 2.0 and cross-checked on SWE-bench. Recommendation: leads below 3 percentage points warrant skepticism until the eval configuration is documented and matched.

A team of Anthropic researchers led by Gian Segato, with contributions from Nicholas Carlini, Jeremy Hadfield, Mike Merrill and Alex Shaw, published a detailed study on April 17, 2026 titled “Quantifying Infrastructure Noise in Agentic Coding Evals”. The findings reveal a serious methodological issue affecting the interpretation of nearly every agentic coding benchmark.

Main finding

Infrastructure configuration — specifically the amount of allocated RAM and CPU headroom — can shift agentic coding benchmark results by up to 6 percentage points. That is more than the current difference between top models on major leaderboards.

The researchers make a direct claim: “The gap between the most- and least-resourced setups on Terminal-Bench 2.0 was 6 percentage points (p < 0.01).”

Benchmarks tested

The study used two standard tests:

  1. Terminal-Bench 2.0 — primary focus, measuring agentic coding ability in a terminal environment
  2. SWE-bench — cross-validation on 227 tasks

The results are asymmetric: Terminal-Bench 2.0 shows a strong effect (6 pp), while SWE-bench is less sensitive (1.54 pp across a 5x RAM variation). This suggests the specific structure of tasks and tools affects how “noisy” a benchmark is.

Strict capping makes the problem worse

The intuition might be: “Just give everyone the same resources and solve it.” But the data show the opposite:

  • Strict capping (exact fixed value for all): infra error rate 5.8%
  • Uncapped resources (unlimited): infra error rate 0.5%

In other words, strict uniformity increases noise rather than reducing it, because edge-case tasks that exceed the hard limit fail for infrastructure reasons rather than model error.

Sweet spot: 3x resource headroom. That design reduces infra errors to 2.1% (p < 0.001) while maintaining result consistency. The idea is that each task gets a “floor” (guaranteed allocation) and a “ceiling” (kill threshold), instead of a single pinned number.
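The floor/ceiling idea can be sketched in a few lines. A hypothetical policy, assuming the 3x multiplier reported in the study; the function names and MB figures here are illustrative, not taken from Anthropic’s code:

```python
# Hypothetical floor/ceiling resource policy (illustrative sketch):
# guarantee each task a minimum allocation, and kill it only above a
# headroom multiple of that floor, instead of pinning one exact value.

HEADROOM = 3.0  # ~3x headroom, the sweet spot reported in the study


def resource_limits(floor_mb: float) -> tuple[float, float]:
    """Return (guaranteed_mb, kill_threshold_mb) for one task."""
    return floor_mb, floor_mb * HEADROOM


def should_kill(usage_mb: float, floor_mb: float) -> bool:
    """Kill only when usage exceeds the ceiling, not the floor."""
    _, ceiling = resource_limits(floor_mb)
    return usage_mb > ceiling


print(resource_limits(2048))    # (2048, 6144.0)
print(should_kill(5000, 2048))  # False: within the 3x headroom
print(should_kill(7000, 2048))  # True: exceeded the kill threshold
```

Under this scheme a memory-hungry edge-case task can briefly spike above its floor without being killed, which is exactly what a single pinned cap forbids.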

Noise floor and leaderboard interpretation

The sharpest message from the authors is aimed at the AI community that comments on small model differences:

“Leaderboard differences below 3 percentage points deserve skepticism until the eval configuration is documented and matched.”

The reason is statistical: binomial confidence intervals already span 1–2 percentage points, independent of any infrastructure effect. Add an infrastructure confounder of up to 6 pp on top, and the total measurement uncertainty approaches 8 pp in the worst cases.
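The scale of the purely statistical part is easy to estimate. A minimal sketch using the normal approximation to the binomial; the function name and suite sizes are illustrative (227 is the SWE-bench task count cited above, the others are hypothetical):

```python
import math


def score_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a confidence interval on a
    benchmark pass rate.

    p: observed pass rate, n: number of tasks, z: z-score (1.96 -> 95%).
    """
    return z * math.sqrt(p * (1 - p) / n)


# The interval tightens as the suite grows: sampling noise alone can
# rival or exceed typical leaderboard gaps on small suites.
for n in (100, 227, 1000):
    print(f"n={n}: +/-{100 * score_ci_half_width(0.5, n):.1f} pp")
```

This is why suite size matters as much as infrastructure control: two runs of the same model on a small suite can differ by several points before any hardware effect enters the picture.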

Five concrete recommendations

The researchers conclude with a concrete list for evaluators:

  1. Specify both a guaranteed allocation and a hard kill threshold per task (not a single pinned value)
  2. Calibrate the gap so that floor and ceiling scores fall within statistical noise
  3. Explicitly report the enforcement methodology
  4. Document resource specifications as first-class experimental variables
  5. Run evaluations across multiple days to average out temporal noise (API latency, cluster health variations)

Why this matters for the industry

The authors’ core conclusion: “A 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware, or even at a luckier time of day.”

For the AI community, this means publishing results will require more structured infrastructure documentation. Benchmark results published without precise RAM, CPU, API, and time-window configuration — which is most of them — carry noise that can completely bury real differences in model quality.

Anthropic’s work arrives at a moment when differences between models are measured in single percentage points, and marketing presents those differences as revolutionary. The study shows why significantly more caution is needed here.

🤖

This article was generated using artificial intelligence from primary sources.