Anthropic: infrastructure noise shifts agentic benchmark results by up to 6 percentage points
Why it matters
Researchers at Anthropic have demonstrated that RAM allocation and CPU headroom alone can shift agentic coding benchmark results by 6 percentage points, more than the gap separating the top models on current leaderboards. The tests covered Terminal-Bench 2.0 and SWE-bench. The recommendation: leaderboard leads below 3 percentage points warrant skepticism until the eval configuration is documented and matched.
A team of Anthropic researchers led by Gian Segato, with contributions from Nicholas Carlini, Jeremy Hadfield, Mike Merrill, and Alex Shaw, published a detailed study, “Quantifying Infrastructure Noise in Agentic Coding Evals”, on April 17, 2026. The findings expose a serious methodological issue affecting the interpretation of nearly every AI benchmark.
Main finding
Infrastructure configuration — specifically the amount of allocated RAM and CPU headroom — can shift agentic coding benchmark results by 6 percentage points. That is more than the current difference between top models on major leaderboards.
The researchers make a direct claim: “The gap between the most- and least-resourced setups on Terminal-Bench 2.0 was 6 percentage points (p < 0.01).”
Benchmarks tested
The study used two standard tests:
- Terminal-Bench 2.0 — primary focus, measuring agentic coding ability in a terminal environment
- SWE-bench — cross-validation on 227 tasks
The results are asymmetric: Terminal-Bench 2.0 shows a strong effect (6 pp), while SWE-bench is less sensitive (1.54 pp across a 5x RAM variation). This suggests the specific structure of tasks and tools affects how “noisy” a benchmark is.
Strict capping makes the problem worse
The intuitive fix might be: “just give everyone exactly the same resources.” But the data show the opposite:
- Strict capping (exact fixed value for all): infra error rate 5.8%
- Uncapped resources (unlimited): infra error rate 0.5%
In other words, strict uniformity increases noise rather than reducing it, because edge-case tasks that exceed the limit fail outright.
The sweet spot is 3x resource headroom. That design cuts infra errors to 2.1% (p < 0.001) while keeping results consistent. Each task gets a “floor” (a guaranteed allocation) and a “ceiling” (a hard kill threshold) instead of a single pinned number.
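The floor-and-ceiling idea can be sketched in a few lines of Python. This is a hypothetical enforcement policy, not Anthropic's actual harness; the class, field names, and the 2 GiB floor are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ResourcePolicy:
    """Floor-and-ceiling allocation instead of a single pinned value."""
    floor_mb: int          # guaranteed allocation for every task
    headroom: float = 3.0  # ceiling = floor * headroom (the study's sweet spot)

    @property
    def ceiling_mb(self) -> float:
        return self.floor_mb * self.headroom

    def check(self, usage_mb: float) -> str:
        """Classify a task's observed memory usage against the policy."""
        if usage_mb <= self.floor_mb:
            return "within-floor"  # guaranteed region: always safe
        if usage_mb <= self.ceiling_mb:
            return "burst"         # allowed headroom: runs, but worth logging
        return "kill"              # past the hard threshold: terminate

policy = ResourcePolicy(floor_mb=2048)  # hypothetical 2 GiB floor -> 6 GiB ceiling
print(policy.ceiling_mb)   # 6144.0
print(policy.check(1500))  # within-floor
print(policy.check(5000))  # burst
print(policy.check(7000))  # kill
```

The point of the three-way classification is that only the "kill" outcome fails a task; "burst" tasks are the ones a strictly uniform cap would have killed.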
Noise floor and leaderboard interpretation
The sharpest message from the authors is aimed at anyone in the AI community who reads meaning into small differences between models:
“Leaderboard differences below 3 percentage points deserve skepticism until the eval configuration is documented and matched.”
The reason is statistical: binomial confidence intervals already span 1–2 percentage points before any infrastructure effect. Add up to 6 pp of infrastructure confounding on top, and the natural measurement uncertainty approaches 8 pp in the worst cases.
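The binomial part of that arithmetic is easy to reproduce. The pass rate and task counts below are illustrative assumptions, not figures from the study (227 is the SWE-bench task count mentioned above):

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation half-width of a binomial pass-rate estimate."""
    return z * math.sqrt(p * (1 - p) / n)

# A 60% pass rate measured over a few thousand task attempts (e.g. repeated
# runs) already carries ~2 pp of uncertainty before any infrastructure effect:
print(round(100 * ci_half_width(0.60, 2400), 2))  # 1.96 pp
# A single run over a few hundred tasks is far noisier:
print(round(100 * ci_half_width(0.60, 227), 2))   # 6.37 pp
```

This is why repeated runs matter: the statistical half-width shrinks with the square root of the number of attempts, but the infrastructure confound does not shrink at all unless the configuration is matched.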
Five concrete recommendations
The researchers conclude with a concrete list for evaluators:
- Specify both a guaranteed allocation and a hard kill threshold per task (not a single pinned value)
- Calibrate the gap so that floor and ceiling scores fall within statistical noise
- Explicitly report the enforcement methodology
- Document resource specifications as first-class experimental variables
- Run evaluations across multiple days to average out temporal noise (API latency, cluster health variations)
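The first and fourth recommendations can be made concrete as a result record that carries its own resource spec. All field names and values here are hypothetical illustrations, not a schema from the study:

```python
import json

# Hypothetical sketch: publish the resource spec next to the score, so two
# leaderboard entries can be checked for a matched eval configuration.
resource_spec = {
    "ram_floor_mb": 2048,        # guaranteed allocation per task
    "ram_kill_mb": 6144,         # hard kill threshold (3x headroom)
    "cpu_floor_cores": 2,
    "cpu_kill_cores": 6,
    "enforcement": "cgroup-v2",  # how the limits are actually applied
    "run_windows_utc": ["2026-04-10", "2026-04-11", "2026-04-12"],  # multi-day runs
}
result = {
    "benchmark": "Terminal-Bench 2.0",
    "pass_rate": 0.58,
    "resources": resource_spec,  # resources as a first-class experimental variable
}
print(json.dumps(result, indent=2))
```

Two published results whose `resources` blocks differ are, by the study's argument, not directly comparable below the ~3 pp noise floor.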
Why this matters for the industry
The authors’ core conclusion: “A 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware, or even at a luckier time of day.”
For the AI community, this means that publishing results must come with structured infrastructure documentation. Benchmarks published without a precise RAM, CPU headroom, and time-window configuration (which is most of them) carry noise that can completely bury nominal differences in model quality.
Anthropic’s work arrives at a moment when differences between models are measured in single percentage points, and marketing presents those differences as revolutionary. The study shows why significantly more caution is needed here.
This article was generated using artificial intelligence from primary sources.