🟡 🤖 Models · Saturday, May 2, 2026 · 3 min read

NIST CAISI evaluation of DeepSeek V4 Pro: 8-month lag behind frontier US models across 9 benchmarks in 5 domains

Editorial illustration: a scale weighing AI models above a geopolitical map

The Center for AI Standards and Innovation at NIST (CAISI) has published an independent evaluation of the Chinese model DeepSeek V4 Pro across 9 benchmarks in 5 domains (cybersecurity, software engineering, natural sciences, abstract reasoning, mathematics). Key finding: V4 Pro lags roughly 8 months behind frontier US models, particularly on reasoning and agentic tasks that DeepSeek did not include in its own technical report. Its cost of use is lower than GPT-5.4 mini's in 5 of 7 tests.

The Center for AI Standards and Innovation at NIST (CAISI) has published an independent evaluation of the Chinese model DeepSeek V4 Pro that for the first time quantifies the gap between China’s strongest frontier system and US models. The result: V4 Pro is currently the most capable Chinese model CAISI has evaluated, but remains approximately 8 months behind frontier US models on most benchmarks. The evaluation specifically focused on reasoning tasks and agentic scenarios that DeepSeek did not include in its own technical report.

Which benchmarks were tested?

CAISI conducted testing across 9 benchmarks in 5 domains:

  • Cybersecurity: CTF-Archive-Diamond
  • Software engineering: SWE-Bench Verified, PortBench
  • Natural sciences: FrontierScience, GPQA-Diamond
  • Abstract reasoning: ARC-AGI-2 semi-private
  • Mathematics: OTIS-AIME-2025, PUMaC 2024, SMT 2025

The set includes held-out evaluations (PortBench, ARC-AGI-2 semi-private) that DeepSeek did not have in its own paper, enabling an independent check of generalization beyond benchmarks developed by the Chinese team.

How large is the actual gap between models?

Concrete results reveal an uneven distribution:

  • CTF-Archive-Diamond: GPT-5.5 71%, Opus 4.6 46%, DeepSeek V4 32%, GPT-5.4 mini 32%
  • SWE-Bench Verified: GPT-5.5 81%, Opus 4.6 79%, DeepSeek V4 74%, GPT-5.4 mini 73%
  • PortBench: GPT-5.5 78%, Opus 4.6 60%, DeepSeek V4 44%, GPT-5.4 mini 41%
  • ARC-AGI-2 semi-private: GPT-5.5 79%, Opus 4.6 63%, DeepSeek V4 46%
  • GPQA-Diamond: GPT-5.5 96%, Opus 4.6 91%, DeepSeek V4 90%, GPT-5.4 mini 87%

DeepSeek V4 comes closest to the frontier on GPQA-Diamond (only 6 percentage points behind GPT-5.5) and SWE-Bench Verified (7 points behind), but on CTF-Archive (cybersecurity) and PortBench (held-out SWE) the gap widens to 30+ percentage points. CAISI estimates this distribution corresponds to an 8-month lag on average, with a larger gap on tasks requiring multi-step reasoning and agentic capabilities.
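The "gap to frontier" figures quoted above are simple percentage-point differences against the best score on each benchmark. A minimal sketch of that calculation, using only the scores from the results list above (model names and numbers as reported; the helper function is illustrative, not from CAISI):

```python
# Scores (%) taken from the CAISI results listed above.
scores = {
    "CTF-Archive-Diamond":    {"GPT-5.5": 71, "Opus 4.6": 46, "DeepSeek V4": 32},
    "SWE-Bench Verified":     {"GPT-5.5": 81, "Opus 4.6": 79, "DeepSeek V4": 74},
    "PortBench":              {"GPT-5.5": 78, "Opus 4.6": 60, "DeepSeek V4": 44},
    "ARC-AGI-2 semi-private": {"GPT-5.5": 79, "Opus 4.6": 63, "DeepSeek V4": 46},
    "GPQA-Diamond":           {"GPT-5.5": 96, "Opus 4.6": 91, "DeepSeek V4": 90},
}

def gap_to_frontier(bench: dict, model: str = "DeepSeek V4") -> int:
    """Percentage-point gap between `model` and the benchmark's top score."""
    return max(bench.values()) - bench[model]

gaps = {name: gap_to_frontier(bench) for name, bench in scores.items()}
# Closest: GPQA-Diamond (6 pp); widest: CTF-Archive-Diamond (39 pp).
```

This makes the uneven distribution explicit: the gap ranges from 6 points (GPQA-Diamond) up to 39 points (CTF-Archive-Diamond).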

What about costs?

The cost analysis shows that DeepSeek V4 Pro is cheaper than GPT-5.4 mini in 5 of 7 tests, ranging from 53% cheaper to 41% more expensive depending on the benchmark. In other words, while V4 lags in quality, it sends a concrete economic signal: for organizations optimizing cost per task on workloads where an 8-month lag doesn't matter, V4 is a realistic option.
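The "53% cheaper to 41% more expensive" range is a relative per-task cost comparison against GPT-5.4 mini. A minimal sketch of that arithmetic, with hypothetical per-task costs for illustration (CAISI's report gives percentages, not the dollar figures used here):

```python
def relative_cost(v4_cost: float, mini_cost: float) -> float:
    """Relative cost of V4 vs GPT-5.4 mini: negative means V4 is cheaper."""
    return (v4_cost - mini_cost) / mini_cost

# Hypothetical per-task costs (USD) chosen to reproduce the reported extremes.
print(f"{relative_cost(0.47, 1.00):+.0%}")  # -53%: V4 is 53% cheaper
print(f"{relative_cost(1.41, 1.00):+.0%}")  # +41%: V4 is 41% more expensive
```

The sign convention matters when aggregating across benchmarks: a model can be cheaper on most tests yet more expensive on the ones that dominate an organization's actual workload.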

CAISI also confirms that DeepSeek's technical report emphasized benchmarks where V4 looked "roughly on par with frontier US models," while its weaker position on ARC-AGI-2 semi-private, PortBench, and CTF-Archive went unmentioned. This is an example of why independent government evaluations matter — they provide context for marketing-driven self-reported results.

Frequently Asked Questions

How far behind Western models is DeepSeek V4 Pro?
Approximately 8 months according to CAISI's assessment. Concrete examples: on CTF-Archive-Diamond, V4 achieves 32% versus GPT-5.5 at 71%, and on ARC-AGI-2 semi-private, 46% versus GPT-5.5 at 79% and Opus 4.6 at 63%.
Which 9 benchmarks were tested?
CTF-Archive-Diamond (cybersecurity), SWE-Bench Verified and PortBench (software engineering), FrontierScience and GPQA-Diamond (natural sciences), ARC-AGI-2 semi-private (abstract reasoning), OTIS-AIME-2025, PUMaC 2024, SMT 2025 (mathematics).
What is the cost comparison?
DeepSeek V4 Pro is cheaper than GPT-5.4 mini in 5 of 7 tests, ranging from 53% cheaper to 41% more expensive depending on the benchmark.
🤖

This article was generated using artificial intelligence from primary sources.