🤖 Models · Tuesday, May 5, 2026 · 2 min read

NIST CAISI: DeepSeek V4 Pro is the most capable Chinese AI model to date, but trails US frontier by 8 months

Editorial illustration: AI model on a timeline marking an 8-month gap, symbolizing an independent evaluation

On May 1, 2026, the US Center for AI Standards and Innovation (CAISI) at NIST published an independent evaluation of the DeepSeek V4 Pro model. Conclusion: it is the most capable PRC AI model evaluated to date, but lags the US frontier by approximately eight months in aggregate capability. The evaluation used non-public benchmarks across five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics.

🤖 This article was generated using artificial intelligence from primary sources.

On May 1, 2026, the US Center for AI Standards and Innovation (CAISI) at the National Institute of Standards and Technology (NIST) published an independent evaluation of the Chinese model DeepSeek V4 Pro. The result: the model is the most capable PRC AI system evaluated to date, but trails the US frontier by approximately eight months in aggregate capability.

How was the evaluation conducted?

CAISI applied non-public benchmarks across five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics. The use of unpublished benchmark sets is a methodological choice to prevent contamination — if a benchmark is not public, the model cannot encounter it during training, so results reflect actual capabilities rather than memorization.

As a consequence, CAISI's results show a significantly larger gap than DeepSeek's own self-reported numbers. This is an expected pattern in the industry: public benchmarks are subject to contamination, while private ones provide realistic estimates for frontier models. The difference between public and private results reveals how much a lab's model has "trained on the test."

What are the concrete findings on price?

Although it trails technically, DeepSeek V4 Pro is cheaper than GPT-5.4 mini on five of the seven test sets; across sets, its price ranges from 53% lower to 41% higher, depending on the domain and task specifics. This cost advantage partially offsets the technical lag for real-world workloads where cost per correct answer is the key metric.

For enterprise buyers evaluating multi-cloud strategies or seeking model diversification, this cost profile makes DeepSeek V4 Pro a rational second model — not as a primary flagship, but as a cheaper alternative for tasks that do not require absolute peak capability.
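As a back-of-the-envelope illustration of the "cost per correct answer" metric mentioned above, a buyer can divide a model's price per task by its accuracy on that task. All prices and accuracy figures below are hypothetical placeholders, not numbers from the CAISI report:

```python
# Hypothetical sketch: comparing two models by expected cost per correct answer.
# None of these prices or accuracy figures come from the CAISI evaluation.

def cost_per_correct(price_per_task: float, accuracy: float) -> float:
    """Expected spend, in dollars, to obtain one correct answer.

    price_per_task: average cost of a single query.
    accuracy: fraction of tasks answered correctly, in (0, 1].
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return price_per_task / accuracy

# A cheaper but slightly weaker model can still win on this metric
# if its accuracy deficit is small relative to its price advantage.
frontier = cost_per_correct(price_per_task=0.010, accuracy=0.80)   # pricier, stronger
challenger = cost_per_correct(price_per_task=0.005, accuracy=0.72) # cheaper, weaker

print(f"frontier:   ${frontier:.4f} per correct answer")
print(f"challenger: ${challenger:.4f} per correct answer")
```

Under these made-up numbers the cheaper model delivers correct answers at roughly half the cost, which is the kind of trade-off that makes a technically trailing model rational for cost-sensitive workloads.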

What does this mean for China’s AI market position?

The CAISI evaluation is the first official US government document to quantify the US-PRC AI gap in months rather than qualitatively. An eight-month gap is significant but not insurmountable, and the trend shows DeepSeek closing it: V3 trailed by approximately 12–14 months, V4 Pro by eight.

The broader policy message of the CAISI paper: the US lead is real, but not static. Export controls, Chinese GPU stockpiling despite US restrictions, and the quality of Chinese open-weight models (Qwen, DeepSeek) together make the domestic Chinese alternative increasingly hard to dismiss as merely "near-frontier."

The evaluation is available at nist.gov and was updated on May 2, 2026.

Frequently Asked Questions

How far does DeepSeek V4 Pro trail US frontier models?
By approximately 8 months in aggregate capabilities, according to the independent CAISI evaluation using non-public benchmarks. This is a significantly larger gap than DeepSeek's own self-reported results suggest.
Which domains were tested?
Five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics. CAISI uses non-public benchmarks so that results are not contaminated by a model's training data.
What is the price-to-performance ratio?
DeepSeek V4 Pro is cheaper than GPT-5.4 mini on five of the seven tested sets, with prices ranging from 53% lower to 41% higher depending on the domain. The cost advantage partially offsets the technical gap.