NVIDIA: Software Stack on Blackwell Cut DeepSeek V4 Token Cost Five-Fold in One Month
NVIDIA explains how stacked software optimizations on the Blackwell architecture — from NVFP4 precision to speculative decoding — achieve up to 20× greater throughput and a five-fold reduction in token cost for DeepSeek V4 models.
This article was generated using artificial intelligence from primary sources.
NVIDIA has published a detailed breakdown of the software optimizations that, on the Blackwell architecture (specifically GB300 NVL72 and GB200 NVL72 systems), dramatically lower inference costs. The central figure: the token cost for DeepSeek V4 fell five-fold within one month, entirely through stacked software improvements, with no changes to the model itself.
Why Software, Not Just Hardware?
Blackwell brought substantially more raw compute than Hopper, but hardware alone does not turn that capacity into delivered performance. The inference stack must exploit every level of the system at once: computation precision, network topology, serving method, and token generation approach. NVIDIA describes a stacking approach in which each technique brings an improvement on its own, but the largest effect comes from combining them.
Four Techniques That Build Up to 20× Throughput
The foundation consists of four techniques that together achieve up to 20× greater throughput per GPU:
Disaggregated serving separates the prefill and decode phases of inference onto distinct hardware resources. The prefill phase, which processes the input prompt, and the decode phase, which generates tokens, have different hardware utilization characteristics — separating them allows each resource to operate in its optimal regime.
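The split between the two phases can be sketched in a few lines. This is a toy illustration only, not NVIDIA's implementation: the `Request` class, the `prefill`, `decode`, and `serve` functions, and the placeholder "model" are all hypothetical names introduced here, and the hand-off between pools stands in for a KV cache transfer between separate GPU groups.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)
    output_tokens: list = field(default_factory=list)


def prefill(request):
    """Prefill pool: process the whole prompt in one pass and build the KV cache."""
    # Stand-in for real attention state: one cache entry per prompt token.
    request.kv_cache = [("kv", token) for token in request.prompt_tokens]
    return request


def decode(request):
    """Decode pool: generate one token per step, reusing the transferred cache."""
    for _ in range(request.max_new_tokens):
        # Placeholder "model": the next token is simply the current cache length.
        next_token = len(request.kv_cache)
        request.output_tokens.append(next_token)
        request.kv_cache.append(("kv", next_token))
    return request


def serve(requests):
    """Run each request through the prefill pool, then hand it to the decode pool."""
    decode_queue = [prefill(r) for r in requests]  # hand-off = KV cache transfer
    return [decode(r) for r in decode_queue]


done = serve([Request(request_id=0, prompt_tokens=[1, 2, 3], max_new_tokens=2)])
print(done[0].output_tokens)
```

In a real deployment the two functions would run on different hardware and be scaled independently, which is the point of the technique: the prompt-processing pool and the token-generation pool can each be sized and batched for their own bottleneck.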
Large expert parallelism over NVLink lets mixture-of-experts (MoE) models such as DeepSeek V4 spread their experts across many GPUs, with NVLink bandwidth keeping the communication overhead of routing tokens between those GPUs to a minimum. The GB300 NVL72 and GB200 NVL72 systems have particularly high NVLink bandwidth designed specifically for such deployments.
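To make the communication pattern concrete, the following toy sketch shards experts evenly across GPUs and counts how many token-to-expert dispatches cross a GPU boundary. The function names, the contiguous sharding scheme, and the routing data are assumptions made for illustration; they do not describe DeepSeek V4's router or NVIDIA's kernels.

```python
def expert_to_gpu(expert_id, num_experts, num_gpus):
    """Map an expert to the GPU that hosts it, assuming contiguous even sharding."""
    if num_experts % num_gpus != 0:
        raise ValueError("this sketch assumes experts divide evenly across GPUs")
    experts_per_gpu = num_experts // num_gpus
    return expert_id // experts_per_gpu


def dispatch(token_routes, num_experts, num_gpus):
    """Group (token, expert) pairs by destination GPU and count remote sends.

    token_routes is a list of (home_gpu, [expert_ids]) tuples, one per token.
    """
    per_gpu = {gpu: [] for gpu in range(num_gpus)}
    remote_sends = 0
    for token_id, (home_gpu, expert_ids) in enumerate(token_routes):
        for expert_id in expert_ids:
            gpu = expert_to_gpu(expert_id, num_experts, num_gpus)
            per_gpu[gpu].append((token_id, expert_id))
            if gpu != home_gpu:
                remote_sends += 1  # this activation must travel over the interconnect
    return per_gpu, remote_sends


# Two tokens, each routed to two of eight experts spread over four GPUs.
routes = [(0, [0, 5]), (1, [2, 3])]
per_gpu, remote_sends = dispatch(routes, num_experts=8, num_gpus=4)
print(per_gpu, remote_sends)
```

The more GPUs the experts are spread across, the larger the share of dispatches that become remote, which is why interconnect bandwidth matters for this layout.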
NVFP4 precision reduces the memory footprint and increases arithmetic intensity. The Blackwell generation introduces hardware support for FP4, so the low-precision arithmetic runs natively at hardware speed rather than through emulation.
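The memory saving can be illustrated with a simplified block-quantization sketch. It assumes the commonly described NVFP4 layout, in which 4-bit E2M1 elements share one 8-bit scale per block of 16 values; the scale is kept here as an ordinary Python float rather than an FP8 value, any per-tensor scale is ignored, and all function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed number of elements sharing one scale


def quantize_block(values):
    """Scale a block so its largest magnitude maps to 6.0, then snap to the grid."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        # Nearest grid magnitude; ties resolve to the smaller magnitude in this sketch.
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-magnitude if v < 0 else magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from the shared scale and the 4-bit codes."""
    return [scale * c for c in codes]


def quantize(values, block=BLOCK):
    """Quantize a flat list block by block."""
    return [quantize_block(values[i:i + block]) for i in range(0, len(values), block)]


def bits_per_value(value_bits=4, scale_bits=8, block=BLOCK):
    """Average storage per element, including the amortized block scale."""
    return value_bits + scale_bits / block


scale, codes = quantize_block([6.0, -3.0, 0.0, 1.5])
print(scale, codes, bits_per_value())
```

Under these assumptions each value costs 4.5 bits on average, against 16 bits for FP16, so the same weights occupy roughly 3.6 times less memory and each byte fetched carries more arithmetic work.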
Multi-token prediction and speculative decoding generate multiple tokens per model pass, amortizing the fixed overhead of each decoding step. The DFlash implementation of speculative decoding achieves up to 15× greater throughput compared to standard decoding, which produces one token per pass.
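The draft-and-verify idea behind speculative decoding can be sketched as follows. This is a generic greedy variant written for illustration, not DFlash and not NVIDIA's implementation: `target_next` and `draft_next` are hypothetical callables that return the next token for a context, and the verification loop stands in for what would be a single batched pass of the large model.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: draft k tokens cheaply, verify them together."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The small draft model proposes k tokens autoregressively.
        draft, context = [], list(tokens)
        for _ in range(k):
            token = draft_next(context)
            draft.append(token)
            context.append(token)

        # 2. The target model checks all k positions in one (simulated) pass.
        target_passes += 1
        accepted, context = [], list(tokens)
        for token in draft:
            expected = target_next(context)
            if token == expected:
                accepted.append(token)
                context.append(token)
            else:
                accepted.append(expected)  # keep the target's correction and stop
                break
        else:
            accepted.append(target_next(context))  # every draft matched: bonus token
        tokens.extend(accepted)

    generated = tokens[len(prompt):len(prompt) + max_new_tokens]
    return generated, target_passes


def target(context):
    """Toy target model: count upward modulo 10."""
    return (context[-1] + 1) % 10


def draft(context):
    """Toy draft model: agrees with the target except after the token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


print(speculative_decode(target, draft, prompt=[0], max_new_tokens=8, k=4))
```

Because every accepted token is one the target model would have produced itself, the output matches plain greedy decoding with the target; the gain comes from needing fewer target passes whenever the draft guesses well.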
In addition, NVIDIA highlights compute-communication overlap and kernel fusion as horizontal optimizations applied across the entire stack.
Real-World Production Results
Is This Confirmed in Production?
Yes — NVIDIA cites concrete results from partners using these optimizations in production:
Baseten serves DeepSeek V4 Pro on Blackwell and records up to 50% more tokens per second with TensorRT-LLM optimizations, compared to the previous generation of the stack.
Hippocratic AI, which handles 10 million patient calls, implemented the optimizations on DigitalOcean infrastructure and achieved 30% greater throughput with sub-half-second latency — critical for real-time voice applications.
DFlash speculative decoding delivers up to 15× throughput improvement for scenarios where the distribution of output tokens can be predicted.
Cognition uses the NVIDIA Dynamo inference framework for reinforcement learning workloads where latency is critical for training loops.
The Tools That Build This Stack
NVIDIA describes an ecosystem of tools that together form the inference stack: TensorRT-LLM as an optimizing compiler for serving, NVIDIA Dynamo as an inference framework for complex multi-system deployments, and integrations with popular open-source solutions vLLM, SGLang, and PyTorch with native CUDA support.
It is worth noting that all listed partners implemented the optimizations independently — suggesting that the methodology is not specific to one customer, but is reproducible across different use cases, from healthcare to software development.
Context: The Importance of a 5× Cost Reduction
Token cost directly determines the economics of LLM applications. A five-fold lower cost within one month means that applications that were previously only marginally viable become clearly profitable, or that the same budget buys five times more inference. For frontier-scale models such as DeepSeek V4, which have hundreds of billions of parameters and correspondingly high serving costs, the same reduction factor translates into a larger absolute saving in total operating costs.
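The link between throughput and token cost is simple arithmetic, sketched below. The function name is hypothetical, and the hourly price and throughput figures are made-up placeholders chosen only to show the relationship; they are not NVIDIA's numbers.

```python
def cost_per_million_tokens(gpu_hour_usd, tokens_per_second_per_gpu):
    """Cost of generating one million tokens on a GPU billed by the hour."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical figures: same hourly price, five times the throughput.
baseline = cost_per_million_tokens(gpu_hour_usd=3.0, tokens_per_second_per_gpu=1000)
optimized = cost_per_million_tokens(gpu_hour_usd=3.0, tokens_per_second_per_gpu=5000)
print(baseline, optimized, baseline / optimized)
```

With the hardware price held constant, cost per token is inversely proportional to throughput per GPU, so software that raises throughput five-fold lowers token cost by the same factor.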
Frequently Asked Questions
- By how much has NVIDIA reduced the token cost for DeepSeek V4?
- Through stacked software optimizations on Blackwell hardware, NVIDIA reduced the token cost for DeepSeek V4 by up to five times within one month, without changing the model itself.
- What are the key techniques that enable 20× greater throughput?
- A combination of disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, multi-token prediction, speculative decoding, and compute-communication overlap achieves up to 20× greater throughput per GPU on Blackwell.
- What are the real-world partner results on production systems?
- Baseten achieves up to 50% more tokens per second with TensorRT-LLM, Hippocratic AI records 30% greater throughput with sub-half-second latency on DigitalOcean, and DFlash speculative decoding delivers up to 15× greater throughput.