PyTorch/SGLang: DeepSeek-V4 Pro on NVIDIA GB300 — 5× higher throughput with the same interactivity
The PyTorch team and SGLang increased the serving throughput of the DeepSeek-V4 Pro model on NVIDIA GB300 architecture from around 2,200 to over 11,200 tokens per second per GPU between April and June 2026 — a fivefold improvement without sacrificing end-user interactivity.
This article was generated using artificial intelligence from primary sources.
Serving optimization, not a new model
The PyTorch team, in collaboration with the SGLang framework team, has published a detailed report on optimizing the serving of the DeepSeek-V4 Pro model on NVIDIA GB300 architecture (Blackwell Ultra). This is an engineering achievement in the inference infrastructure category — DeepSeek-V4 Pro remains the same model, but the way it is served has been radically improved.
SGLang (Structured Generation Language) is an open framework for high-performance serving of large language models that manages request scheduling, KV cache management, and kernel execution.
From 2,200 to 11,200 tokens per second
In April 2026 (day-0, the first day of deployment), the system achieved around 2,200 tokens per second per GPU at an interactivity level of 50 tokens per second per user. By June 2026, through a series of improvements, the same metric reached approximately 11,200 tokens per second per GPU — a 5× increase in throughput without changing the interactivity standard.
On an aggregated Blackwell Ultra configuration, a 2.85–2.91× improvement was recorded, with peak values exceeding 6×.
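The per-GPU figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the article. It assumes that dividing per-GPU throughput by the per-user rate approximates the number of concurrent streams one GPU sustains; the report does not state this directly. The function names are illustrative.

```python
# Figures quoted in the article, in tokens per second.
DAY0_TPS_PER_GPU = 2_200    # April 2026, day-0 deployment
JUNE_TPS_PER_GPU = 11_200   # June 2026, after optimizations
PER_USER_TPS = 50           # interactivity level per user


def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of the later throughput to the earlier one."""
    return after_tps / before_tps


def users_per_gpu(gpu_tps: float, user_tps: float) -> int:
    """Whole number of users one GPU can sustain at the given per-user rate.

    Assumption: per-GPU throughput divides evenly among concurrent streams.
    """
    return int(gpu_tps // user_tps)


if __name__ == "__main__":
    print(f"speedup: {speedup(DAY0_TPS_PER_GPU, JUNE_TPS_PER_GPU):.2f}x")
    print(f"users per GPU in April: {users_per_gpu(DAY0_TPS_PER_GPU, PER_USER_TPS)}")
    print(f"users per GPU in June: {users_per_gpu(JUNE_TPS_PER_GPU, PER_USER_TPS)}")
```

Under that assumption, each GPU goes from serving dozens of users at 50 tokens per second to serving a few hundred.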
Key technical innovations
The results were achieved through a combination of several advanced kernels and algorithmic improvements:
- MHC fusion: fuses multiple operations into a single GPU kernel, which cuts round trips to GPU memory and reduces memory access latency
- KV Compression V2: more aggressive key-value cache compression that reduces pressure on GPU memory bandwidth
- W4A4 MegaMoE: 4-bit weight and activation quantization for DeepSeek-V4 Pro's Mixture-of-Experts architecture, with FP4 precision on GB300
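To show what 4-bit quantization means in principle, here is a minimal sketch of symmetric integer quantization to 16 levels. It is a generic illustration, not the W4A4 MegaMoE kernel: the report describes FP4 (a 4-bit floating-point format) on GB300, whereas this example uses signed 4-bit integers with a single per-tensor scale because that is the simplest case to write down. All names are hypothetical.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7].

    Illustrative only: real W4A4 kernels use finer-grained scales and,
    on GB300 per the article, an FP4 format rather than integers.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to code 7; fall back to 1.0 for all-zero input.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes, scale):
    """Recover approximate real values from 4-bit codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.70, -0.35, 0.10, 0.02, 0.0]
    codes, scale = quantize_int4(weights)
    restored = dequantize_int4(codes, scale)
    for w, c, r in zip(weights, codes, restored):
        print(f"{w:+.3f} -> code {c:+d} -> {r:+.3f}")
```

Each value is stored in 4 bits instead of 16 or 32, which is where the memory and bandwidth savings come from; the cost is a rounding error of at most half a quantization step per value.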
MTP bug fix raised the speculative decoding acceptance rate
Multi-Token Prediction (MTP) is a speculative decoding technique in which the model drafts several tokens ahead and keeps only those that a verification pass confirms. Its key metric is the speculative acceptance rate: a higher rate means fewer rejected drafts and a higher effective generation speed.
After a bug that caused NaN values was fixed, the acceptance rate rose from 0.57 to 0.70, which on its own contributed significantly to the overall 5× improvement. Without the MTP fix, the system would have remained well below the June figures even with the same kernels.
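The link between acceptance rate and speed can be made concrete with a standard simplified model of speculative decoding. The sketch assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. The report does not say how its acceptance rate is defined or how many tokens are drafted per step, so the draft lengths below are assumptions chosen for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Simplified model (an assumption, not taken from the report): each of
    `draft_len` drafted tokens is accepted independently with probability
    `acceptance_rate`, acceptance stops at the first rejection, and the
    verifier always contributes one token of its own. The expectation is
    the geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)


if __name__ == "__main__":
    for draft_len in (1, 2, 3):  # hypothetical draft lengths
        before = expected_tokens_per_step(0.57, draft_len)
        after = expected_tokens_per_step(0.70, draft_len)
        print(f"draft_len={draft_len}: {before:.2f} -> {after:.2f} "
              f"tokens/step ({after / before:.2f}x)")
```

In this model the gain from a higher acceptance rate grows with the draft length, since every additional drafted token compounds the per-token probability.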
Practical significance
For cloud AI service providers, a fivefold throughput increase on the same hardware directly reduces the cost per generated token, or enables five times as many simultaneous users without additional investment in GPU infrastructure.
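The cost effect follows from simple division. The sketch below converts an hourly GPU price and a per-GPU throughput into a cost per million generated tokens. The hourly price is a made-up placeholder, not a figure from the report; only the ratio between the two results is meaningful.

```python
def cost_per_million_tokens(gpu_hour_price: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_price / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    HYPOTHETICAL_GPU_HOUR_PRICE = 10.0  # placeholder value, not from the report
    april = cost_per_million_tokens(HYPOTHETICAL_GPU_HOUR_PRICE, 2_200)
    june = cost_per_million_tokens(HYPOTHETICAL_GPU_HOUR_PRICE, 11_200)
    print(f"April: {april:.3f} per million tokens")
    print(f"June:  {june:.3f} per million tokens")
    print(f"cost ratio: {april / june:.2f}x lower")
```

Whatever the actual hourly price, the cost per token falls by the same factor as throughput rises, as long as the price of the hardware stays fixed.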
Frequently Asked Questions
- What is throughput and why does it matter for AI serving?
- Throughput measures how many tokens a model can generate per second per GPU — higher throughput means the same hardware can serve more users simultaneously at lower cost.
- What is Multi-Token Prediction and how does it help?
- MTP (Multi-Token Prediction) is a speculative decoding technique where the model predicts several tokens ahead in one step; raising the acceptance rate from 0.57 to 0.70 (after a NaN bug fix) means more of those predicted tokens are kept, which speeds up generation.