vLLM: open-source inference engine takes first place on the Artificial Analysis leaderboard
vLLM is an open-source inference engine that claimed first place on the Artificial Analysis leaderboard for three frontier models — DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B — through aggressive kernel fusion (33→10 launches per layer, 1.28× speedup), a custom EAGLE3 draft model for speculative decoding, and linear attention path optimizations.
This article was generated using artificial intelligence from primary sources.
vLLM, the open-source inference engine, has claimed first place on the Artificial Analysis leaderboard for three frontier models through targeted optimizations. The development team confirmed that vLLM now leads for DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B — the result of three distinct approaches, one per model.
DeepSeek V3.2: aggressive kernel fusion
On DeepSeek V3.2, vLLM achieves 230 tokens per second output throughput — as the announcement states, “more than 4× what most inference providers report.” The key is aggressive kernel fusion that merges normalization, rotary embedding, and quantization operations. The number of GPU kernel launches was reduced from approximately 33 to just 10 per layer, eliminating launch overhead at small batch sizes and delivering a 1.28× speedup at batch size 1.
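To make the idea of fusion concrete, here is a minimal pure-Python sketch, not vLLM's actual kernels. It assumes a simplified pipeline of RMS normalization, rotary embedding, and symmetric int8 quantization applied in that order to one vector; the source names the three operation types but not their order or implementation. The function names and the `kernel` decorator, which counts each call as one "launch", are hypothetical. The sketch checks that the fused version gives the same result as the three separate steps while using one launch instead of three.

```python
import math

LAUNCHES = {"count": 0}  # stand-in for the number of GPU kernel launches

def kernel(fn):
    """Treat each call to the wrapped function as one kernel launch."""
    def wrapper(*args, **kwargs):
        LAUNCHES["count"] += 1
        return fn(*args, **kwargs)
    return wrapper

@kernel
def rms_norm(x, eps=1e-6):
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]

@kernel
def rotary(x, pos, base=10000.0):
    # Rotate each consecutive pair (x[i], x[i+1]); len(x) must be even.
    out = [0.0] * len(x)
    for i in range(0, len(x), 2):
        theta = pos * base ** (-i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

@kernel
def quantize_int8(x):
    # Symmetric per-vector quantization; fall back to scale 1.0 for all-zero input.
    scale = max(abs(v) for v in x) / 127.0 or 1.0
    return [round(v / scale) for v in x], scale

@kernel
def fused_norm_rope_quant(x, pos, eps=1e-6, base=10000.0):
    # Same arithmetic as the three kernels above, done inside a single launch
    # so intermediates never leave the function (on a GPU: never leave registers).
    d = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / d + eps)
    rotated = [0.0] * d
    for i in range(0, d, 2):
        a, b = x[i] * inv_rms, x[i + 1] * inv_rms
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        rotated[i] = a * c - b * s
        rotated[i + 1] = a * s + b * c
    scale = max(abs(v) for v in rotated) / 127.0 or 1.0
    return [round(v / scale) for v in rotated], scale

def run_unfused(x, pos):
    return quantize_int8(rotary(rms_norm(x), pos))

def count_launches(fn, *args):
    """Run fn and report (result, number of launches it triggered)."""
    LAUNCHES["count"] = 0
    result = fn(*args)
    return result, LAUNCHES["count"]

if __name__ == "__main__":
    x = [0.5, -1.25, 2.0, 0.75, -0.1, 3.0, 1.5, -2.5]
    (q_a, s_a), n_a = count_launches(run_unfused, x, 7)
    (q_b, s_b), n_b = count_launches(fused_norm_rope_quant, x, 7)
    print("same output:", q_a == q_b and s_a == s_b)
    print("launches unfused vs fused:", n_a, n_b)
```

In this toy the saving is three launches down to one. The figures reported for DeepSeek V3.2 (about 33 down to 10 per layer) come from fusing more operations than are modeled here, and the benefit is largest at small batch sizes, where fixed per-launch overhead is a larger share of the total time.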
MiniMax-M2.5: custom EAGLE3 draft model
For MiniMax-M2.5, vLLM achieves 326 tokens/s at concurrency 1 using a custom EAGLE3 speculative decoding setup. EAGLE3 is a technique where a smaller “draft” model predicts several tokens ahead, which the large model then verifies in a single pass. vLLM engineers trained a specialized draft model via TorchSpec, so that it learns from the actual hidden states that vLLM produces rather than from generic datasets.
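The draft-and-verify loop can be sketched in a few lines of Python. This is a generic greedy speculative decoding sketch under simplifying assumptions, not EAGLE3 itself: it does not model the use of the target model's hidden states, and the toy models, function names, and the draft length `k` are all hypothetical. A real engine verifies all drafted positions in one batched forward pass; here the target is called once per position for clarity.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the greedy next token

def speculative_step(target: Model, draft: Model, ctx: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, plus one target token."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(ctx + proposal))

    accepted: List[int] = []
    for i in range(k):
        expected = target(ctx + proposal[:i])
        if expected != proposal[i]:
            accepted.append(expected)  # first mismatch: use the target's token and stop
            return accepted
        accepted.append(proposal[i])
    accepted.append(target(ctx + proposal))  # all k accepted: one extra token from the target
    return accepted

def speculative_generate(target: Model, draft: Model, prompt: List[int],
                         n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verification steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[:len(prompt) + n_new], steps

def greedy_generate(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out

# Toy models: the target counts upward modulo 10; the draft agrees except after a 7.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

if __name__ == "__main__":
    out, steps = speculative_generate(toy_target, toy_draft, [0], n_new=20, k=4)
    print("matches plain greedy:", out == greedy_generate(toy_target, [0], 20))
    print("verification steps for 20 tokens:", steps)
```

With exact-match acceptance, the output is identical to plain greedy decoding with the target model, so the draft only affects speed. The more often the draft agrees with the target, the more tokens each verification step yields, which is the motivation for training the draft on the hidden states the engine actually produces.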
Qwen 3.5 397B: attention path fusion
Qwen 3.5 397B ranks first among all 12 measured providers, with sub-second TTFT (time-to-first-token) on long prompts. The optimizations targeted the model’s specific linear-attention architecture and its normalization patterns; the announcement reports an improvement over the baseline, reaching “up to 6.69 requests/s at concurrency 256.”
What this means for the open-source ecosystem
vLLM, which anyone can run on their own hardware, now leads production benchmarks for three frontier models. For organizations running self-hosted inference for privacy, data sovereignty, or cost predictability, this is evidence that an open stack need not pay a structural performance penalty against proprietary services.
Frequently Asked Questions
- What is kernel fusion and how much does it help?
- Kernel fusion is a technique that combines multiple small GPU operations into a single larger kernel launch, reducing launch overhead. On DeepSeek V3.2, vLLM reduced the number of launches from ~33 to ~10 per layer by merging normalization, rotary embedding, and quantization — delivering a 1.28× speedup at batch size 1.
- What is EAGLE3 and why does it matter for MiniMax-M2.5?
- EAGLE3 is a speculative decoding approach where a smaller 'draft' model predicts tokens that the main model then verifies. The vLLM team trained a custom EAGLE3 draft model using TorchSpec, training it on the actual hidden states that vLLM produces, and achieved 326 tokens/s at concurrency 1 on MiniMax-M2.5.
- What does it mean that open-source can match proprietary inference?
- The Artificial Analysis leaderboard compares the production performance of inference providers; for Qwen 3.5 397B, the comparison covers 12 providers. That vLLM, which anyone can run on their own hardware, ranks first for three frontier models shows that the open-source stack no longer has to pay a 'price of openness' in performance.