PyTorch TokenSpeed-Kernel: up to 3.6× faster LLM inference

TokenSpeed-Kernel is an open-source three-layer kernel subsystem that runs LLM inference on NVIDIA and AMD GPUs without code rewrites. In measurements on an AMD MI355X it delivers up to 3.6× higher end-to-end throughput than Triton, and it is already integrated into the vLLM inference framework.

What is TokenSpeed-Kernel and why does it matter?

TokenSpeed-Kernel is a three-layer open-source kernel subsystem, that is, a set of low-level GPU programs that carry out the core computations of an LLM. It is designed to run on both NVIDIA and AMD silicon without code rewrites. The PyTorch team released it in response to a long-standing problem: high-performance kernels were tied to a single chip vendor, which made LLM systems difficult to port to alternative hardware.

How much speedup does it deliver in practice?

Measurements on the GPT-OSS 120B model running on an AMD MI355X GPU show gains at every measured inference stage compared to Triton, the previous standard PyTorch kernel framework:

Attention prefill (the phase that processes the input text): 1.4–2.3× faster than Triton
MoE decode (MoE, or Mixture of Experts, is an architecture in which the model activates only a subset of its parameters per token): 1.7–2.1× faster than Triton
End-to-end throughput (the whole serving system): 1.6–3.6× higher than Triton

The 3.6× upper bound is more than a marginal optimization: at that end of the range, the same hardware sustains 3.6 times the token throughput, so it can serve correspondingly more requests per hour or finish the same workload in under a third of the time.
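To make the range concrete, the arithmetic can be sketched as below. The baseline of 1,000 requests per hour is a made-up illustrative figure, not a measurement; only the 1.6× and 3.6× multipliers come from the reported end-to-end results. The sketch also assumes that request capacity scales linearly with throughput, which real serving systems only approximate.

```python
# Illustrative arithmetic only: the baseline below is a hypothetical number,
# not a measured result. Only the 1.6x and 3.6x multipliers come from the
# reported end-to-end throughput range.


def scaled_capacity(baseline_requests_per_hour: float, speedup: float) -> float:
    """Requests per hour after a throughput speedup, assuming linear scaling."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_requests_per_hour * speedup


def time_fraction(speedup: float) -> float:
    """Fraction of the original wall-clock time needed for the same workload."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    baseline = 1000.0  # hypothetical requests per hour on the Triton path
    for speedup in (1.6, 3.6):
        capacity = scaled_capacity(baseline, speedup)
        fraction = time_fraction(speedup)
        print(f"{speedup}x: {capacity:.0f} requests/hour, "
              f"{fraction:.1%} of the original time for the same work")
```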

How does the three-layer approach work?

TokenSpeed-Kernel splits code into three layers: a shared hardware-agnostic interface, an NVIDIA-specific backend, and an AMD-specific backend. When a developer calls an attention or MoE operation, the system selects the backend that matches the detected GPU, with no additional code on the user's side.
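The layering described above can be pictured with a minimal pure-Python sketch. Every name in it (attention_softmax, detect_vendor, BACKENDS) is a hypothetical stand-in chosen for illustration and is not taken from the TokenSpeed-Kernel API. The "backends" are plain CPU functions rather than GPU kernels, and detection is reduced to matching a device name string.

```python
# Hypothetical sketch of three-layer dispatch. All names are illustrative
# assumptions, not the TokenSpeed-Kernel API.
import math
from typing import Callable, Dict, List

Kernel = Callable[[List[float]], List[float]]


def _softmax_reference(scores: List[float]) -> List[float]:
    """Numerically stable softmax, used as a stand-in for a real GPU kernel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Layers 2 and 3: vendor-specific backends. Real backends would launch
# vendor-tuned GPU kernels; both stand-ins reuse the same CPU reference.
def softmax_nvidia(scores: List[float]) -> List[float]:
    return _softmax_reference(scores)


def softmax_amd(scores: List[float]) -> List[float]:
    return _softmax_reference(scores)


BACKENDS: Dict[str, Kernel] = {"nvidia": softmax_nvidia, "amd": softmax_amd}


def detect_vendor(device_name: str) -> str:
    """Map a device name string to a backend key (simplified detection)."""
    name = device_name.lower()
    for vendor in BACKENDS:
        if vendor in name:
            return vendor
    raise RuntimeError(f"no backend registered for device: {device_name!r}")


# Layer 1: the shared, hardware-agnostic entry point that callers use.
def attention_softmax(scores: List[float], device_name: str) -> List[float]:
    kernel = BACKENDS[detect_vendor(device_name)]
    return kernel(scores)


if __name__ == "__main__":
    for device in ("NVIDIA H100", "AMD Instinct MI355X"):
        print(device, "->", detect_vendor(device))
```

The caller only ever touches attention_softmax; the vendor choice happens inside it. In a real PyTorch environment, detection would more plausibly rely on build or device information (for example, torch.version.hip is set on ROCm builds and None otherwise) than on string matching.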

In addition, the @register_kernel plugin mechanism lets companies and researchers add support for their own non-standard silicon architectures by registering kernels into the same system.
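A decorator-based registry of this kind could look roughly like the sketch below. Only the decorator name register_kernel comes from the text; its parameters (op, backend), the lookup helper, the registry layout, and the "my_accelerator" backend are assumptions made for illustration, and the real signature may differ. The example kernel is a plain-Python top-k expert selection standing in for an MoE routing step.

```python
# Hypothetical sketch of a decorator-based kernel registry. Only the name
# register_kernel appears in the article; the parameters, the lookup helper,
# and the "my_accelerator" backend are illustrative assumptions.
from typing import Callable, Dict, List, Tuple

_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def register_kernel(op: str, backend: str) -> Callable[[Callable], Callable]:
    """Return a decorator that registers a function for (op, backend)."""
    def decorator(fn: Callable) -> Callable:
        key = (op, backend)
        if key in _REGISTRY:
            raise ValueError(f"kernel already registered for {key}")
        _REGISTRY[key] = fn
        return fn  # the function itself is left unchanged
    return decorator


def lookup(op: str, backend: str) -> Callable:
    """Fetch the kernel registered for an operation on a given backend."""
    try:
        return _REGISTRY[(op, backend)]
    except KeyError:
        raise LookupError(f"no kernel for op={op!r} on backend={backend!r}")


# A third party plugs in a kernel for its own hardware without touching
# the shared code above.
@register_kernel(op="moe_route", backend="my_accelerator")
def moe_route_my_accelerator(gate_scores: List[float], k: int) -> List[int]:
    """Pick the indices of the k highest-scoring experts (CPU stand-in)."""
    order = sorted(range(len(gate_scores)),
                   key=lambda i: gate_scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    route = lookup("moe_route", "my_accelerator")
    print(route([0.1, 0.7, 0.2], k=2))
```

Keying the registry on the (operation, backend) pair is one simple way to let new hardware coexist with the built-in NVIDIA and AMD backends; rejecting duplicate keys keeps a plugin from silently overriding an existing kernel.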

Integration and availability

TokenSpeed-Kernel is available as a standard Python package (installable with pip) and is already integrated into vLLM via PR #46742. vLLM is one of the most widely deployed open-source LLM serving frameworks, used by hundreds of production systems. vLLM users get the speedup without changing their own configuration.

The bigger picture: the end of the single-vendor monopoly?

Until now, high-performance kernels were practically the exclusive domain of the NVIDIA ecosystem, written for the CUDA platform, which does not run on AMD hardware. TokenSpeed-Kernel changes that dynamic: inference systems can now switch between NVIDIA and AMD GPUs using identical code. That could increase competition in the AI accelerator market, and it reduces the risk of single-vendor dependency for LLM production infrastructure.

Frequently Asked Questions

What is a kernel in the context of GPU inference?

A GPU kernel is a low-level program that runs on the graphics processor and performs the matrix and attention computations. Kernel speed therefore largely determines how many tokens a model generates per second.

Does TokenSpeed-Kernel work only with NVIDIA GPUs?

No. TokenSpeed-Kernel takes a multi-silicon approach, with separate backends for NVIDIA and AMD GPUs, and its plugin system (@register_kernel) allows support for new architectures to be added without changing the shared code.

PyTorch: TokenSpeed-Kernel — portable high-performance kernels for multi-silicon LLM inference