🟡 📦 Open Source · Thursday, May 7, 2026 · 2 min read

AMD: vLLM-ATOM plugin brings Instinct optimisations without changing vLLM code


AMD has released vLLM-ATOM, an open-source plugin that integrates Instinct GPU optimisations into the vLLM production framework without any changes to the upstream source code. It activates automatically through Python entry_points, supports dense and MoE models including Kimi-K2.5 and DeepSeek V3/R1, and uses AITER kernels for fused MoE and flash attention.

🤖

This article was generated using artificial intelligence from primary sources.

What is vLLM-ATOM?

On 7 May 2026 AMD presented vLLM-ATOM, an open-source plugin that integrates optimisations for Instinct GPUs into vLLM, one of the most widely used production frameworks for serving large language models. The key characteristic is that the integration requires no changes to the upstream vLLM source code: the plugin activates through the standard Python entry_points mechanism and exposes two hooks, register_platform() and register_model().
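To make that mechanism concrete, here is a minimal sketch of how a host framework can discover and invoke entry_points-based plugins. The group name "vllm.platform_plugins" and the behaviour of the loaded hook are illustrative assumptions, not confirmed details of vLLM-ATOM:

```python
# Minimal sketch of entry_points-based plugin discovery, the mechanism the
# article describes. Group name and hook payloads are illustrative assumptions.
from importlib.metadata import entry_points

def load_platform_plugins(group: str = "vllm.platform_plugins") -> dict:
    """Discover installed plugins advertised under `group` and call their hooks."""
    loaded = {}
    for ep in entry_points(group=group):   # every installed package can contribute an entry point
        register = ep.load()               # resolves something like vllm_atom:register_platform
        loaded[ep.name] = register()       # the hook tells the host what to register
    return loaded

if __name__ == "__main__":
    # With a plugin installed this prints its registration result;
    # with none installed it simply returns an empty dict.
    print(load_platform_plugins())
```

Because discovery happens at import time through package metadata, installing the plugin next to vLLM is enough for it to be picked up without configuration changes.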

Three-layer architecture

The plugin introduces a clear separation of concerns through three layers:

  • The vLLM layer retains control over request scheduling, KV cache management, continuous batching and the OpenAI-compatible API.
  • The ATOM plugin registers the platform, optimised model implementations and attention backend routing.
  • AITER provides low-level GPU kernels optimised for Instinct hardware.

This division allows AMD to contribute optimisations without forking the vLLM repository — which is critical for sustainability in the open-source ecosystem.
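On the plugin side, the two hooks named in the announcement could be exposed roughly as below. The package layout, class paths, architecture names and the packaging stanza in the comments are hypothetical sketches for illustration, not the actual vLLM-ATOM source:

```python
# Hypothetical sketch of the plugin-side hooks named in the announcement.
# All module, class and entry-point group names below are assumptions.
#
# A packaging stanza (e.g. in pyproject.toml) would advertise the hooks:
#   [project.entry-points."vllm.platform_plugins"]
#   atom = "vllm_atom:register_platform"
#   [project.entry-points."vllm.general_plugins"]
#   atom_models = "vllm_atom:register_model"

def register_platform() -> str | None:
    """Tell the host which platform implementation to use on Instinct GPUs."""
    # Returning a dotted path lets the host import the platform class lazily.
    return "vllm_atom.platform.AtomPlatform"   # hypothetical class

def register_model() -> None:
    """Point supported architectures at ATOM-optimised model implementations."""
    optimised = {   # hypothetical mapping of architecture names to optimised classes
        "DeepseekV3ForCausalLM": "vllm_atom.models.deepseek:AtomDeepseekV3",
        "Qwen3MoeForCausalLM": "vllm_atom.models.qwen3:AtomQwen3Moe",
    }
    for arch, path in optimised.items():
        # A real plugin would hand these to the host's model registry;
        # printing keeps the sketch self-contained.
        print(f"registering {arch} -> {path}")
```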

Which models are supported?

The plugin covers both text (LLM) and multimodal (VLM) models across dense and MoE architectures:

  • Kimi-K2.5 — multimodal MoE model (text/image/video)
  • DeepSeek V3 and R1 with MLA+MoE variants, including FP8 and MXFP4 quantisation
  • Qwen3 series in dense and MoE configurations
  • GLM-4 and GPT-OSS with MoE support

Attention backend routing is automatic: AiterBackend for standard Multi-Head Attention, AiterMLABackend for Multi-head Latent Attention architectures.
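A hedged sketch of what such routing could look like is shown below; the selection criterion (an MLA-specific low-rank KV projection exposed in the model config) and the config field names are assumptions made for illustration:

```python
# Illustrative sketch of automatic routing between the two attention backends
# named in the article. The config field checked here is an assumption.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    architecture: str
    kv_lora_rank: int | None = None   # MLA models compress the KV cache via a low-rank projection

def select_attention_backend(cfg: ModelConfig) -> str:
    """Route MLA-style models to AiterMLABackend, everything else to AiterBackend."""
    if cfg.kv_lora_rank is not None:   # heuristic: MLA architectures expose a KV LoRA rank
        return "AiterMLABackend"
    return "AiterBackend"

print(select_attention_backend(ModelConfig("DeepseekV3ForCausalLM", kv_lora_rank=512)))  # AiterMLABackend
print(select_attention_backend(ModelConfig("Qwen3ForCausalLM")))                         # AiterBackend
```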

Why is this strategically important?

NVIDIA dominates the inference market thanks to its mature software stack as much as its hardware. AMD's move with vLLM-ATOM, together with AITER kernels for fused MoE and flash attention, shows the company focusing on a “zero-friction” experience: install the plugin alongside vLLM and the optimisations switch on automatically. A live benchmark dashboard tracks throughput, latency and accuracy across model updates, enabling production verification before scaling. For the open-source community building infrastructure around Kimi-K2.5 and DeepSeek, this is a concrete step towards hardware diversity.

Frequently Asked Questions

What is vLLM?
An open-source production framework for serving large language models, known for high throughput thanks to mechanisms such as continuous batching and PagedAttention KV cache.
What is MoE architecture?
Mixture of Experts — a model with multiple specialised sub-networks; only a subset is activated during inference, enabling large capacity at lower compute cost per token.
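The “subset of experts per token” idea can be shown in a few lines; this toy gating function is purely illustrative and not tied to any specific model:

```python
# Toy illustration of MoE top-k gating: each token is routed to only k experts,
# so compute per token stays far below the model's total parameter count.
import numpy as np

def top_k_gate(router_logits: np.ndarray, k: int = 2):
    """Return the indices and normalised weights of the k experts chosen per token."""
    top_idx = np.argsort(router_logits, axis=-1)[:, -k:]               # k highest-scoring experts per token
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)
    weights = np.exp(top_logits) / np.exp(top_logits).sum(-1, keepdims=True)  # softmax over the chosen k
    return top_idx, weights

logits = np.random.randn(4, 8)   # 4 tokens, 8 experts
idx, w = top_k_gate(logits)
print(idx)                       # only 2 of the 8 experts run for each token
```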
What is AITER?
AMD's library of low-level GPU kernels optimised for Instinct hardware — includes fused MoE, flash attention, quantised GEMM and RoPE fusion.