AMD: New ATOM Inference Engine for Instinct GPUs Brings OpenAI-Compatible API and MoE Optimizations
AMD has introduced ATOM, an inference engine for Instinct GPUs that exposes an OpenAI-compatible API and orchestrates KV cache, scheduling, and parallelism. ATOM sits at the top of the ROCm stack, alongside AITER kernels and MoRI RDMA communication, supports TP, DP, and EP parallelism, and is optimized for MoE models such as DeepSeek V2–V4, Mixtral, and Qwen3-MoE. It offers FP8, MXFP4, INT8, and INT4 quantization and MTP speculative decoding with an EAGLE proposer.
This article was generated using artificial intelligence from primary sources.
AMD has introduced ATOM, an inference engine designed for Instinct GPUs that directly targets production serving of large language models on AMD hardware.
What does ATOM offer and where does it fit in AMD’s stack?
ATOM exposes an OpenAI-compatible API and orchestrates KV cache, scheduling, and parallelism during inference. An inference engine is the layer that receives requests and manages model execution on GPUs. ATOM sits at the top of AMD’s stack: ROCm as the platform, AITER for kernel acceleration, MoRI for RDMA communication between nodes, and ATOM as the serving layer. RDMA (Remote Direct Memory Access) enables direct memory transfers between devices without burdening the CPU.
Which models and parallelism modes does ATOM support?
ATOM supports tensor (TP), data (DP), and expert (EP) parallelism, and is especially optimized for MoE (Mixture of Experts) models. Explicitly listed models include DeepSeek V2 through V4, Mixtral, Qwen3-MoE, Kimi-K2.5, and MiniMax-M2. Expert parallelism distributes individual MoE “experts” across multiple GPUs, which is key to efficiently serving large MoE architectures.
How does ATOM accelerate inference?
ATOM offers quantization in FP8, MXFP4, INT8, and INT4 formats, with automatic detection from the HuggingFace model configuration. Quantization reduces the precision of weights to speed up inference and reduce memory consumption. Additionally, ATOM uses MTP speculative decoding with an EAGLE proposer, prefix cache sharing, and piecewise compilation for faster processing.
How is ATOM used in practice?
ATOM can run standalone or as a plugin for vLLM and SGLang, two popular LLM serving libraries. AMD is also publishing a public benchmark dashboard with nightly performance tracking, offering a transparent signal about progress in serving on Instinct GPUs as an alternative to the NVIDIA stack.
Frequently Asked Questions
- What is AMD ATOM?
- An inference engine for AMD Instinct GPUs that provides an OpenAI-compatible API and orchestrates KV cache, scheduling, and parallelism.
- Which models does ATOM optimize?
- MoE models such as DeepSeek V2–V4, Mixtral, Qwen3-MoE, Kimi-K2.5, and MiniMax-M2.
- Which quantization formats does ATOM support?
- FP8, MXFP4, INT8, and INT4, with automatic detection from HuggingFace configuration.
Related news
AMD: Gluon block-level model enables GEMM kernels with 5.255 TFLOPS MXFP4 on Instinct MI355
AMD: ROCm 7.13 brings MI350P GPU, multi-VF virtualisation and TheRock packaging
AMD ROCm: BubbleFence partitions video streams using Vision Foundation model embeddings instead of metadata heuristics