AMD: ATOM optimizer — DP Attention and Two-Batch Overlap for DeepSeek-V4 on MI355X

ATOM is AMD's open-source inference engine for the MI355X GPU. It brings two optimizations for DeepSeek-V4: PrefillDelayer removes the dummy-prefill loss incurred while Data Parallel ranks wait for each other, and Two-Batch Overlap balances tokens across ranks while overlapping AllGather and ReduceScatter network operations.

This article was generated using artificial intelligence from primary sources.

What is ATOM and why is AMD building its own inference engine?

ATOM is AMD's open-source inference engine: a software layer that optimizes how the MI355X GPU runs large language models. Unlike approaches that require specialized all2all network hardware, ATOM is presented by AMD as evidence that standard collective primitives on standard interconnects can achieve comparable performance.
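For readers unfamiliar with the collective primitives the article names later (AllGather and ReduceScatter), the following is a minimal single-process sketch of what each one computes. It is only a semantic model under stated assumptions: ranks are represented as entries of a Python list, no real network or GPU is involved, and the function names `all_gather` and `reduce_scatter` are illustrative rather than ATOM's API.

```python
from typing import List


def all_gather(shards: List[List[float]]) -> List[List[float]]:
    """Model of AllGather: shards[r] is the data held by rank r.

    Afterwards every rank holds the concatenation of all shards,
    ordered by rank.
    """
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(inputs: List[List[float]]) -> List[List[float]]:
    """Model of ReduceScatter with a sum reduction.

    inputs[r] is the full-length vector held by rank r. The vectors are
    summed element-wise, and rank r keeps only the r-th equal-sized slice.
    """
    world = len(inputs)
    n = len(inputs[0])
    if any(len(v) != n for v in inputs) or n % world != 0:
        raise ValueError("all inputs must share a length divisible by the rank count")
    summed = [sum(v[i] for v in inputs) for i in range(n)]
    chunk = n // world
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world)]


if __name__ == "__main__":
    # Two ranks, each holding one value: both end up with [1, 2].
    print(all_gather([[1], [2]]))
    # Two ranks, each holding a length-2 vector: rank 0 keeps 1+3, rank 1 keeps 2+4.
    print(reduce_scatter([[1, 2], [3, 4]]))
```

In this model, AllGather is how every rank obtains all tokens, and ReduceScatter is how summed results are handed back so that each rank keeps only its own slice.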

Two key optimizations for DeepSeek-V4

PrefillDelayer coordinates when Data Parallel (DP) ranks enter the prefill phase. This eliminates the so-called dummy-prefill loss, which occurs when ranks wait for each other without doing useful work. The second optimization, Two-Batch Overlap, introduces per-token balancing and overlaps the AllGather (AG) and ReduceScatter (RS) network operations, known as AG/RS overlap, which reduces the overall time spent waiting on network transfers.
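Both ideas can be illustrated with a toy model. The sketch below is not ATOM's implementation. The function names, the `max_delay` parameter, the rule that all ranks must run a prefill step together, and the assumption that compute and communication times halve when a batch is split in two are all hypothetical simplifications chosen for illustration.

```python
from typing import List, Optional, Set


def count_dummy_prefills(arrivals: List[Set[int]], num_steps: int,
                         max_delay: Optional[int] = None) -> int:
    """Count prefill steps in which a rank had no real request to process.

    arrivals[r] is the set of steps at which rank r receives one request.
    Assumption: when any rank prefills, all ranks must take part.
    max_delay=None prefills eagerly; otherwise the step is postponed for up
    to max_delay steps while some rank still has no pending request.
    """
    pending = [0] * len(arrivals)
    waited = 0
    dummy = 0
    for step in range(num_steps):
        for r, steps in enumerate(arrivals):
            if step in steps:
                pending[r] += 1
        ready = [p > 0 for p in pending]
        if not any(ready):
            continue
        if max_delay is not None and not all(ready) and waited < max_delay:
            waited += 1  # hold back and hope the idle ranks receive work
            continue
        for r in range(len(pending)):
            if pending[r] > 0:
                pending[r] -= 1
            else:
                dummy += 1  # this rank runs a prefill step with no real work
        waited = 0
    return dummy


def single_batch_time(layers: int, compute: float, comm: float) -> float:
    """One batch per layer: communication starts only after compute ends."""
    return layers * (compute + comm)


def two_batch_time(layers: int, compute: float, comm: float) -> float:
    """Two half-size micro-batches sharing one compute and one network lane.

    While one micro-batch communicates, the other may compute.
    """
    c, m = compute / 2, comm / 2  # assumed linear scaling with batch size
    compute_free = comm_free = 0.0
    ready = [0.0, 0.0]  # earliest time each micro-batch can start its next layer
    for _ in range(layers):
        for b in (0, 1):
            end_c = max(compute_free, ready[b]) + c
            compute_free = end_c
            end_m = max(comm_free, end_c) + m
            comm_free = end_m
            ready[b] = end_m
    return max(ready)


if __name__ == "__main__":
    arrivals = [{0}, {1}]  # rank 0 gets a request at step 0, rank 1 at step 1
    print(count_dummy_prefills(arrivals, 4))               # eager policy
    print(count_dummy_prefills(arrivals, 4, max_delay=2))  # delayed policy
    print(single_batch_time(2, 4.0, 4.0), two_batch_time(2, 4.0, 4.0))
```

In this toy setup the eager policy performs 2 dummy prefills and the delayed policy performs none, at the cost of rank 0 waiting one extra step. For two layers with 4 units of compute and 4 units of communication each, the single-batch schedule takes 16 time units and the two-batch schedule takes 10. Real gains depend on how compute and transfer times actually scale, which this model does not capture.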

Results on the SemiAnalysis InferenceX benchmark

Measurements were conducted on the SemiAnalysis InferenceX benchmark with a workload of 8K input and 1K output tokens. AMD highlights that ATOM on MI355X rivals specialized all2all approaches that otherwise require expensive custom interconnect hardware, a significant result for deployments on standard infrastructure. The code is publicly available as open source, making it accessible to anyone experimenting with DeepSeek-V4 on AMD hardware.

Frequently Asked Questions

What is the ATOM inference engine and how does it differ from standard solutions?
ATOM is AMD's open-source inference engine, a software layer that manages how the GPU executes AI models. It stands out by achieving high performance using standard network primitives instead of specialized all2all approaches that require expensive custom interconnects.

On which workloads was ATOM benchmarked?
Benchmarking was performed on the SemiAnalysis InferenceX benchmark with a workload of 8K input and 1K output tokens, corresponding to typical production requirements for a large language model like DeepSeek-V4.