AMD: ROCm Low-Latency GEMM Kernels Speed Up LLM Inference by Up to 1.79× on Instinct MI355X

AMD has released FlyDSL, a kernel generator within the AITER framework (AI Tensor Engine for ROCm) that automatically produces specialized GEMM kernels for the LLM decode phase on AMD GPUs. Reported results: a 1.64× average speedup, rising to 1.79× for the most latency-critical shapes with M≤8 tokens, measured on an Instinct MI355X with 256 compute units.

This article was generated using artificial intelligence from primary sources.

What Is GEMM and the LLM Decode Phase?

GEMM (General Matrix Multiply) is the fundamental computational operation that dominates every pass through a large language model. In the prefill phase, the model processes the input prompt in parallel, but in the decode phase — when it generates output tokens one by one — the batch size M is typically small: 1, 2, 4, 8, or 16 rows. This asymmetry (small M, but large K and N in the thousands) makes the decode phase a critical bottleneck: standard GEMM routines optimized for high throughput deliver poor latency here.
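
One common way to see why small-M shapes behave differently is to estimate arithmetic intensity, the number of floating-point operations per byte moved. The sketch below is illustrative only: the function name is hypothetical, N=4096 is an assumed value (only K=7168 appears in AMD's results), and it assumes BF16 storage (2 bytes per element) with each matrix read or written exactly once.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes every matrix is moved through memory exactly once and
    that elements are 2 bytes wide (BF16). Both are simplifications.
    """
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    # Decode-like shapes (M = 1, 8) versus a prefill-like shape (M = 4096).
    for m in (1, 8, 4096):
        intensity = gemm_arithmetic_intensity(m, n=4096, k=7168)
        print(f"M={m:5d}  FLOPs/byte ~ {intensity:.1f}")
```

Under these assumptions the intensity is roughly 1 FLOP per byte at M=1 and roughly 8 at M=8, because the K×N weight matrix dominates the traffic regardless of M. That is why kernels tuned for large, compute-heavy shapes are a poor fit for decode.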

Three FlyDSL Techniques Within the AITER Framework

AMD develops AITER (AI Tensor Engine for ROCm) as part of the ROCm ecosystem. FlyDSL is the component within AITER that automatically generates specialized GEMM kernels. It combines three complementary techniques:

  1. Inter-CTA Split-K Parallelism: expands the launch grid along the K dimension, distributing the work across more blocks (CTAs) so that fewer compute units sit idle when M is small.
  2. Intra-CTA K-slice Splitting: within a single CTA, splits the K axis into smaller slices, increasing useful parallelism without additional synchronization overhead.
  3. LDS Pipeline (multi-stage): overlaps data transfer from global memory into the Local Data Share (LDS), the on-chip memory shared within a compute unit, with active computation, hiding memory latency on the AMD Instinct MI355X (gfx950, 256 compute units).
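
The first technique can be sketched in plain Python. This is not FlyDSL code or its API: the function names, the 16×64 tile size, and N=4096 are assumptions chosen for illustration, and the loops stand in for work that separate CTAs would do in parallel on the GPU. It shows two things: how splitting K multiplies the number of blocks in the launch grid, and that summing the per-slice partial products reproduces the full GEMM result.

```python
from math import ceil


def grid_size(m, n, tile_m, tile_n, split_k=1):
    """Number of blocks launched for an M x N output, optionally split along K."""
    return ceil(m / tile_m) * ceil(n / tile_n) * split_k


def partial_gemm(a, b, k_start, k_end):
    """Partial product of A @ B using only rows k_start..k_end-1 of B."""
    rows, cols = len(a), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(k_start, k_end)) for j in range(cols)]
        for i in range(rows)
    ]


def split_k_gemm(a, b, split_k):
    """Compute A @ B as a sum of partial products over K slices.

    Each slice stands in for the work of one group of CTAs; the final
    summation stands in for the cross-block reduction.
    """
    k_total = len(b)
    chunk = ceil(k_total / split_k)
    partials = [
        partial_gemm(a, b, start, min(start + chunk, k_total))
        for start in range(0, k_total, chunk)
    ]
    rows, cols = len(a), len(b[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


if __name__ == "__main__":
    # With M=8 and assumed 16x64 tiles, N=4096 yields only 64 blocks,
    # fewer than the 256 compute units; split_k=4 raises that to 256.
    print(grid_size(8, 4096, 16, 64), grid_size(8, 4096, 16, 64, split_k=4))

    a = [[1, 2, 3, 4]]                    # M=1, K=4
    b = [[1, 0], [0, 1], [1, 1], [2, 0]]  # K=4, N=2
    print(split_k_gemm(a, b, split_k=2))
```

The sketch leaves out the intra-CTA slicing and the LDS pipeline, which depend on hardware scheduling that plain Python cannot model meaningfully.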

Results and Hardware: 1.64× Average, 1.79× for Most Critical Scenarios

Benchmarking covered 32 primary shapes plus 48 additional variants taken from production models (DeepSeek V3, Llama 70B, Llama 405B, and Qwen32B), comparing FlyDSL kernels against three baseline implementations: hipBLASLt, AITER Triton, and AITER ASM. The average speedup is 1.64× on the key shapes (K=7168), and for the decode-critical case of M≤8 tokens it reaches 1.79×. The largest gain measured on an individual shape was 2.37×. On the broader set of BF16 shapes from production models, the average speedup is 1.49×.
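
To read these multipliers as latency savings, a speedup of s means the new kernel takes 1/s of the baseline time. The short sketch below does that conversion. It assumes each figure is defined as baseline latency divided by FlyDSL latency, which the numbers above suggest but do not state outright; the function name is hypothetical.

```python
def latency_reduction_pct(speedup):
    """Percent of baseline latency saved, assuming speedup = t_baseline / t_new."""
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    for label, speedup in [
        ("average, K=7168", 1.64),
        ("M<=8", 1.79),
        ("best single shape", 2.37),
        ("broader BF16 set", 1.49),
    ]:
        print(f"{label:18s} {speedup:.2f}x -> {latency_reduction_pct(speedup):.1f}% less latency")
```

Under that assumption, the 1.64× average corresponds to about 39% less time per GEMM call and the 1.79× figure to about 44%.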

Can AMD Close the Software Gap Through a Programmatic Approach?

FlyDSL and AITER represent AMD’s systematic response to the software deficit in the ROCm ecosystem. While NVIDIA’s cuBLAS has a multi-year head start, AMD now generates high-performance kernels programmatically — meaning optimizations can be rapidly extended to new GPU architectures without manually writing assembly code. For operators considering a switch to AMD Instinct infrastructure, this progress in decode latency directly impacts cost per generated token.

Frequently Asked Questions

What is GEMM and why is it important for LLM inference?
GEMM (General Matrix Multiply) is the matrix multiplication operation that dominates computation in LLMs, especially in the decode phase when the model generates tokens one by one with small batch sizes like M=1, 2, 4, or 8.

Which models did AMD test with FlyDSL kernels?
Testing was conducted on matrix shapes from DeepSeek V3, Llama 70B, Llama 405B, and Qwen32B models on the AMD Instinct MI355X GPU with 256 compute units (gfx950 architecture).