AMD: Gluon block-level model enables GEMM kernels with 5,255 TFLOPS MXFP4 on Instinct MI355
The AMD ROCm team published a tutorial for writing high-performance GEMM kernels in the Gluon programming model on the MI355 GPU. An optimized FP16 kernel achieves 1,489 TFLOPS at 98.75 percent MFMA efficiency, while extensions to BF8 (3,257 TFLOPS) and MXFP4 (5,255 TFLOPS) demonstrate relevance for modern AI workloads. The tutorial also covers workgroup remapping and swizzling, which together reduce L2 cache misses from 5.3 million to 4.1 million.
This article was generated using artificial intelligence from primary sources.
The AMD ROCm blog team published a detailed tutorial on May 22, 2026, for writing high-performance GEMM (General Matrix Multiplication) kernels in the Gluon programming model, targeting the AMD Instinct MI355 GPU. The tutorial documents 98.75 percent MFMA efficiency for FP16 GEMM (1,489 TFLOPS), along with extensions to modern low-precision formats: BF8 at 3,257 TFLOPS and MXFP4 at 5,255 TFLOPS.
What is Gluon and how does it differ from HIP?
Gluon is AMD’s block-level programming model for GPU kernel development on CDNA and RDNA architectures. Its syntax is inspired by the CUDA Tile API and similar Triton block-level abstractions, but the model targets CDNA hardware specifics: MFMA matrix multiply units, multi-XCD chiplet topologies, and L2 cache structures.
The difference from a standard HIP program: HIP provides a C++-like CUDA-equivalent API where the compiler optimizes kernels. Gluon explicitly requires the developer to specify tensor layouts, pipeline stages, register budget per thread block, and swizzle patterns for memory accesses. This is more work for the developer, but provides finer control — which is crucial for achieving 95+ percent MFMA efficiency that is otherwise inaccessible through standard HIP optimizations.
Gluon is an open-source part of the ROCm ecosystem, available through the rocm-developer-tools package since the ROCm 7.0 release (January 2026).
How does the optimized FP16 kernel achieve 1.489 TFLOPS?
The MI355 is AMD’s latest data center GPU from the CDNA 4 generation, with an 8-XCD chiplet topology. Its theoretical peak FP16 throughput is approximately 1,508 TFLOPS with fully utilized MFMA units. The tutorial walks through the steps to reach 1,489 TFLOPS (98.75 percent of the theoretical peak):
- Block-level tiling: tensors A (M × K) and B (K × N) are partitioned into 128×128 blocks, a size that divides evenly into 16×16 MFMA tiles.
- Pipeline stages: compute (MFMA instruction) and memory loading (LDS into registers) are interleaved through a 4-stage pipeline, hiding memory latency.
- Register budgeting: the developer explicitly limits the number of registers per thread block (256 SGPR + 1024 VGPR) to avoid register spilling into L1.
- MFMA instruction selection: the tutorial uses mfma_f32_16x16x16f16, which computes a 16×16 output tile from 16×16 FP16 input tiles with a single instruction.
With these optimizations the kernel achieves 1,489 TFLOPS for an 8192×8192×8192 GEMM, approximately 3× faster than a naive HIP implementation (520 TFLOPS).
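The block-level tiling step listed above can be illustrated with a minimal pure-Python sketch. It is not Gluon code and does not model MFMA instructions, LDS staging, or the 4-stage pipeline; it only shows the loop structure in which each output tile (one workgroup's share of the work) accumulates partial products over slices of K. The function name `tiled_gemm` and the default `block_k=64` are assumptions made for illustration; only the 128×128 output tile comes from the article.

```python
def tiled_gemm(a, b, block_m=128, block_n=128, block_k=64):
    """Blocked GEMM on lists of lists.

    Each (i0, j0) output tile stands in for the work of one workgroup.
    block_k is a hypothetical K-slice size, not a value from the tutorial.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block_m):            # tile rows of C
        for j0 in range(0, n, block_n):        # tile columns of C
            for k0 in range(0, k, block_k):    # march along K, one slice at a time
                for i in range(i0, min(i0 + block_m, m)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + block_k, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + block_n, n)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

On a GPU the three innermost loops would be replaced by matrix instructions operating on tiles held in registers, and the K loop is where loads for the next slice would be overlapped with compute on the current one.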
What do BF8 and MXFP4 extensions deliver?
BF8 (BFloat8) is an 8-bit floating-point format with a sign bit, a 5-bit exponent, and a 2-bit mantissa (E5M2), designed for LLM training. MI355 supports BF8 GEMM natively in hardware via the mfma_f32_16x16x32bf8 instruction, which computes a 16×16 product from 32×16 BF8 inputs with a single instruction (twice as dense as FP16 because elements are half the size). The tutorial kernel achieves 3,257 TFLOPS BF8 throughput, suitable for pre-training large language models.
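To make the bit layout concrete, here is a small sketch that assumes BF8 follows the IEEE-style E5M2 layout (1 sign bit, 5 exponent bits, 2 mantissa bits, exponent bias 15). Under that assumption an E5M2 byte is simply the upper byte of an FP16 value, so conversion can be shown with the standard `struct` module. The helper names are hypothetical, and the encoder truncates (rounds toward zero) for simplicity; hardware conversions typically round to nearest.

```python
import struct

def fp16_bits(x):
    """Return the 16-bit IEEE half-precision pattern of x as an int.

    Raises OverflowError if x is too large to represent in FP16.
    """
    return struct.unpack("<H", struct.pack("<e", x))[0]

def encode_e5m2(x):
    """Keep sign, 5 exponent bits and the top 2 mantissa bits (truncation)."""
    return fp16_bits(x) >> 8

def decode_e5m2(byte):
    """Expand an E5M2 byte back to a Python float by padding the mantissa with zeros."""
    return struct.unpack("<e", struct.pack("<H", (byte & 0xFF) << 8))[0]
```

With only two mantissa bits, the representable values between 1 and 2 are 1.0, 1.25, 1.5 and 1.75, so an input such as 1.125 collapses to 1.0 under truncation.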
MXFP4 (Microscaling FP4) is an even more aggressive 4-bit format with a shared exponent per group of values (typically 32 elements). It reduces memory bandwidth by roughly 4× compared to FP16 while maintaining acceptable quality for LLM inference. AMD MI355 supports MXFP4 GEMM through the mfma_f32_16x16x64mxfp4 instruction. The tutorial kernel achieves 5,255 TFLOPS MXFP4 throughput, which is relevant for inference deployment of frontier models on MI355.
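The shared-exponent idea can be sketched in a few lines. The example below assumes the element format is E2M1 (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and that each group of 32 values shares one power-of-two scale, as in the OCP Microscaling specification. The function names are hypothetical, codes are kept as plain floats rather than packed 4-bit fields, and rounding is nearest-value with no special tie handling, so this is an illustration rather than a bit-exact reference.

```python
import math

FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 32          # elements sharing one scale
E2M1_MAX_EXPONENT = 2    # largest magnitude is 6 = 1.5 * 2**2

def quantize_group(values):
    """Return (shared_exponent, codes) for one group of values."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_MAX_EXPONENT
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, FP4_E2M1_MAGNITUDES[-1])  # clamp to the FP4 range
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda q: abs(q - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes

def dequantize_group(shared_exp, codes):
    """Rebuild approximate values from the shared exponent and FP4 codes."""
    return [c * 2.0 ** shared_exp for c in codes]
```

Assuming an 8-bit scale per 32 elements, storage comes to 4.25 bits per value, so the reduction relative to FP16 is slightly below a full 4×.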
What does workgroup remapping and swizzle optimization do?
The CDNA 4 architecture has 8 XCD chiplets (Accelerator Complex Dies) that share L2 cache but each have their own L1. Standard linear workgroup mapping results in cache thrashing because adjacent workgroups access non-adjacent memory regions in L2.
The tutorial introduces workgroup remapping: the workgroup ID is transformed through a space-filling curve (a Hilbert curve variant) so that workgroups with adjacent logical IDs map to adjacent L2 cache regions. It combines this with a swizzle pattern for memory accesses that distributes them evenly across HBM channels. As a result, L2 cache misses drop from 5.3 million to 4.1 million for an 8192×8192 GEMM (−23 percent), which translates to a throughput improvement of about 6 percent for memory-bound GEMM regions.
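The effect of remapping can be illustrated with a textbook Hilbert curve; the tutorial's exact variant is not reproduced here. The sketch assumes a 64×64 grid of output tiles (8192 / 128 per side) and, as a further assumption, a window of 64 workgroups running at the same time. It counts how many distinct row panels of A and column panels of B such a window touches under each ordering. All function names are hypothetical.

```python
def hilbert_tile(n, wg_id):
    """Map a linear workgroup ID to (row, col) on an n x n tile grid.

    n must be a power of two. Standard iterative Hilbert-curve decoding.
    """
    x = y = 0
    t = wg_id
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y   # flip the quadrant
            x, y = y, x                       # rotate the quadrant
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return y, x

def linear_tile(n, wg_id):
    """Row-major mapping: consecutive IDs walk along one row of tiles."""
    return divmod(wg_id, n)

def distinct_panels(mapping, n, first_id, count):
    """Count distinct A row panels plus B column panels touched by a run of workgroups."""
    rows, cols = set(), set()
    for wg_id in range(first_id, first_id + count):
        r, c = mapping(n, wg_id)
        rows.add(r)
        cols.add(c)
    return len(rows) + len(cols)
```

For the first 64 workgroups on a 64×64 grid, the row-major order touches 65 panels (one row of A and 64 columns of B), while the Hilbert order stays inside an 8×8 square and touches 16, which is the kind of data reuse the remapping is meant to create.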
What is the intentionally regressive example in the tutorial?
The tutorial includes a diagnostic example with incorrect register budgeting: a kernel that looks as if it should be faster but actually loses roughly three quarters of its throughput (384 TFLOPS FP16 instead of 1,489 TFLOPS). The cause is register spilling, which occurs when the developer allocates too many registers per thread block.
The goal is to teach developers how to recognize and diagnose register spilling in real kernels. The tutorial shows how to use the rocprof profiler to detect spilling through concrete hardware counters, and how to modify Gluon code to resolve it. This is rare in GPU tutorials, which typically show only successful examples. The diagnostic example is valuable because it demonstrates how to solve real performance problems in production.
Frequently Asked Questions
- What is the Gluon programming model?
- Gluon is AMD's block-level programming model for writing GPU kernels that gives developers explicit control over layouts, pipeline stages, and register budgeting. The difference from HIP: the developer controls details that the HIP compiler normally hides, enabling higher performance at the cost of greater engineering effort.
- What does 98.75 percent MFMA efficiency mean?
- MFMA (Matrix Fused Multiply-Add) is a hardware instruction on AMD CDNA architecture for matrix multiplication. 98.75 percent efficiency means the kernel utilizes 98.75 percent of the theoretical peak throughput of the MFMA units — very close to the hardware limit.
- What is MXFP4 and why is it relevant?
- MXFP4 (Microscaling FP4) is a 4-bit floating-point format with a shared exponent per group of values. It reduces memory bandwidth by 4× compared to FP16 while maintaining acceptable quality for LLM inference. AMD MI355 supports it natively in hardware via MFMA instructions.
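The efficiency and speedup figures quoted in this article can be cross-checked with simple arithmetic. The sketch below uses the usual convention of counting 2·M·N·K floating-point operations for a GEMM (one multiply and one add per inner-product term); the helper names are hypothetical, and the throughput values in the usage lines are the article's figures in TFLOPS.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in an M x N x K GEMM (multiply + add per term)."""
    return 2 * m * n * k

def efficiency(achieved_tflops, peak_tflops):
    """Fraction of theoretical peak throughput actually achieved."""
    return achieved_tflops / peak_tflops

def runtime_ms(m, n, k, tflops):
    """Time in milliseconds to run the GEMM at a sustained throughput in TFLOPS."""
    return gemm_flops(m, n, k) / (tflops * 1e12) * 1e3

# Figures from the article, in TFLOPS.
fp16_efficiency = efficiency(1489.0, 1508.0)        # close to the quoted 98.75 percent
speedup_vs_naive = 1489.0 / 520.0                   # close to the quoted "approximately 3x"
optimized_ms = runtime_ms(8192, 8192, 8192, 1489.0) # under one millisecond
```

Because the peak value is given only approximately, the computed efficiency matches the quoted 98.75 percent to within a few hundredths of a percentage point.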