AMD: Gluon block-level model enables GEMM kernels reaching 5,255 TFLOPS in MXFP4 on Instinct MI355
The AMD ROCm team has published a tutorial on writing high-performance GEMM kernels in the Gluon programming model for the MI355 GPU. An optimized FP16 kernel achieves 1,489 TFLOPS at 98.75 percent MFMA efficiency, and extensions to BF8 (3,257 TFLOPS) and MXFP4 (5,255 TFLOPS) show that the approach carries over to the low-precision formats used in modern AI workloads. The tutorial also covers workgroup remapping and swizzling, which together reduce L2 cache misses from 5.3 million to 4.1 million.
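
To illustrate why workgroup remapping can cut L2 misses, here is a minimal sketch in plain Python of the grouped ordering commonly used in Triton-style GEMM kernels. It is not taken from the AMD tutorial and does not use the Gluon API; the function names, the `group_m` parameter, and the grid sizes are assumptions chosen for illustration. The idea is that workgroups launched close together are mapped to a compact block of output tiles, so they share more rows of A and columns of B.

```python
def remap_pid(pid, num_pid_m, num_pid_n, group_m):
    """Map a linear workgroup id to an output tile (pid_m, pid_n).

    Tiles are visited in bands of `group_m` tile rows; within a band the
    row index varies fastest. All names here are hypothetical.
    """
    num_pid_in_group = group_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_m
    # The last band may be shorter when num_pid_m is not a multiple of group_m.
    group_size_m = min(num_pid_m - first_pid_m, group_m)
    pid_in_group = pid % num_pid_in_group
    pid_m = first_pid_m + pid_in_group % group_size_m
    pid_n = pid_in_group // group_size_m
    return pid_m, pid_n


def row_major_pid(pid, num_pid_n):
    """Baseline mapping: consecutive ids walk along one tile row."""
    return pid // num_pid_n, pid % num_pid_n


def distinct_operand_blocks(tiles):
    """Count the distinct A row-blocks plus B column-blocks a set of tiles reads."""
    rows = {m for m, _ in tiles}
    cols = {n for _, n in tiles}
    return len(rows) + len(cols)


if __name__ == "__main__":
    num_pid_m, num_pid_n, group_m = 8, 8, 4
    wave = range(16)  # assume 16 workgroups are resident at the same time
    baseline = [row_major_pid(p, num_pid_n) for p in wave]
    grouped = [remap_pid(p, num_pid_m, num_pid_n, group_m) for p in wave]
    print("row-major operand blocks:", distinct_operand_blocks(baseline))
    print("grouped operand blocks:  ", distinct_operand_blocks(grouped))
```

In this toy model, a smaller count of distinct operand blocks per wave means more reuse of data already in cache. The actual miss counts reported in the tutorial depend on the hardware and the kernel, which this sketch does not model.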