AMD ROCm: 4-wave FP8 GEMM for MI355X — 512 VGPR per wave

The AMD ROCm blog describes a new 4-wave interleave approach for FP8 GEMM (general matrix multiply) kernels on the Instinct MI355X accelerator with the CDNA 4 architecture. The optimization moves from an 8-wave ping-pong layout to 4 waves that each get the full VGPR budget of 512 registers, eliminates LDS bank conflicts through an XOR-based swizzle, and hides memory latency by precisely overlapping MFMA instructions with data loads.

The AMD ROCm team published a detailed technical description of a new optimization strategy for FP8 GEMM (General Matrix Multiply) kernels on AMD Instinct MI355X accelerators with the CDNA 4 architecture. The optimization goes beyond the standard 8-wave ping-pong approach by introducing a 4-wave interleave design that doubles the register budget per wave and removes LDS bank conflicts through the data layout itself.

Why does the 4-wave approach outperform the 8-wave for FP8 GEMM?

The key difference lies in VGPR (vector general-purpose register) allocation. In the 8-wave design, which places “two waves per SIMD unit,” the two waves split that SIMD's register file, which limits the size of the output tile each wave can process. The 4-wave approach assigns one wave per SIMD unit, giving it the full budget of 512 VGPRs.

This directly affects the tile size each wave can handle: the 4-wave approach uses 128x128 output tiles per wave versus 64x128 in the 8-wave design. The kernel-wide tile is 256x256x128, and each of the four waves processes its 128x128 share as four 64x64 blocks, using 16x16 MFMA (Matrix Fused Multiply-Add) operations within each block.
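The tile arithmetic above can be checked with a short calculation. The sketch below is illustrative only: the wave-grid shapes (2x2 and 4x2), the 64-lane wave size, FP32 accumulation, and every name in it are assumptions made for this example, not details taken from the AMD kernel.

```python
# Illustrative arithmetic only; values marked "assumed" are not from the article.
VGPRS_PER_SIMD = 512          # full budget when a single wave owns a SIMD
SIMDS_PER_CU = 4              # implied by "one wave per SIMD" with 4 waves
WAVE_SIZE = 64                # assumed: lanes per wave
BLOCK_M, BLOCK_N = 256, 256   # output part of the 256x256x128 kernel tile


def per_wave_budget(wave_rows, wave_cols):
    """Return (vgprs, tile_m, tile_n, acc_vgprs) for a wave_rows x wave_cols wave grid."""
    num_waves = wave_rows * wave_cols
    waves_per_simd = num_waves // SIMDS_PER_CU
    vgprs = VGPRS_PER_SIMD // waves_per_simd
    tile_m, tile_n = BLOCK_M // wave_rows, BLOCK_N // wave_cols
    # Assumed: one FP32 accumulator per output element, spread across the lanes.
    acc_vgprs = tile_m * tile_n // WAVE_SIZE
    return vgprs, tile_m, tile_n, acc_vgprs


four_wave = per_wave_budget(2, 2)    # assumed 2x2 wave grid
eight_wave = per_wave_budget(4, 2)   # assumed 4x2 wave grid
print(four_wave)    # (512, 128, 128, 256)
print(eight_wave)   # (256, 64, 128, 128)

# A 128x128 wave tile splits into 64x64 blocks, each covered by 16x16 MFMA tiles.
blocks_per_wave = (four_wave[1] // 64) * (four_wave[2] // 64)
mfma_tiles_per_block = (64 // 16) * (64 // 16)
print(blocks_per_wave, mfma_tiles_per_block)   # 4 16
```

Under these assumptions, a 128x128 tile needs 256 accumulator registers per lane. That would consume an entire 256-register budget before any operands are loaded, which is one way to see why the larger tile depends on the full 512-register allocation.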

How does 4-wave interleave hide memory latency on MI355X?

The central challenge in GPU matrix multiplication kernels is hiding memory latency — the GPU must stay busy with MFMA computation while waiting for data to load from memory. The 4-wave approach solves this through precise interleaving of instructions at sub-tile granularity, without relying on the hardware scheduler.

The implementation uses __builtin_amdgcn_sched_barrier(0) to enforce instruction ordering: the compiler may not move instructions across the barrier, so it cannot batch all memory operations after the compute ones. Each interleaved_block call overlaps 16 MFMA instructions with 8 LDS-to-register loads (ds_read_b128), achieving consistent overlap without the synchronization complexity of ping-pong.
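To make the 16-to-8 overlap concrete, the sketch below builds one possible issue order in plain Python. The article does not give the exact ordering inside interleaved_block, so the even "two MFMAs, then one load" spacing, the `interleave` helper, and the string labels are all assumptions for illustration. The "sched_barrier" marker only stands in for where a call to __builtin_amdgcn_sched_barrier(0) would sit in kernel source.

```python
def interleave(mfma_ops, lds_loads):
    """Spread LDS loads evenly between MFMA ops, fencing each group (hypothetical order)."""
    assert len(mfma_ops) % len(lds_loads) == 0
    ratio = len(mfma_ops) // len(lds_loads)   # MFMAs issued per load
    schedule = []
    for i, load in enumerate(lds_loads):
        schedule.extend(mfma_ops[i * ratio:(i + 1) * ratio])
        schedule.append(load)
        # Placeholder for __builtin_amdgcn_sched_barrier(0): nothing may be
        # reordered across this point, so the group stays interleaved.
        schedule.append("sched_barrier")
    return schedule


mfmas = [f"mfma_{i}" for i in range(16)]
loads = [f"ds_read_b128_{i}" for i in range(8)]
schedule = interleave(mfmas, loads)
for op in schedule[:8]:
    print(op)
```

The first eight entries show two fenced groups, each pairing two MFMA operations with one load, so every load has compute issued alongside it instead of being collected into a separate memory phase.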

How does XOR-based LDS swizzle eliminate memory bank conflicts?

MI355X LDS (Local Data Share) memory is organized into 64 banks of 4 bytes each, so the bank mapping repeats every 256 bytes. The problem is structural: 16x16x128 MFMA instructions read their operands column-wise (column-major), and under a standard layout this causes 8-way bank conflicts, where eight accesses target the same bank at once and must be serialized.

AMD resolves this with a mathematical offset transformation:

row_bits = ((offset % (16 * 128)) >> 7) / 2
mask = row_bits << 4
swizzled_offset = offset ^ mask

The XOR operation remaps addresses so that all four phases of each ds_read_b128 instruction proceed without bank conflicts. The result is full LDS throughput, rather than the effective 12.5% (one eighth) that 8-way conflicts would allow.
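The effect of the swizzle can be reproduced with a small bank model. The sketch below applies the formula quoted above to byte offsets and counts how many accesses land on each bank. The access pattern (16 lanes, each reading the same 16-byte column chunk from a different 128-byte row), the integer division, and the helper names are assumptions for illustration. The real lane-to-phase mapping of ds_read_b128 is hardware-specific and is not modelled here.

```python
from collections import Counter

NUM_BANKS = 64     # from the article
BANK_BYTES = 4     # from the article
ROW_BYTES = 128    # assumed: one tile row holds 128 one-byte FP8 values
TILE_ROWS = 16


def swizzle(offset):
    """Apply the XOR swizzle quoted above to a byte offset (integer division assumed)."""
    row_bits = ((offset % (TILE_ROWS * ROW_BYTES)) >> 7) // 2
    mask = row_bits << 4
    return offset ^ mask


def max_conflict(addresses, access_bytes=16):
    """Largest number of accesses that touch any single bank."""
    hits = Counter()
    for addr in addresses:
        for word in range(access_bytes // BANK_BYTES):
            hits[(addr // BANK_BYTES + word) % NUM_BANKS] += 1
    return max(hits.values())


# Assumed pattern: 16 lanes read column chunk 0 of rows 0..15.
column = [row * ROW_BYTES for row in range(TILE_ROWS)]
print(max_conflict(column))                         # 8
print(max_conflict([swizzle(a) for a in column]))   # 1
```

In this model the unswizzled column hits only two groups of four banks, eight times each, while the swizzled addresses cover all 64 banks exactly once. Because the mask only touches bits 4 to 6 and is derived from higher bits, the mapping is a permutation within each row, so no two offsets collide after swizzling.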

Hardware platform and chiplet-aware optimization

All optimizations target the AMD Instinct MI355X with CDNA 4 architecture. The MI355X has an 8-chiplet (XCD) topology — each XCD has its own L2 cache, and the Last Level Cache is shared across XCDs. The ROCm team adds a chiplet-aware grid swizzle that remaps thread block assignment using a Morton curve-like mapping, maximizing L2 utilization within a single XCD and LLC usage across XCDs.
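The article does not spell out the grid swizzle, so the following is only a sketch of how a chiplet-aware, Morton-style remap could look. It assumes that the hardware hands consecutive thread block IDs to XCDs in round-robin order and that the tile grid is square with a power-of-two side. The function names (`xcd_remap`, `morton_decode`, `block_to_tile`) are hypothetical.

```python
NUM_XCDS = 8   # from the article: 8-chiplet (XCD) topology


def xcd_remap(pid, num_blocks, num_xcds=NUM_XCDS):
    """Give each XCD a contiguous range of logical IDs.

    Assumes round-robin dispatch, i.e. hardware block `pid` runs on XCD pid % num_xcds.
    """
    assert num_blocks % num_xcds == 0
    per_xcd = num_blocks // num_xcds
    xcd, local = pid % num_xcds, pid // num_xcds
    return xcd * per_xcd + local


def morton_decode(index):
    """De-interleave bits: even bits form the column, odd bits form the row."""
    row = col = bit = 0
    while index:
        col |= (index & 1) << bit
        row |= ((index >> 1) & 1) << bit
        index >>= 2
        bit += 1
    return row, col


def block_to_tile(pid, grid_dim):
    """Map a hardware block ID to an output tile (row, col) on a grid_dim x grid_dim grid."""
    assert grid_dim & (grid_dim - 1) == 0, "sketch assumes a power-of-two grid"
    return morton_decode(xcd_remap(pid, grid_dim * grid_dim))


# Tiles handled by XCD 0 on an 8x8 grid: blocks 0, 8, 16, ..., 56.
xcd0_tiles = sorted(block_to_tile(pid, 8) for pid in range(0, 64, 8))
print(xcd0_tiles)
```

With these assumptions, the eight blocks that share one XCD cover a compact 2x4 patch of output tiles, so they reuse the same few rows of A and columns of B from that XCD's L2 instead of being scattered across the grid.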

Benchmarking used 1000 warm-up and 1000 benchmark iterations on normally distributed data with rotating buffers. The ROCm team reports “consistent performance across different ROCm versions” without manual pragma tuning — unlike 8-wave variants that were sensitive to compiler changes.

Frequently Asked Questions

What is the 4-wave interleave FP8 GEMM optimization for AMD CDNA 4?

It is a GPU kernel optimization for FP8 matrix multiplication (GEMM) on the AMD Instinct MI355X. Instead of 8 waves sharing registers in pairs, it uses 4 waves, each with the full budget of 512 VGPRs, which enables larger output tiles (128x128 vs. 64x128) and better memory latency hiding.

Why does AMD use XOR-based swizzle in these GEMM kernels?

MI355X LDS memory has 64 banks, and 16x16x128 MFMA instructions naturally access columns (column-major), causing 8-way bank conflicts. The XOR swizzle remaps offsets so that ds_read_b128 accesses are conflict-free across all four phases.

Which GPU hardware do these optimizations target?

The optimizations target the AMD Instinct MI355X using CDNA 4 architecture with an 8-chiplet (XCD) topology. Beyond kernel optimization, a chiplet-aware grid swizzle maximizes L2 cache utilization per XCD and Last Level Cache usage across XCDs.
