AMD: MXFP4/MXFP6 mixed-precision quantization on MI355X delivers up to 29% higher throughput

AMD demonstrated W_MXFP4_A_MXFP6 mixed-precision quantization on the Instinct MI355X accelerator, delivering up to 29% higher throughput while keeping accuracy close to FP8, with vLLM as the production inference framework.

This article was generated using artificial intelligence from primary sources.

AMD MI355X and a new quantization strategy

AMD published results for W_MXFP4_A_MXFP6 quantization, a mixed-precision technique that stores weights in 4-bit MXFP4 and activations in 6-bit MXFP6, on its Instinct MI355X AI accelerator. The goal is to balance inference speed against numerical accuracy when serving models with the vLLM framework in a production environment.
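
To make the two formats concrete, here is a minimal sketch of block-scaled quantization in the style of the OCP Microscaling (MX) formats: 32 values share one power-of-two scale, and each value is rounded to a small floating-point grid. It is an illustration only, not AMD's kernel. The function name `mx_quantize` is hypothetical, the FP6 variant is assumed to be E2M3 (the article does not say whether E2M3 or E3M2 is used), and rounding is simple nearest-value with no clamping of the scale exponent.

```python
import math

# Representable magnitudes of each element format.
# FP4 E2M1: 1 sign, 2 exponent, 1 mantissa bit; largest value 6.0.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# FP6 E2M3 (assumed variant): 8 subnormals, then 3 binades of 8 values; largest 7.5.
FP6_E2M3 = [m / 8 for m in range(8)] + [
    (1 + m / 8) * 2.0 ** e for e in range(3) for m in range(8)
]
EMAX = 2  # exponent of the largest normal value in both grids (4 <= max < 8)


def mx_quantize(values, grid, block_size=32):
    """Quantize then dequantize `values` block by block (hypothetical helper)."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(x) for x in block)
        if amax == 0.0:
            out.extend(0.0 for _ in block)
            continue
        # floor(log2(amax)) computed exactly via frexp: amax = m * 2**e, 0.5 <= m < 1.
        shared_exp = (math.frexp(amax)[1] - 1) - EMAX
        scale = 2.0 ** shared_exp  # shared power-of-two scale for the block
        for x in block:
            mag = min(abs(x) / scale, grid[-1])           # saturate at the grid maximum
            q = min(grid, key=lambda g: abs(g - mag))     # nearest representable value
            out.append(math.copysign(q * scale, x))
    return out


def max_error(values, grid):
    return max(abs(a - b) for a, b in zip(values, mx_quantize(values, grid)))


sample = [6.0, 1.0, 0.3, -3.0]
print(mx_quantize(sample, FP4_E2M1))  # 0.3 can only land on 0.0 or 0.5
print(mx_quantize(sample, FP6_E2M3))  # the finer grid offers 0.25 and 0.375
```

Because every FP4 E2M1 value also appears in the FP6 E2M3 grid, the 6-bit format can never round worse than the 4-bit one under the same scale, which is the intuition behind spending the extra bits on activations.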

How much does throughput improve in practice?

On the Llama-3.1-8B model, W_MXFP4_A_MXFP6 delivers 29% higher throughput than the BF16 baseline. On the larger Qwen3.6-27B model the improvement is 27%. In both cases the mixed-precision configuration is a better trade-off than pure MXFP4, which suffers from greater accuracy loss.

Accuracy: the trade-off between speed and precision

Accuracy stays closer to FP8 than to pure MXFP4. On GSM8K with Llama-3.1-8B, mixed precision scores 76.42%: well above pure MXFP4 at 62.55%, but about 4 points below FP8 at 80.44%. A similar pattern emerges for Qwen3.6-27B on the AIME26 benchmark: mixed precision at 85.8%, FP8 at 86.7%, and pure MXFP4 at 80.0%.
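
One way to read these scores is to ask how much of the gap between pure MXFP4 and FP8 the mixed-precision configuration closes. The short calculation below uses only the figures quoted above; the helper name `gap_recovered` is made up for illustration.

```python
def gap_recovered(mixed, fp8, mxfp4):
    """Fraction of the FP8-to-MXFP4 accuracy gap that mixed precision closes."""
    return (mixed - mxfp4) / (fp8 - mxfp4)


# (mixed precision, FP8, pure MXFP4) scores in percent, as quoted in the article.
results = {
    "Llama-3.1-8B / GSM8K": (76.42, 80.44, 62.55),
    "Qwen3.6-27B / AIME26": (85.8, 86.7, 80.0),
}

for name, (mixed, fp8, mxfp4) in results.items():
    print(
        f"{name}: {fp8 - mixed:.2f} points below FP8, "
        f"{gap_recovered(mixed, fp8, mxfp4):.1%} of the gap recovered"
    )
```

On these numbers, mixed precision recovers roughly three quarters of the gap on GSM8K and a somewhat larger share on AIME26.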

Latency: TTFT drops by more than a second

TTFT (Time To First Token, the delay between submitting a request and receiving the first generated token) on Llama-3.1-8B falls from 6,409 ms to 5,159 ms, a reduction of 1,250 ms or roughly 19.5%. For production systems handling large numbers of concurrent requests, this latency reduction has a direct impact on user experience.
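
For readers who want to measure TTFT themselves, the sketch below times how long any token stream takes to yield its first item and reproduces the relative reduction from the two figures above. It is not tied to vLLM: `measure_ttft`, `fake_stream`, and `relative_reduction` are hypothetical names, and the simulated stream merely stands in for a real streaming client.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, the first token)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the stream produces its first item
    return time.perf_counter() - start, first


def fake_stream(prefill_seconds: float, tokens: list) -> Iterator[str]:
    """Simulated model: sleeps to mimic prompt processing, then emits tokens."""
    time.sleep(prefill_seconds)
    for token in tokens:
        yield token


def relative_reduction(before_ms: float, after_ms: float) -> float:
    return (before_ms - after_ms) / before_ms


ttft, token = measure_ttft(fake_stream(0.05, ["Hello", ",", " world"]))
print(f"first token {token!r} after {ttft * 1000:.0f} ms")
print(f"reported TTFT reduction: {relative_reduction(6409, 5159):.1%}")
```

In a real deployment the timer should start when the request is sent, so that queueing and network time are included in the measurement.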

Conclusion: a practical production trade-off

W_MXFP4_A_MXFP6 on MI355X positions itself as a mature solution for production inference: throughput close to pure MXFP4 with accuracy close to FP8, without having to choose between the two. AMD thereby competes directly with NVIDIA FP8 inference on H100/H200 GPUs, offering an alternative within the ROCm ecosystem for organizations that already use AMD hardware or want to avoid dependence on a single GPU vendor.

Frequently Asked Questions

What is mixed-precision quantization and why does it matter?
Mixed-precision quantization is an AI model compression technique in which neural network weights and activations are stored in different numerical formats (for example, 4-bit weights and 6-bit activations), reducing memory footprint and accelerating inference with minimal accuracy loss.

What is TTFT and by how much did it decrease on MI355X?
TTFT (Time To First Token) measures the latency from sending a request to receiving the first generated token. On the Llama-3.1-8B model, AMD reduced TTFT from 6,409 ms to 5,159 ms by applying the MXFP4/MXFP6 approach.
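
The memory-footprint point from the first answer can be made concrete with a back-of-the-envelope calculation. The sketch assumes the MX convention of one shared 8-bit scale per block of 32 elements and a nominal 8 billion parameters for Llama-3.1-8B. It also assumes every weight is quantized, which real deployments typically do not do, so treat the output as a rough lower bound rather than a measured figure. Both function names are hypothetical.

```python
def mx_bits_per_element(element_bits, block_size=32, scale_bits=8):
    """Effective storage per value: element bits plus the amortized shared scale."""
    return element_bits + scale_bits / block_size


def weight_gigabytes(n_params, bits_per_element):
    return n_params * bits_per_element / 8 / 1e9


N_PARAMS = 8e9  # nominal parameter count (assumption, not an exact figure)

formats = {
    "BF16": 16,
    "MXFP6": mx_bits_per_element(6),
    "MXFP4": mx_bits_per_element(4),
}
for name, bits in formats.items():
    print(f"{name}: {bits:.2f} bits/value, {weight_gigabytes(N_PARAMS, bits):.2f} GB")
```

The shared scale adds only a quarter of a bit per value, so 4-bit weights take a little over a quarter of the space of BF16 weights.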