AMD: MXFP4/MXFP6 mixed-precision quantization on MI355X delivers up to 29% higher throughput
AMD demonstrated W_MXFP4_A_MXFP6 mixed-precision quantization on the Instinct MI355X accelerator, delivering up to 29% higher throughput while keeping accuracy close to that of FP8. Inference was served through the vLLM framework.
This article was generated using artificial intelligence from primary sources.
AMD MI355X and a new quantization strategy
AMD published results for W_MXFP4_A_MXFP6 quantization on its Instinct MI355X AI accelerator. This is a mixed-precision technique that stores model weights in the 4-bit MXFP4 format and activations in the 6-bit MXFP6 format. The goal is to balance inference speed against numerical accuracy when serving models with the vLLM framework in a production environment.
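To make the 4-bit weight format more concrete, here is a minimal sketch of MXFP4-style "fake quantization" in plain Python. It assumes the OCP Microscaling layout (blocks of 32 elements sharing one power-of-two scale, with FP4 E2M1 elements) and is not AMD's implementation. The function names are hypothetical, ties are broken toward the smaller magnitude rather than by round-to-nearest-even, and the shared exponent is not clamped to the 8-bit scale range. The article does not say which MXFP6 element variant the activations use, so only the weight side is sketched.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (OCP Microscaling formats).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32          # elements that share one scale in the MX formats
E2M1_MAX_EXPONENT = 2    # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2


def quantize_block_mxfp4(block):
    """Round one block to MXFP4 and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale, chosen so the block maximum lands in the
    # top binade of the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_MAX_EXPONENT
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        # Saturate at the largest representable magnitude (6.0).
        mag = min(abs(x) / scale, FP4_E2M1_VALUES[-1])
        # Simplification: pick the nearest grid point; ties go to the smaller one.
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


def quantize_mxfp4(values):
    """Apply block-wise MXFP4 fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block_mxfp4(values[start:start + BLOCK_SIZE]))
    return out
```

For example, `quantize_mxfp4([-6.0, 2.4])` returns `[-6.0, 2.0]`: the block maximum fixes the scale at 1, and 2.4 snaps to the nearest representable value, 2.0. That coarse grid is the reason pure 4-bit formats lose accuracy, which the 6-bit activations are meant to offset.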
How much does throughput improve in practice?
On the Llama-3.1-8B model, the W_MXFP4_A_MXFP6 approach delivers 29% higher throughput than the BF16 baseline. On the larger Qwen3.6-27B model, the improvement is 27%. In both cases the mixed scheme compares favorably with pure MXFP4, which suffers greater accuracy loss.
Accuracy: the trade-off between speed and precision
Accuracy stays closer to FP8 than to pure MXFP4. On the GSM8K benchmark with Llama-3.1-8B, mixed precision scores 76.42%, which is 13.87 points above pure MXFP4 (62.55%) and 4.02 points below FP8 (80.44%). A similar pattern emerges on the AIME26 benchmark with Qwen3.6-27B: mixed precision reaches 85.8%, versus 86.7% for FP8 and 80.0% for pure MXFP4.
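One way to read these scores is to ask how much of the gap between pure MXFP4 and FP8 the mixed scheme closes. The short calculation below uses only the figures quoted above. The helper name `gap_recovered` and the "gap recovered" metric are illustrative choices, not something AMD reports.

```python
def gap_recovered(mixed, fp8, mxfp4):
    """Fraction of the FP8-vs-MXFP4 accuracy gap that mixed precision closes."""
    return (mixed - mxfp4) / (fp8 - mxfp4)


# Scores in percent, as quoted in the text.
results = {
    "Llama-3.1-8B / GSM8K": {"mixed": 76.42, "fp8": 80.44, "mxfp4": 62.55},
    "Qwen3.6-27B / AIME26": {"mixed": 85.8, "fp8": 86.7, "mxfp4": 80.0},
}

for name, r in results.items():
    below_fp8 = r["fp8"] - r["mixed"]
    above_mxfp4 = r["mixed"] - r["mxfp4"]
    share = gap_recovered(r["mixed"], r["fp8"], r["mxfp4"])
    print(f"{name}: {below_fp8:.2f} pts below FP8, "
          f"{above_mxfp4:.2f} pts above MXFP4, {share:.0%} of the gap recovered")
```

By this measure the mixed scheme recovers roughly 78% of the gap on GSM8K and roughly 87% on AIME26.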
Latency: TTFT drops by more than a second
TTFT (Time To First Token, the delay between submitting a request and the appearance of the first generated token) on Llama-3.1-8B falls from 6,409 ms to 5,159 ms, a reduction of 1,250 ms, or about 19.5%. For production systems handling large numbers of concurrent requests, this latency reduction has a direct impact on user experience.
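The size of the TTFT change can be checked directly from the two reported values. The variable names below are illustrative.

```python
# TTFT values for Llama-3.1-8B, in milliseconds, as quoted in the text.
ttft_before_ms = 6409
ttft_after_ms = 5159

saved_ms = ttft_before_ms - ttft_after_ms
relative_reduction = saved_ms / ttft_before_ms

print(f"Absolute reduction: {saved_ms} ms ({saved_ms / 1000:.2f} s)")
print(f"Relative reduction: {relative_reduction:.1%}")
```

This gives 1,250 ms in absolute terms, or a little under one fifth of the original time to first token.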
Conclusion: a practical production trade-off
W_MXFP4_A_MXFP6 on MI355X positions itself as a practical option for production inference: throughput close to pure MXFP4 with accuracy close to FP8, so operators do not have to choose between the two. With it, AMD competes directly with NVIDIA FP8 inference on H100/H200 hardware, offering an alternative within the ROCm ecosystem for organizations that already use AMD hardware or want to avoid depending on a single GPU vendor.
Frequently Asked Questions
- What is mixed-precision quantization and why does it matter?
- Mixed-precision quantization is an AI model compression technique where neural network weights and activations are stored in different numerical formats — for example, 4-bit weights and 6-bit activations — reducing memory footprint and accelerating inference with minimal accuracy loss.
- What is TTFT and by how much did it decrease on MI355X?
- TTFT (Time To First Token) measures the latency from sending a request to receiving the first generated token. On the Llama-3.1-8B model, AMD reduced TTFT from 6,409 ms to 5,159 ms by applying the MXFP4/MXFP6 approach.
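The memory-footprint point in the first answer can be made concrete with a back-of-the-envelope estimate. This is a rough sketch under stated assumptions: a nominal 8 billion parameters for Llama-3.1-8B, every weight quantized (real deployments often keep some layers in higher precision), and the MX layout of one 8-bit shared scale per block of 32 elements. The function name `weight_bytes` is hypothetical.

```python
def weight_bytes(num_params, element_bits, block_size=None, scale_bits=0):
    """Estimate weight storage in bytes, including per-block scale overhead."""
    bits_per_param = element_bits
    if block_size:
        # Each block of `block_size` elements carries one shared scale.
        bits_per_param += scale_bits / block_size
    return num_params * bits_per_param / 8


NUM_PARAMS = 8_000_000_000  # assumption: nominal size, not the exact count

bf16 = weight_bytes(NUM_PARAMS, 16)
fp8 = weight_bytes(NUM_PARAMS, 8)
mxfp4 = weight_bytes(NUM_PARAMS, 4, block_size=32, scale_bits=8)

for label, size in [("BF16", bf16), ("FP8", fp8), ("MXFP4", mxfp4)]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

Under these assumptions MXFP4 weights cost 4.25 bits per parameter, which brings the weights from about 16 GB in BF16 down to about 4.25 GB.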