AMD Eagle3 and Quark FP8: Speculative Decoding Delivers Up to 2.00x Throughput on MI355X
The AMD ROCm team published details on July 3, 2026 about the production deployment of Eagle3 speculative decoding on AMD hardware. The combination of Eagle3's multi-layer approach, the vLLM backend, and AMD Quark FP8 quantization achieves 1.69x to 2.00x throughput for Kimi-K2.5 and 1.38x to 1.79x for MiniMax-M2.5 on AMD Instinct MI355X, with no loss in output quality.
This article was generated using artificial intelligence from primary sources.
On July 3, 2026, the AMD ROCm team published a detailed account of the production deployment of Eagle3 speculative decoding on AMD GPU accelerators. The combination of the Eagle3 approach, the vLLM inference framework, and AMD Quark quantization tooling achieves up to 2.00× throughput for Kimi-K2.5 on AMD Instinct MI355X, while the output distribution of the target model is mathematically guaranteed to be preserved. The work also documents a solution to a key technical obstacle that had previously prevented the simultaneous activation of Eagle3 and AITER MLA attention in vLLM.
How Does Eagle3 Accelerate Inference Without Quality Loss?
Eagle3 is a lossless inference acceleration technique: it preserves the exact output distribution of the target model. A standard autoregressive LLM generates one token per forward pass, and each pass depends on the token before it, so the passes cannot be parallelized. Eagle3 breaks that sequential dependency by introducing a smaller draft model: the draft model proposes several candidate tokens, and the target model verifies all of them in a single shared forward pass. Tokens accepted by the target model are included in the output. At the first rejected position, the target model's own prediction is used instead and the remaining draft tokens are discarded. Because every emitted token is either approved or supplied by the target model, Eagle3 never alters the output distribution; the speedup comes purely from reducing the number of forward passes through the target model.
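The accept-or-reject step described above can be sketched with the standard speculative sampling rule. This is a minimal illustration under stated assumptions, not the vLLM or Eagle3 implementation: the function names, the list-based probability tables, and the toy vocabulary are all hypothetical. A draft token is accepted with probability min(1, p/q), where p is the target model's probability and q the draft model's; on rejection, a replacement is drawn from the normalized positive part of p - q, which is what keeps the output distribution equal to the target's.

```python
import random


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def sample(probs, rng):
    """Draw one token id from a probability list."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Verify draft tokens against target distributions (illustrative only).

    draft_probs[i] and target_probs[i] are full distributions over the
    vocabulary at position i. target_probs carries one extra entry, used to
    sample a bonus token when every draft token is accepted.
    """
    output = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i]
        p = target_probs[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            output.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and
        # discard all later draft tokens.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        output.append(sample(normalize(residual), rng))
        return output
    # Every draft token was accepted: emit one bonus token from the target.
    output.append(sample(target_probs[len(draft_tokens)], rng))
    return output
```

In this sketch one call to `verify_draft` stands for one target forward pass, and it always emits at least one token, so the worst case matches ordinary decoding in tokens per target pass.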
Eagle3’s key innovation over earlier speculative approaches is training the draft model on multi-layer features of the target model. Instead of conditioning the draft model only on the target model's last-layer representations, Eagle3 integrates low-, mid-, and high-level features from the target model. High layers carry abstract semantics, mid layers carry syntactic structure, and low layers carry lexical patterns. By combining all three levels, the draft model achieves a higher token acceptance rate than simpler approaches. A higher acceptance rate translates directly into greater speedup, because each target forward pass then yields more accepted tokens and fewer passes are needed overall.
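As a rough illustration of combining the three feature levels, the sketch below concatenates a low-, mid-, and high-layer hidden-state vector and projects the result back to the hidden size with a single linear layer. The article does not specify the fusion operator, so concatenation followed by a linear projection is an assumption here, and `fuse_features`, `weight`, and `bias` are hypothetical names; a real draft model would use learned weights and tensor libraries rather than Python lists.

```python
def fuse_features(low, mid, high, weight, bias):
    """Fuse three hidden-state vectors into one draft-model input vector.

    low, mid, high: lists of floats of length `hidden`.
    weight: list of `hidden` rows, each of length 3 * hidden (assumed layout).
    bias: list of `hidden` floats.
    """
    concat = low + mid + high  # list concatenation, length 3 * hidden
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("each weight row must have length 3 * hidden")
    if len(weight) != len(bias):
        raise ValueError("weight and bias must have the same number of rows")
    # Plain matrix-vector product plus bias.
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]
```

The projection keeps the draft model's input width equal to the hidden size no matter how many target layers are tapped.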
Kimi-K2.5 and MiniMax-M2.5 on AMD Instinct MI355X
Production speedup was measured for two frontier models on the AMD Instinct MI355X GPU, using the InferenceX benchmark suite and the ROCm software stack.
Kimi-K2.5 with MXFP4 target precision was tested with two types of Eagle3 draft models. The BF16 Eagle3 draft achieves a throughput factor of 1.69× to 1.90× on 1K/1K workloads (1,024 input tokens, 1,024 output tokens) across concurrency levels from 4 to 64 simultaneous requests. The FP8 Eagle3 draft, quantized with AMD Quark, slightly outperforms the BF16 variant: 1.76× to 2.00×, with a maximum of 2.00× at concurrency 4.
MiniMax-M2.5 with a BF16 Eagle3 draft model achieves a throughput factor of 1.38× to 1.79× across the same concurrency levels on the same MI355X hardware. The speedup is larger at lower concurrency levels, which is consistent with the expected behavior of speculative decoding: with few simultaneous requests the GPU has spare compute, so verifying several draft tokens in one pass adds little cost relative to an ordinary decoding step.
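Throughput factors like these can be related to draft quality with a simplified textbook model of speculative decoding. The sketch below is not taken from the AMD write-up: it assumes each draft token is accepted independently with probability `alpha`, that `k` tokens are drafted per step, that each draft pass costs `draft_cost` relative to one target pass, and that verifying the drafted tokens costs the same as one ordinary target pass. All three parameters are hypothetical inputs, not measured values.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target forward pass.

    Equals 1 + alpha + alpha**2 + ... + alpha**k, i.e. the accepted draft
    prefix plus the one token the target model always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, draft_cost):
    """Tokens per step divided by the relative cost of one step.

    One step costs one target pass plus k draft passes, with the target
    pass normalized to cost 1.
    """
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost)
```

The assumption that verification costs about as much as a single decoding pass holds best when the GPU is not compute-saturated, which is one way to read the larger gains at low concurrency.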
AMD Quark and the KV-Cache Incompatibility Resolution
The central contribution of this work is not merely applying Eagle3 to AMD hardware — it is the resolution of a fundamental technical obstacle. The vLLM AITER MLA backend and Eagle3 speculative decoding had a KV-cache block-size parameter incompatibility that prevented their simultaneous activation without performance degradation. AITER MLA delivers attention efficiency on long contexts, and Eagle3 accelerates sequential token generation — the combination is theoretically ideal, but was technically blocked.
AMD’s engineers resolved this incompatibility, enabling both optimizations to run together without any configuration compromises. The production configuration uses the ROCm stack, vLLM with the AITER MLA backend, and an Eagle3 draft model, with no special workarounds required.
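A launch of such a configuration might look roughly like the sketch below, which only assembles the environment and command line and does not start a server. The article does not list the exact settings, so everything here is an assumption to check against the installed vLLM and ROCm versions: the `--speculative-config` JSON keys (`method`, `model`, `num_speculative_tokens`) and the `VLLM_ROCM_USE_AITER` environment variable are written from memory of vLLM's documentation, the checkpoint paths are placeholders, and the draft length of 3 is an arbitrary illustration.

```python
import json

# Placeholder paths: the source does not name the checkpoints.
TARGET_MODEL = "path/to/kimi-k2.5-mxfp4"
DRAFT_MODEL = "path/to/kimi-k2.5-eagle3-fp8"


def build_launch(target, draft, num_speculative_tokens=3):
    """Return (env, argv) for a hypothetical vLLM launch with Eagle3.

    Flag and variable names are assumptions; verify them against the
    vLLM version in use before relying on this.
    """
    speculative_config = {
        "method": "eagle3",
        "model": draft,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # Assumed switch for the AITER kernels on ROCm builds of vLLM.
    env = {"VLLM_ROCM_USE_AITER": "1"}
    argv = [
        "vllm", "serve", target,
        "--speculative-config", json.dumps(speculative_config),
    ]
    return env, argv


env, argv = build_launch(TARGET_MODEL, DRAFT_MODEL)
```

The returned `env` and `argv` could be handed to `subprocess.run(argv, env={**os.environ, **env})` on a machine with vLLM installed; that call is left out so the sketch stays runnable anywhere.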
Using the AMD Quark quantization tool, the Kimi-K2.5 Eagle3 draft model was quantized to FP8 precision, with the LM head layer retained at higher precision for stability. The FP8 draft model not only uses less GPU memory but also marginally outperforms the BF16 variant in measurements. This finding suggests that the quantization noise of the FP8 draft model does not degrade the token acceptance rate in this context, or is at least statistically neutral for the given models and workloads.
The target hardware for all production configurations is AMD Instinct MI350X and MI355X accelerators. The work demonstrates that combining Eagle3 with FP8 quantization does not trade speed against quality: the smaller draft model leaves more HBM capacity for the target model, speculative decoding reduces the number of expensive target forward passes per generated token, and the output distribution stays that of the target model.
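The memory effect of quantizing the draft model while keeping the LM head at higher precision can be estimated with simple arithmetic. The sketch below is illustrative only: the parameter counts are made-up round numbers, not the real size of the Kimi-K2.5 draft model, the LM head is assumed to stay in BF16 (2 bytes per parameter) since the article says only "higher precision", and the small overhead of FP8 scale factors is ignored.

```python
def draft_weight_bytes(body_params, lm_head_params,
                       body_bytes_per_param, head_bytes_per_param=2):
    """Approximate weight memory of a draft model in bytes.

    body_params: parameters outside the LM head.
    lm_head_params: parameters in the LM head, assumed kept at BF16.
    """
    return (body_params * body_bytes_per_param
            + lm_head_params * head_bytes_per_param)


# Hypothetical sizes chosen only to make the arithmetic easy to follow.
BODY_PARAMS = 1_000_000_000
LM_HEAD_PARAMS = 500_000_000

bf16_bytes = draft_weight_bytes(BODY_PARAMS, LM_HEAD_PARAMS, 2)
fp8_bytes = draft_weight_bytes(BODY_PARAMS, LM_HEAD_PARAMS, 1)
saved_bytes = bf16_bytes - fp8_bytes  # one byte saved per quantized parameter
```

Because the LM head is excluded from quantization, the saving is one byte per body parameter rather than half of the total footprint.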
Frequently Asked Questions
- How does Eagle3 accelerate inference without degrading quality?
- Eagle3 uses a smaller draft model that proposes several candidate tokens, while the target model verifies all proposed tokens in a single forward pass. At a rejected position the target model's own prediction is used, so the output distribution remains mathematically identical to that of the target model, with no quality loss.
- What does AMD Quark bring to Eagle3?
- AMD Quark quantized the Kimi-K2.5 draft model to FP8 precision with the LM head kept at higher precision. The FP8 draft model uses less GPU memory and in measurements slightly outperforms the BF16 variant, achieving a maximum of 2.00x throughput on MI355X.
- On which models and hardware was the speedup demonstrated?
- Kimi-K2.5 (MXFP4 target) achieves 1.69x to 2.00x throughput, and MiniMax-M2.5 (BF16) achieves 1.38x to 1.79x, all measured on AMD Instinct MI355X with the ROCm stack and vLLM backend using AITER MLA attention.