AMD ROCm: EAGLE3 speculative decoding speeds up Kimi-K2.5 by 33% on MI325X
The AMD ROCm team demonstrated EAGLE3 speculative decoding on 8× Instinct MI325X GPUs with the Kimi-K2.5 model, achieving 33% higher output throughput and 58% lower median inter-token latency with no accuracy loss on the GSM8K benchmark.
This article was generated using artificial intelligence from primary sources.
EAGLE3 brings tree-based speculative decoding to AMD hardware
The AMD ROCm team has published results from implementing the EAGLE3 algorithm for accelerated inference on a cluster of 8× AMD Instinct MI325X GPUs, each with 256 GB of HBM memory (gfx942 architecture). The model under test was Kimi-K2.5 by Moonshot AI, a massive mixture-of-experts model with 497 GB of weights, quantized to W4A8 format (INT4 weights, INT8 activations).
Speculative decoding is a technique in which a smaller, faster draft model proposes several candidate next tokens in advance, and the main (larger) model verifies them in parallel in a single pass — instead of generating each token sequentially. EAGLE3 extends this idea with a tree-based approach: it proposes a tree of multiple hypotheses at once, increasing the probability that the large model accepts a longer sequence without recomputation.
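The draft-then-verify loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the ROCm team's implementation: it uses a single linear chain of draft tokens with greedy verification rather than EAGLE3's tree of hypotheses, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical stand-ins for real models.

```python
from typing import Callable, List, Tuple

Token = int
# A "model" here is just a greedy next-token function (an assumption for illustration).
NextToken = Callable[[List[Token]], Token]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens that were accepted."""
    # 1. The cheap draft model proposes k tokens one after another.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target model checks every proposed position. A real system scores
    #    all positions in one batched forward pass; this loop is for clarity.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop the round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target's pass yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_new], rounds


def toy_target(ctx: List[Token]) -> Token:
    # Toy "large model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Toy "draft model": agrees with the target except on one token.
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


def baseline(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    # Plain sequential decoding: one target call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


spec_out, spec_rounds = generate(toy_target, toy_draft, [0], n_new=12, k=4)
base_out = baseline(toy_target, [0], n_new=12)
print(spec_out == base_out, spec_rounds)
```

Because every accepted token is one the target model would have produced itself, the output in this greedy variant is identical to plain sequential decoding; the gain comes from needing fewer target-model rounds than tokens generated.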
What do the measurements show?
Results were measured at concurrency=40 simultaneous requests:
- Output throughput: 672 → 895 tok/s, a gain of +33.1%
- Decode latency (TPOT): 42.73 → 27.41 ms, a drop of 35.9%
- Median inter-token latency (ITL): 27.98 → 11.75 ms, a drop of 58.0%
Without EAGLE3, the median wait between tokens was nearly 28 ms. With EAGLE3, that wait drops to under 12 ms, less than half the baseline. Accuracy on the GSM8K math benchmark remains above 0.93, with no regression.
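The relative changes quoted above can be rechecked from the before and after values. The snippet below is a minimal sketch that uses only the rounded figures from this article; the helper name `pct_change` is hypothetical, and last-digit differences from the published percentages are possible if those were computed from unrounded measurements.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs as quoted in the article, measured at concurrency=40.
metrics = {
    "output throughput (tok/s)": (672.0, 895.0),
    "TPOT (ms)": (42.73, 27.41),
    "median ITL (ms)": (27.98, 11.75),
}

changes = {name: pct_change(b, a) for name, (b, a) in metrics.items()}

for name, change in changes.items():
    # Positive means the value rose, negative means it fell.
    print(f"{name}: {change:+.1f}%")
```

For throughput a positive change is the improvement, while for the two latency metrics a negative change is.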
Why does this matter for the AMD ecosystem?
The result demonstrates that AMD MI325X is not merely a paper alternative to NVIDIA hardware, but can deliver concrete speedups for production MoE models through software optimizations in the ROCm stack — without changing hardware or sacrificing model quality.
Frequently Asked Questions
- What is speculative decoding and why does it speed up text generation?
- Speculative decoding is a technique where a smaller draft model quickly proposes several upcoming tokens, and the large model verifies them in parallel — instead of generating one token at a time, which reduces inter-token waiting time.
- Does the EAGLE3 speedup come at the cost of model accuracy?
- No. The GSM8K benchmark score remains above 0.93 with no regression, so on that benchmark Kimi-K2.5 keeps its accuracy while achieving significantly lower latency.