🤖 24 AI
🟡 📦 Open Source Tuesday, April 21, 2026 · 3 min read

AMD FLy: Training-Free Speculative Decoding Delivers 5.21× Speedup on Llama-3.3-405B with Over 99% Accuracy

Editorial illustration of speculative decoding — draft model proposes tokens, target model verifies them in parallel

Why it matters

AMD FLy is a new training-free speculative decoding method that achieves a 4.80× to 5.21× speedup on Llama-3.3-405B and 2.74× on Llama-3.1-70B by semantically accepting draft tokens, while keeping output accuracy above 99%.

What is AMD FLy?

AMD researchers presented FLy on April 20, 2026 — a new speculative decoding method that works without additional model training. Speculative decoding is a technique where a smaller, faster “draft” model predicts the next several tokens in advance, while a larger “target” model verifies them in parallel — if correct, generation proceeds faster.
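To make the draft/verify split concrete, here is a toy greedy version of the classic loop. `draft_next` and `target_next` are stand-in single-token predictors, not anything from AMD's implementation; a real system verifies all draft positions in one batched target forward pass rather than a Python loop.

```python
def speculative_decode(prompt, draft_next, target_next, k=4, max_new=16):
    """Toy greedy speculative decoding with classic exact-match verification."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # The cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # The target model checks every drafted position (one pass in practice).
        accepted, ctx = [], list(out)
        for t in proposal:
            expected = target_next(ctx)
            if t != expected:
                # First mismatch: keep the target's token and stop this round,
                # so at least one token is generated per target pass.
                accepted.append(expected)
                break
            accepted.append(t)
            ctx.append(t)
        out.extend(accepted)
    return out[len(prompt):]
```

Because every accepted token is exactly what the target would have produced, the output is identical to plain autoregressive decoding; speed comes only from how often the draft guesses right.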

Until now, the best methods, such as EAGLE-3, required a dedicated training phase for the draft model, which is expensive and complex. FLy removes that barrier: its training-free results surpass training-based approaches.

How does FLy accept “incorrect” tokens?

The key innovation is that FLy accepts draft tokens that are semantically correct, even when they differ from the target model’s predictions. Classic speculative decoding requires exact match — a token must be identical to what the target model would generate on its own. FLy relaxes that rule using a two-stage verification:

  • Entropy gate — detects ambiguity levels per token and decides when disagreement can be accepted without compromising output quality
  • Deferred window mechanism — temporarily accepts disagreement, then monitors the next 6 tokens for retroactive verification; if context develops correctly, the token stays, otherwise it is rolled back

This logic allows more draft predictions to pass verification, directly translating to greater speedup.

What are the actual results on Llama models?

The benchmarks AMD presents are significant:

  • Llama-3.3-405B — speedup of 4.80× to 5.21×
  • Llama-3.1-70B — speedup of 2.74×
  • Accuracy above 99% relative to output without speculative decoding
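To see why acceptance behavior drives numbers like these, the standard speculative-decoding analysis (not anything FLy-specific) models the expected tokens generated per target pass from the per-token acceptance rate α and draft length γ:

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens per target-model pass when each draft token is accepted
    independently with probability alpha and gamma tokens are drafted:
    (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Illustrative only: raising alpha from 0.5 to 0.8 at gamma = 6 roughly
# doubles the tokens produced per expensive target pass.
```

This is why relaxing exact match pays off so sharply: semantic acceptance raises α directly. The end-to-end 5× figure also depends on the draft-to-target cost ratio, which the analysis above leaves out.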

On the Llama-3.3 Instruct benchmark, FLy outperforms EAGLE-3, currently the leading training-based method. That matters: a small team without the resources to train a draft model can now beat results from teams that have that infrastructure.

Why does this matter for the AMD ecosystem?

AMD has lagged behind NVIDIA in the AI software stack for years, and ROCm optimizations are critical for competitiveness. FLy demonstrates that AMD’s research team is working on techniques specific to their hardware — not just porting NVIDIA ideas.

In practice, anyone already serving Llama models on AMD MI300X or similar GPUs can get a 2.7–5.2× speedup without retraining, without changing the model, and without compromising output quality. For production systems, this is a direct cost saving.

Implications for open-source inference

FLy is significant because it lowers the barrier to high-performance inference — you no longer need a specially trained draft model to achieve state-of-the-art speed. For the open-source community hosting models like Llama in their own infrastructure, this means:

  • Easier experimentation with large models (405B becomes accessible)
  • Lower cost per query in self-hosted deployments
  • An alternative for teams without resources for EAGLE-style training

If the method ships as an open-source implementation within the ROCm stack, it could become the standard for AMD inference deployments during 2026.

🤖

This article was generated using artificial intelligence from primary sources.