🤖 24 AI
🟡 📦 Open Source Tuesday, April 21, 2026 · 3 min read

AMD FLy: Training-Free Speculative Decoding Delivers 5.21× Speedup on Llama-3.3-405B with Over 99% Accuracy

Editorial illustration of speculative decoding — draft model proposes tokens, target model verifies them in parallel

Why it matters

AMD FLy is a new training-free speculative decoding method that achieves a 4.80× to 5.21× speedup on Llama-3.3-405B and 2.74× on Llama-3.1-70B by semantically accepting draft tokens, while keeping output accuracy above 99%.

What is AMD FLy?

AMD researchers presented FLy on April 20, 2026 — a new speculative decoding method that works without additional model training. Speculative decoding is a technique where a smaller, faster “draft” model predicts the next several tokens in advance, while a larger “target” model verifies them in parallel — if correct, generation proceeds faster.
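To make the draft/verify split concrete, here is a toy greedy version of the classic loop. `draft_next` and `target_next` are stand-in single-token predictors, not anything from AMD's implementation; a real system verifies all draft positions in one batched target forward pass rather than a Python loop.

```python
def speculative_decode(prompt, draft_next, target_next, k=4, max_new=16):
    """Toy greedy speculative decoding with classic exact-match verification."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # The cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # The target model checks every drafted position (one pass in practice).
        accepted, ctx = [], list(out)
        for t in proposal:
            expected = target_next(ctx)
            if t != expected:
                # First mismatch: keep the target's token and stop this round,
                # so at least one token is generated per target pass.
                accepted.append(expected)
                break
            accepted.append(t)
            ctx.append(t)
        out.extend(accepted)
    return out[len(prompt):]
```

Because every accepted token is exactly what the target would have produced, the output is identical to plain autoregressive decoding; speed comes only from how often the draft guesses right.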

Until now, the best methods, such as EAGLE-3, required a dedicated training phase for the draft model, which is expensive and complex. FLy removes that barrier: its training-free results surpass training-based approaches.

How does FLy accept “incorrect” tokens?

The key innovation is that FLy accepts draft tokens that are semantically correct, even when they differ from the target model’s predictions. Classic speculative decoding requires exact match — a token must be identical to what the target model would generate on its own. FLy relaxes that rule using a two-stage verification:

  • Entropy gate — detects ambiguity levels per token and decides when disagreement can be accepted without compromising output quality
  • Deferred window mechanism — temporarily accepts disagreement, then monitors the next 6 tokens for retroactive verification; if context develops correctly, the token stays, otherwise it is rolled back

This logic allows more draft predictions to pass verification, directly translating to greater speedup.

What are the actual results on Llama models?

The benchmarks AMD presents are significant:

  • Llama-3.3-405B — speedup of 4.80× to 5.21×
  • Llama-3.1-70B — speedup of 2.74×
  • Accuracy above 99% relative to output without speculative decoding
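To see why acceptance behavior drives numbers like these, the standard speculative-decoding analysis (not anything FLy-specific) models the expected tokens generated per target pass from the per-token acceptance rate α and draft length γ:

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens per target-model pass when each draft token is accepted
    independently with probability alpha and gamma tokens are drafted:
    (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Illustrative only: raising alpha from 0.5 to 0.8 at gamma = 6 roughly
# doubles the tokens produced per expensive target pass.
```

This is why relaxing exact match pays off so sharply: semantic acceptance raises α directly. The end-to-end 5× figure also depends on the draft-to-target cost ratio, which the analysis above leaves out.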

On the Llama-3.3 Instruct benchmark, FLy outperforms EAGLE-3, currently the leading training-based method. That matters: a small team without the resources to train a draft model can now beat results from teams that have that infrastructure.

Why does this matter for the AMD ecosystem?

AMD has lagged behind NVIDIA in the AI software stack for years, and ROCm optimizations are critical for competitiveness. FLy demonstrates that AMD’s research team is working on techniques specific to their hardware — not just porting NVIDIA ideas.

In practice, anyone already serving Llama models on AMD MI300X or similar GPUs can get a 2.7–5.2× speedup without retraining, without changing the model, and without compromising output quality. For production systems, this is a direct cost saving.

Implications for open-source inference

FLy is significant because it lowers the barrier to high-performance inference — you no longer need a specially trained draft model to achieve state-of-the-art speed. For the open-source community hosting models like Llama in their own infrastructure, this means:

  • Easier experimentation with large models (405B becomes accessible)
  • Lower cost per query in self-hosted deployments
  • An alternative for teams without resources for EAGLE-style training

If the method ships as an open-source implementation within the ROCm stack, it could become the standard for AMD inference deployments during 2026.

🤖

This article was generated using artificial intelligence from primary sources.