
AWS: P-EAGLE Parallel Speculative Decoding Accelerates Inference Up to 3.97×


AWS introduced P-EAGLE, a parallel speculative decoding method that predicts all speculative tokens in a single model pass. On Qwen3-Coder-30B-A3B running in FP8 on an NVIDIA B200 GPU, P-EAGLE reaches 1,167 tokens per second versus 955 for EAGLE-3 (+22%), and is up to 3.97× faster than baseline inference on HumanEval. Pre-trained P-EAGLE heads are available for GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B, and Gemma-4-31B on Amazon SageMaker AI.


This article was generated using artificial intelligence from primary sources.

AWS introduced P-EAGLE, a parallel speculative decoding method that significantly accelerates inference for large language models on Amazon SageMaker AI.

What is speculative decoding and what does P-EAGLE change?

Speculative decoding is a technique in which a smaller “drafter” model proposes several tokens ahead, and the main model verifies them all in one pass instead of generating them token by token. Classic EAGLE produces its draft tokens sequentially, so each additional draft token costs another drafter pass. P-EAGLE predicts all speculative tokens in a single model pass using learned placeholder representations, eliminating that sequential bottleneck.
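The draft-then-verify loop described above can be illustrated with a toy sketch. This is not AWS's implementation: `verify_greedy`, `drafter_passes`, and `toy_target` are hypothetical names, the acceptance rule is assumed to be simple greedy matching, and the "target model" is just a deterministic function over integer token IDs. The sketch shows only two things: how many tokens one verification step can emit, and how many drafter passes a sequential versus a parallel drafter needs for `k` draft tokens.

```python
from typing import Callable, List


def verify_greedy(context: List[int], drafts: List[int],
                  target_next: Callable[[List[int]], int]) -> List[int]:
    """Return the tokens emitted by one verification step (greedy matching).

    Draft tokens are accepted while they equal the target's own choice.
    At the first mismatch the target's token is emitted instead and the
    step ends. If every draft is accepted, the target adds one extra token.
    A real system scores all positions in one batched forward pass; this
    loop only simulates the outcome of that pass.
    """
    emitted: List[int] = []
    for tok in drafts:
        expected = target_next(context + emitted)
        if tok != expected:
            emitted.append(expected)  # correction replaces the bad draft
            return emitted
        emitted.append(tok)
    emitted.append(target_next(context + emitted))  # extra token after full acceptance
    return emitted


def drafter_passes(k: int, parallel: bool) -> int:
    """Drafter forward passes needed to propose k tokens (assumed cost model)."""
    return 1 if parallel else k


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the main model: deterministic next token."""
    return (sum(context) * 31 + len(context)) % 50


context = [1, 2, 3]

# A perfect draft of 3 tokens, built from the toy target itself.
good: List[int] = []
for _ in range(3):
    good.append(toy_target(context + good))

# The same draft with its second token corrupted.
bad = [good[0], (good[1] + 1) % 50, good[2]]

all_accepted = verify_greedy(context, good, toy_target)    # 3 drafts + 1 extra token
partly_accepted = verify_greedy(context, bad, toy_target)  # 1 draft + 1 correction

print(len(all_accepted), len(partly_accepted))
print(drafter_passes(4, parallel=False), drafter_passes(4, parallel=True))
```

In this toy model, the verification step is identical for both approaches. The difference lies only in drafting cost: `k` passes for a sequential drafter versus one pass for a parallel one.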

How much faster is P-EAGLE?

With the Qwen3-Coder-30B-A3B model running in FP8 precision on an NVIDIA B200 GPU, P-EAGLE achieves 1,167 tokens per second on HumanEval versus 955 for EAGLE-3, a 22% improvement. Compared to a baseline of 294 tokens per second, the speedup reaches up to 3.97× on HumanEval and 2.97× on SPEED-Bench. The largest reported gain is on HumanEval, a code generation benchmark.
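The HumanEval ratios follow directly from the throughput figures quoted above. The short check below uses only those three numbers; the variable names are illustrative, and the EAGLE-3 versus baseline ratio is derived here rather than quoted from AWS.

```python
# Throughput figures quoted in the article (tokens per second, HumanEval).
p_eagle_tps = 1167
eagle3_tps = 955
baseline_tps = 294

# Relative gain of P-EAGLE over EAGLE-3.
gain_over_eagle3 = p_eagle_tps / eagle3_tps - 1

# Speedup of each method over baseline inference.
p_eagle_speedup = p_eagle_tps / baseline_tps
eagle3_speedup = eagle3_tps / baseline_tps  # derived, not quoted by AWS

print(f"P-EAGLE vs EAGLE-3:  +{gain_over_eagle3:.0%}")
print(f"P-EAGLE vs baseline: {p_eagle_speedup:.2f}x")
print(f"EAGLE-3 vs baseline: {eagle3_speedup:.2f}x")
```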

How is it used in practice?

AWS offers pre-trained P-EAGLE heads for GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B, and Gemma-4-31B, available on Amazon SageMaker AI. This allows development teams to accelerate inference without training their own drafter, which is key to reducing costs and latency in production.

Frequently Asked Questions

What is speculative decoding?
An inference acceleration technique in which a smaller model proposes multiple tokens at once, and the main model verifies them in a single pass.
How much does P-EAGLE speed up inference?
Up to 3.97× faster than baseline inference on HumanEval, and 22% faster than EAGLE-3 (1,167 versus 955 tokens per second on Qwen3-Coder-30B-A3B).
Which models have P-EAGLE heads available?
GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B, and Gemma-4-31B on Amazon SageMaker AI.