AWS: P-EAGLE Parallel Speculative Decoding Accelerates Inference Up to 3.97×
AWS introduced P-EAGLE, a parallel speculative decoding method that predicts all speculative tokens in a single model pass. With Qwen3-Coder-30B-A3B running on an NVIDIA B200 in FP8, P-EAGLE reaches 1,167 tokens per second versus 955 for EAGLE-3 (+22%) and is up to 3.97× faster than baseline inference on HumanEval. Pre-trained P-EAGLE heads are available for GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B, and Gemma-4-31B on Amazon SageMaker AI.
This article was generated using artificial intelligence from primary sources.
AWS has introduced P-EAGLE, a parallel speculative decoding method that speeds up large language model inference on Amazon SageMaker AI by up to 3.97× over baseline on HumanEval.
What is speculative decoding and what does P-EAGLE change?
Speculative decoding is a technique in which a smaller “drafter” model proposes several tokens ahead, and the main model verifies them all at once instead of generating token by token. Classic EAGLE drafts those tokens sequentially, one after another. P-EAGLE instead predicts all speculative tokens in a single model pass using learned placeholder representations, which removes the sequential bottleneck.
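The draft-and-verify loop described above can be sketched with toy models. This is a minimal illustration under stated assumptions, not AWS's implementation: `target_next`, `draft_sequential`, `draft_parallel`, `verify`, and `generate` are hypothetical names, the "models" are simple deterministic rules, and acceptance is greedy (a drafted token is kept only if it matches what the main model would have produced). The toy does not model learned placeholder representations; the parallel drafter only differs in being counted as one pass instead of k.

```python
from typing import Callable, List, Sequence, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(ctx: Sequence[int]) -> int:
    """Toy stand-in for the main model: a deterministic next-token rule."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def _draft_guess(ctx: Sequence[int]) -> int:
    """Toy drafter rule: agrees with the target unless len(ctx) is a multiple of 4."""
    tok = target_next(ctx)
    return (tok + 1) % VOCAB if len(ctx) % 4 == 0 else tok


def draft_sequential(ctx: Sequence[int], k: int) -> Tuple[List[int], int]:
    """Sequential drafting: k dependent drafter passes, one per token."""
    work, draft = list(ctx), []
    for _ in range(k):
        tok = _draft_guess(work)
        draft.append(tok)
        work.append(tok)
    return draft, k


def draft_parallel(ctx: Sequence[int], k: int) -> Tuple[List[int], int]:
    """Parallel drafting: same k tokens here, but counted as a single pass."""
    draft, _ = draft_sequential(ctx, k)
    return draft, 1


def verify(ctx: Sequence[int], draft: Sequence[int]) -> List[int]:
    """Greedy verification: keep the matching prefix, then add one target token.

    A real system scores all draft positions in one batched pass of the main
    model; the loop below only emulates the result of that pass.
    """
    work, accepted = list(ctx), []
    for tok in draft:
        expected = target_next(work)
        if tok != expected:
            accepted.append(expected)  # correction at the first mismatch
            return accepted
        accepted.append(tok)
        work.append(tok)
    accepted.append(target_next(work))  # bonus token when every draft matches
    return accepted


Drafter = Callable[[Sequence[int], int], Tuple[List[int], int]]


def generate(prompt: Sequence[int], n_new: int, k: int, drafter: Drafter):
    """Return (tokens, main-model passes, drafter passes)."""
    out = list(prompt)
    limit = len(prompt) + n_new
    target_passes = drafter_passes = 0
    while len(out) < limit:
        draft, passes = drafter(out, k)
        drafter_passes += passes
        out.extend(verify(out, draft))
        target_passes += 1
    return out[:limit], target_passes, drafter_passes


def greedy(prompt: Sequence[int], n_new: int) -> List[int]:
    """Baseline: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


prompt = [3, 1, 4]
seq_out, seq_target, seq_draft = generate(prompt, 24, 4, draft_sequential)
par_out, par_target, par_draft = generate(prompt, 24, 4, draft_parallel)
print("main-model passes:", seq_target, "vs 24 for the baseline")
print("drafter passes, sequential vs parallel:", seq_draft, par_draft)
```

In this sketch both drafters yield exactly the tokens that plain greedy decoding would produce; the difference is only in how many drafter passes each round costs, which is the bottleneck the parallel approach targets.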
How much faster is P-EAGLE?
With the Qwen3-Coder-30B-A3B model on an NVIDIA B200 GPU at FP8 precision, P-EAGLE achieves 1,167 tokens per second versus 955 for EAGLE-3 on HumanEval, a 22% improvement. Against a baseline of 294 tokens per second, that is a speedup of up to 3.97× on HumanEval; on SPEED-Bench the speedup reaches up to 2.97×. HumanEval measures code generation, so these gains bear directly on coding workloads.
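The headline ratios follow directly from the throughput figures quoted above. The short check below is only arithmetic on those reported HumanEval numbers; the helper name `speedup` is introduced here for illustration.

```python
def speedup(new_tps: float, old_tps: float) -> float:
    """Throughput ratio: how many times faster the new setup is."""
    return new_tps / old_tps


# Tokens per second on HumanEval, as reported in the article.
P_EAGLE_TPS = 1167
EAGLE3_TPS = 955
BASELINE_TPS = 294

vs_eagle3 = speedup(P_EAGLE_TPS, EAGLE3_TPS)
vs_baseline = speedup(P_EAGLE_TPS, BASELINE_TPS)

print(f"vs EAGLE-3:  +{(vs_eagle3 - 1) * 100:.0f}%")   # +22%
print(f"vs baseline: {vs_baseline:.2f}x")               # 3.97x
```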
How is it used in practice?
AWS offers pre-trained P-EAGLE heads for GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B, and Gemma-4-31B, available on Amazon SageMaker AI. This allows development teams to accelerate inference without training their own drafter, which is key to reducing costs and latency in production.
Frequently Asked Questions
- What is speculative decoding?
- An inference acceleration technique in which a smaller drafter model proposes several tokens ahead, and the main model verifies them all in a single pass.
- How much does P-EAGLE speed up inference?
- Up to 3.97× faster than baseline inference on HumanEval, and +22% compared to EAGLE-3.
- Which models have P-EAGLE heads available?
- GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B, and Gemma-4-31B on Amazon SageMaker AI.