AMD: FarSkip-Collective speeds up MoE inference by 18–34% on AMD GPUs
The AMD ROCm team introduced FarSkip-Collective, a modified MoE architecture that eliminates GPU idle time during Expert Parallelism communication. Results: 18% lower TTFT for Llama-4 Scout, up to 1.34× speedup for DeepSeek-V3, and 11% faster Moonlight pre-training.
This article was generated using artificial intelligence from primary sources.
What did AMD announce?
The AMD ROCm team introduced FarSkip-Collective, a modified MoE (Mixture of Experts) architecture that addresses GPU idle time during Expert Parallelism communication. Instead of blocking on the activation exchange, the next layer starts from "partial or stale activation state already available" while communication proceeds in parallel, eliminating blocking synchronization bubbles.
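The mechanics can be pictured with a short sketch. This is a hypothetical illustration under invented names (SkipAheadBlock, wait_for_experts), not AMD's code: the next layer computes on the hidden state already resident on the GPU, and the late-arriving expert contribution is merged in once the exchange finishes.

```python
import torch
import torch.nn as nn

class SkipAheadBlock(nn.Module):
    """Hypothetical sketch of the stale-state idea, not AMD's implementation."""

    def __init__(self, dim: int):
        super().__init__()
        self.next_layer = nn.Linear(dim, dim)  # stand-in for the next layer's compute

    def forward(self, stale_hidden: torch.Tensor, wait_for_experts):
        # 1) Start next-layer compute on the (stale) state already on this GPU.
        early = self.next_layer(stale_hidden)
        # 2) Block on the expert exchange only now; it overlapped with step 1.
        expert_out = wait_for_experts()
        # 3) Fold the late-arriving expert contribution into the result.
        return early + expert_out

block = SkipAheadBlock(64)
hidden = torch.randn(4, 64)
out = block(hidden, lambda: torch.zeros(4, 64))  # placeholder for a completed exchange
```

The key design point is that the blocking wait moves from before the next layer's compute to after it, which is what turns the synchronization bubble into useful work.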
What are MoE and Expert Parallelism?
MoE is an architecture in which only a subset of "experts" (specialized sub-networks) is activated per token, rather than the entire model. Expert Parallelism distributes those experts across multiple GPUs, which requires inter-GPU communication to route each token's activations to the devices hosting its chosen experts.
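As a concrete illustration, here is a minimal single-GPU top-k MoE layer in PyTorch (TinyMoE is an invented name; in real Expert Parallelism the experts below would live on different GPUs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)             # scores each expert per token
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (tokens, dim)
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(16, 64)).shape)                  # torch.Size([16, 64])
```

Only top_k of the n_experts run per token, which is where MoE saves compute; sharding self.experts across GPUs is what creates the communication this article is about.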
TTFT (Time to First Token) is the latency from a user query to the first output token, the key metric for interactive LLM applications.
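TTFT is straightforward to measure around any streaming generation API. The sketch below assumes a hypothetical stream_tokens generator and is not tied to any particular serving stack:

```python
import time

def measure_ttft(stream_tokens, prompt: str) -> float:
    """Seconds from sending the prompt until the first token arrives."""
    start = time.perf_counter()
    for _ in stream_tokens(prompt):        # first yielded token ends the measurement
        return time.perf_counter() - start
    raise RuntimeError("stream produced no tokens")
```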
How much faster is inference?
AMD reports concrete results on the ROCm platform:
- 18% lower TTFT for Llama-4 Scout
- up to 1.34× speedup for DeepSeek-V3 (671 billion parameters)
- 11% faster pre-training for the Moonlight model
- 16% additional speedup when combined with grouped-query attention (GQA)
Results were measured on AMD Instinct GPUs, and the approach does not alter MoE outputs: accuracy is maintained against standard baselines.
Why does overlapping matter?
In classic Expert Parallelism, the GPU must wait for the previous layer to finish exchanging activations before the next layer starts. This creates a "bubble": time during which compute units are idle.
FarSkip-Collective overlaps that communication with the next layer’s computation, so the GPU rarely waits. The result is higher average hardware utilization without additional cost.
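In PyTorch terms, the overlap can be sketched with an asynchronous collective. This is not AMD's code: overlapped_step is an invented helper, and it assumes an initialized process group on an all-to-all-capable backend such as NCCL/RCCL.

```python
import torch.distributed as dist

def overlapped_step(send_buf, recv_buf, next_layer, local_hidden):
    # Issue the expert all-to-all without blocking.
    work = dist.all_to_all_single(recv_buf, send_buf, async_op=True)
    # Fill the would-be bubble: run the next layer on locally available state.
    partial = next_layer(local_hidden)
    # Block only at the point where the exchanged activations are needed.
    work.wait()
    return partial, recv_buf
```

A blocking version would wait on the collective before calling next_layer, which is exactly the bubble described above.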
Frequently Asked Questions
- What is MoE architecture?
- Mixture of Experts is an architecture where only a subset of specialized sub-networks (experts) is activated per token instead of the full model, reducing computational cost.
- How much speedup does DeepSeek-V3 get?
- Up to 1.34× faster execution for the 671-billion-parameter DeepSeek-V3 model during inference.
- Does model accuracy suffer?
- No. AMD states accuracy is maintained compared to standard MoE baseline models.