DeepSpeed gets the Muon optimizer: 35% faster training with less memory
The PyTorch Blog announced on 3 June 2026 that DeepSpeed has gained full support for the Muon optimizer in a hybrid implementation. Muon keeps only one momentum buffer per parameter, reducing optimizer memory by about 45%, and on the NanoGPT benchmark it trains 35% faster than AdamW. The technique is already used by the Kimi-K2, GLM-5 and DeepSeek-V4 models.
This article was generated using artificial intelligence from primary sources.
DeepSpeed has gained full support for the Muon optimizer, the PyTorch Blog announced on 3 June 2026. Muon is an optimizer (an algorithm for updating model weights during training) that promises faster training with significantly lower memory usage than the standard AdamW, and its integration into DeepSpeed makes it easier to apply to large-scale models.
What does the Muon optimizer bring?
Muon’s key advantage is that it keeps only one state buffer per parameter, a momentum buffer that accumulates gradients, while AdamW keeps two (first- and second-moment estimates). As a result, the optimizer consumes about 45% less memory. Memory is often the bottleneck when training large models, so this saving directly allows larger models or larger batches on the same hardware.
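The buffer arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not DeepSpeed's own accounting: it assumes optimizer state is stored in fp32 (4 bytes per value), the function name is made up for illustration, and the 1B-parameter model and the 10% share of parameters left on AdamW are hypothetical inputs. The source does not say how its 45% figure was derived.

```python
BYTES_PER_VALUE = 4  # assumption: optimizer state is stored in fp32


def optimizer_state_bytes(num_params, adamw_fraction):
    """Estimate optimizer state size when a fraction of the parameters uses
    AdamW (two buffers each) and the rest uses Muon (one buffer each)."""
    adamw_params = num_params * adamw_fraction
    muon_params = num_params - adamw_params
    return BYTES_PER_VALUE * (2 * adamw_params + 1 * muon_params)


num_params = 1_000_000_000  # hypothetical 1B-parameter model

baseline = optimizer_state_bytes(num_params, adamw_fraction=1.0)  # pure AdamW
hybrid = optimizer_state_bytes(num_params, adamw_fraction=0.10)   # hypothetical split
saving = 1 - hybrid / baseline

print(f"AdamW:  {baseline / 2**30:.2f} GiB")
print(f"Hybrid: {hybrid / 2**30:.2f} GiB")
print(f"Saving: {saving:.0%}")
```

If every parameter used Muon, the saving in this model would be exactly half. Any share of parameters that stays on AdamW pulls the figure below 50%.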
DeepSpeed does not apply Muon in isolation but in a hybrid implementation. Muon is used for 2D weights in the attention and MLP layers, while for embeddings and normalization layers AdamW takes over as a fallback. This approach maintains stability on the layers that Muon does not suit and achieves savings where it is most effective.
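The routing rule described above can be illustrated with a small, framework-free Python sketch. It is not DeepSpeed's implementation: the `ParamInfo` type, the name hints, the example parameter names and shapes, and the choice to send every non-2D tensor to AdamW are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ParamInfo:
    name: str
    shape: tuple


# Assumption: substrings marking layers that stay on AdamW.
# Real models name their layers differently.
ADAMW_NAME_HINTS = ("embed", "norm")


def choose_optimizer(param):
    """Return "muon" for 2D weights outside embedding/norm layers, else "adamw"."""
    is_2d = len(param.shape) == 2
    is_fallback_layer = any(hint in param.name for hint in ADAMW_NAME_HINTS)
    return "muon" if is_2d and not is_fallback_layer else "adamw"


def split_params(params):
    """Group parameter names by the optimizer they would be assigned to."""
    groups = {"muon": [], "adamw": []}
    for p in params:
        groups[choose_optimizer(p)].append(p.name)
    return groups


# Hypothetical parameter list for a tiny transformer block.
params = [
    ParamInfo("tok_embed.weight", (50257, 768)),
    ParamInfo("blocks.0.attn.qkv.weight", (2304, 768)),
    ParamInfo("blocks.0.mlp.fc.weight", (3072, 768)),
    ParamInfo("blocks.0.norm.weight", (768,)),
]

groups = split_params(params)
print(groups)
```

Note that the embedding matrix is two-dimensional, so a check on shape alone would not be enough. This is why the sketch also looks at the layer name.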
How much faster is Muon than AdamW?
On the NanoGPT benchmark, Muon trains 35% faster than AdamW. Moreover, it reaches the performance of the GPT-2 XL model about 25% earlier than AdamW, which means it arrives at the same quality with fewer training steps. A faster path to the goal and lower memory usage together reduce both the time and the cost of training.
These figures come from a single reference benchmark, so results on other workloads may differ, but the direction is clear: Muon's efficiency advantage has been measured, not just predicted.
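The phrase "35% faster" can be read in two ways, and they lead to different wall-clock times. The sketch below works through both readings. The 100-hour baseline is a hypothetical number, the function names are made up, and the source does not say which reading it intends.

```python
def hours_if_duration_cut(baseline_hours, pct):
    """Reading 1: training takes pct percent less wall-clock time."""
    return baseline_hours * (1 - pct / 100)


def hours_if_throughput_up(baseline_hours, pct):
    """Reading 2: training progresses pct percent faster per hour."""
    return baseline_hours / (1 + pct / 100)


baseline_hours = 100.0  # hypothetical AdamW training run

cut = hours_if_duration_cut(baseline_hours, 35)
up = hours_if_throughput_up(baseline_hours, 35)

print(f"35% less time:         {cut:.1f} h")
print(f"35% higher throughput: {up:.1f} h")
```

Under the first reading the run finishes in about two thirds of the baseline time, and under the second in roughly three quarters.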
How does Muon perform in fine-tuning?
When fine-tuning the Moonlight-16B-A3B model, which has an MoE architecture (Mixture of Experts, a model with multiple specialized subnetworks), Muon beats AdamW on three out of four measured metrics. On MMLU it achieves 0.678 versus 0.660, on MBPP+ 0.548 versus 0.534, and on GSM8K 0.810 versus 0.805. The differences on these three metrics are moderate, but all of them favor Muon.
The memory advantage has also been measured in practice: on the Qwen2.5-3B model, Muon saved 9% of memory, or about 3 GiB. The declared saving therefore shows up on a concrete model, not just on paper.
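As a rough consistency check, the two Qwen2.5-3B figures imply a baseline memory footprint. The calculation assumes that 9% and 3 GiB describe the same saving against the same baseline. The source does not say what that baseline is, so the result is only an implied value, not a reported measurement.

```python
saving_gib = 3.0        # from the text: about 3 GiB
saving_fraction = 0.09  # from the text: 9%

# Assumption: both figures refer to the same baseline measurement.
implied_baseline_gib = saving_gib / saving_fraction
implied_after_gib = implied_baseline_gib - saving_gib

print(f"Implied AdamW baseline: {implied_baseline_gib:.1f} GiB")
print(f"Implied with Muon:      {implied_after_gib:.1f} GiB")
```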
Who already uses Muon?
Muon is not an experiment but a proven optimizer in training models of the largest scale. It is already used by Kimi-K2 with a trillion (1T) parameters, GLM-5 with 744 billion parameters, and DeepSeek-V4 with 1.6 trillion (1.6T) parameters. The fact that models of this scope have adopted it is a strong signal of its reliability.
By arriving in DeepSpeed, one of the most widely used frameworks for training large models, Muon becomes available to a wider circle of researchers and teams who want to reduce costs and speed up training without loss of quality.
Frequently Asked Questions
- How much does Muon speed up training compared to AdamW?
- On the NanoGPT benchmark, Muon trains 35% faster than AdamW and reaches GPT-2 XL performance about 25% earlier. The savings also come from lower memory usage, because Muon keeps only one momentum buffer per parameter.
- Why does Muon use less memory than AdamW?
- Muon keeps only one state buffer per parameter, a momentum buffer that accumulates gradients, while AdamW keeps two (first- and second-moment estimates). As a result, optimizer memory is about 45% smaller, and on the Qwen2.5-3B model a saving of 9%, or about 3 GiB, was measured.
- How does DeepSpeed combine Muon and AdamW?
- DeepSpeed uses a hybrid approach: Muon is applied to the 2D weights of the attention and MLP layers, while for embeddings and normalization layers it uses AdamW as a fallback. This yields memory savings without loss of stability on the layers that Muon does not suit.
- Which large models already use Muon?
- Muon is already in use in several large models: Kimi-K2 (1 trillion parameters), GLM-5 (744 billion) and DeepSeek-V4 (1.6 trillion). This shows that the optimizer is proven in training models of the largest scale.