Marco-MoE: Open-Source Multilingual MoE with 5% Active Parameters Outperforms Dense Models with 3-14× More Active Parameters
Marco-MoE is a new open-source family of sparse Mixture-of-Experts models published on April 28, 2026, by a team led by Jiang, Zhao, and colleagues. The models activate only about 5% of their total parameters per token and are produced by upcycling dense models followed by pre-training on 5 trillion additional tokens; the Instruct variants outperform dense competitors with 3 to 14 times more activated parameters. The weights, dataset, and training recipe are all publicly released.
A team of eight researchers (Fan Jiang, Yu Zhao, Chenyang Lyu, Tianqi Shi, Yichao Du, Feihu Jiang, Longyue Wang, Weihua Luo) published the preprint Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling on April 28, 2026. It is one of the rare fully open MoE releases, covering the training dataset, the recipe, and the weights.
What Is Sparse MoE?
Mixture-of-Experts (MoE) is an architecture in which the model has multiple “experts” — parallel MLP modules — of which the router activates only a few for each token. Marco-MoE has an extremely sparse design that activates only about 5% of total parameters per input token, enabling efficient scaling of total capacity without a proportional increase in inference cost.
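To make the routing idea concrete, here is a minimal PyTorch sketch of a top-k sparse MoE layer. The class name, layer sizes, expert count, and top_k value are illustrative assumptions for the example, not Marco-MoE's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k MoE layer: a learned router picks a few experts per token."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary feed-forward (MLP) block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                               # x: (num_tokens, d_model)
        scores = self.router(x)                         # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the best experts
        weights = F.softmax(weights, dim=-1)            # normalize over the chosen ones
        out = torch.zeros_like(x)
        # Only the selected experts run, so compute scales with top_k, not num_experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 10 tokens through a layer with 16 experts, 2 active per token.
layer = SparseMoELayer()
y = layer(torch.randn(10, 512))
```

Because only top_k experts run per token, the per-token compute stays close to that of a small dense MLP even as the total number of experts, and thus total capacity, grows.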
Upcycling as the Training Recipe
Instead of training from scratch, the authors use upcycling: existing dense models are converted to an MoE architecture by copying their MLP layers into experts and adding a router network, followed by 5 trillion tokens of additional pre-training. The dense models used as seeds are not explicitly named in the abstract, but the approach has precedent in earlier MoE work (Mixtral, Qwen-MoE).
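A rough sketch of the upcycling idea follows, under the assumption that each transformer block's feed-forward network is duplicated into identical experts and paired with a freshly initialized router; the function name and expert count are illustrative, and the paper's exact procedure may differ.

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, d_model: int, num_experts: int = 8) -> nn.ModuleDict:
    """Turn one dense FFN block into an MoE block (illustrative, not the paper's exact recipe).

    Every expert starts as an identical copy of the dense FFN, so the upcycled
    model initially behaves like the dense seed; a new, randomly initialized
    router is added, and continued pre-training lets the experts specialize.
    """
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    router = nn.Linear(d_model, num_experts)
    return nn.ModuleDict({"experts": experts, "router": router})

# Example: upcycle a toy dense FFN into 8 identical experts plus a router.
dense_ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
moe_block = upcycle_ffn(dense_ffn, d_model=512)
```

Repeating this per transformer layer yields a model whose total parameter count grows with the number of experts while the per-token compute stays close to that of the dense seed.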
Marco-MoE-Instruct variants, obtained through post-training, outperform models with 3 to 14 times more active parameters on English and multilingual benchmarks. Concrete parameter counts (e.g., 7B active, 56B total) are not given in the retrieved abstract.
What Do They Say About Languages?
The most interesting part of the analysis: Marco-MoE learns structured expert activation patterns that are shared by related languages, while linguistically isolated languages receive highly specialized experts. The authors demonstrate that this enables scalable language extension without the cross-lingual interference typical of dense models, a practically important property for multilingual deployment.
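One simple way to probe such patterns, sketched below under the assumption that router scores can be collected per language from one MoE layer: count how often each expert is selected for a language's tokens and compare the resulting usage distributions. This is only an illustrative analysis, not the authors' exact methodology.

```python
import torch
import torch.nn.functional as F

def expert_usage(router_logits: torch.Tensor, top_k: int = 2, num_experts: int = 16) -> torch.Tensor:
    """Fraction of routing slots assigned to each expert for one language's tokens.

    router_logits: (num_tokens, num_experts) router scores collected while
    running text from a single language through one MoE layer.
    """
    idx = router_logits.topk(top_k, dim=-1).indices.flatten()
    counts = torch.bincount(idx, minlength=num_experts).float()
    return counts / counts.sum()

def usage_similarity(usage_a: torch.Tensor, usage_b: torch.Tensor) -> float:
    """Cosine similarity of two usage distributions; related languages should score higher."""
    return F.cosine_similarity(usage_a, usage_b, dim=0).item()
```

High similarity between, say, two closely related languages and low similarity against an isolated one would be consistent with the sharing-versus-specialization pattern the paper describes.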
Why Is the Release Significant?
Chinese teams (Qwen, DeepSeek, Yi) have taken the lead in open-weight models during 2025-2026, but few release the complete stack — weights, dataset, and recipe. Marco-MoE belongs to that rare category of full openness, enabling the research community to independently replicate and build derivative models.
Frequently Asked Questions
- What is MoE 'upcycling'?
- A technique where an existing dense model is converted into an MoE architecture by copying its MLP layers into experts and adding a router network. This saves the compute of training an MoE from scratch and carries over the quality of the dense model.
- What has been publicly released?
- Complete training datasets, recipes (procedures and hyperparameters), and model weights in both base and Instruct variants. This enables independent replication and fine-tuning on custom domains.
- What are the multilingual characteristics?
- Analysis shows that Marco-MoE learns structured expert activation patterns shared by related languages, while linguistically isolated languages receive highly specialized experts. This enables scalable language extension without the interference typical of dense models.
This article was generated using artificial intelligence from primary sources.