🤖 Models · Saturday, May 9, 2026 · 2 min read

Allen Institute: EMO — MoE language model with natural semantic modularity from data

Editorial illustration: MoE language model diagram with experts grouped by semantic domains

EMO is a new MoE language model from the Allen Institute with 1B active and 14B total parameters, trained on 1 trillion tokens. Its experts self-organize into semantic domains: with only 25% of the experts kept active, the performance loss is just about 1%.

🤖 This article was generated using artificial intelligence from primary sources.

The Allen Institute for AI (Ai2) published EMO on May 8, 2026 — a sparse Mixture-of-Experts (MoE) language model that develops natural semantic modularity among its experts without manual labels. The model has 1 billion active and 14 billion total parameters, with 128 experts of which 8 are active per token, and was trained on 1 trillion tokens.
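
For orientation, the reported figures can be gathered into a small configuration object. This is a reading aid only; the field names below are invented for this summary and say nothing about Ai2's actual code.

```python
# Reported EMO figures collected into an illustrative config (field names are
# assumptions made for this summary, not Ai2's implementation).
from dataclasses import dataclass

@dataclass
class EmoConfig:
    total_params: int = 14_000_000_000        # total parameters
    active_params: int = 1_000_000_000        # parameters active per token
    num_experts: int = 128                    # routable experts
    experts_per_token: int = 8                # experts activated per token
    training_tokens: int = 1_000_000_000_000  # tokens seen during training

cfg = EmoConfig()
print(f"{cfg.experts_per_token / cfg.num_experts:.1%} of experts fire per token")     # 6.2%
print(f"{cfg.active_params / cfg.total_params:.1%} of weights are active per token")  # 7.1%
```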

How does EMO achieve emergent modularity?

EMO uses document boundaries as a weak supervision signal: all tokens within the same document must choose their active experts from a shared pool. This simple restriction is sufficient for experts to self-organize during training into coherent groups that can be selectively used and combined. MoE (Mixture of Experts) is an architecture in which only a subset of the available expert networks is activated per token, enabling large capacity at lower computational cost.
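
A minimal sketch of what such document-constrained routing could look like is shown below. It assumes a standard top-k router and a precomputed per-document expert pool; the function and tensor names are ours, and this is not Ai2's implementation.

```python
# Sketch of document-constrained top-k MoE routing (illustrative assumption,
# not Ai2's code): every token may only pick experts from its document's pool.
import torch
import torch.nn.functional as F

def route_with_document_pool(router_logits, doc_pool_mask, k=8):
    """Select top-k experts per token, restricted to the document's shared pool.

    router_logits: (num_tokens, num_experts) raw router scores
    doc_pool_mask: (num_tokens, num_experts) bool, True where the expert
                   belongs to the pool of the token's source document
    """
    # Experts outside the document's pool are excluded before the top-k.
    masked = router_logits.masked_fill(~doc_pool_mask, float("-inf"))
    topk_logits, topk_idx = masked.topk(k, dim=-1)
    # Mixture weights are normalized over the selected experts only.
    return topk_idx, F.softmax(topk_logits, dim=-1)

# Toy usage: 4 tokens from one document, 128 experts, a shared pool of 32.
logits = torch.randn(4, 128)
pool = torch.zeros(128, dtype=torch.bool)
pool[torch.randperm(128)[:32]] = True           # the document's shared pool
idx, weights = route_with_document_pool(logits, pool.expand(4, 128))
print(idx.shape, weights.shape)                 # torch.Size([4, 8]) torch.Size([4, 8])
```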

What do the pruning experiment results show?

When only 25% of experts are activated (32 out of 128), EMO loses just ~1% absolute performance, while at 12.5% of experts (16 out of 128) the drop is around 3%. Standard MoE models degrade dramatically under the same conditions, suggesting that EMO has functionally separate expert subsets covering distinct thematic domains.
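
As an illustration of what such pruning might look like in practice, the sketch below keeps a fixed subset of experts and forces routing to stay inside it. The selection rule used here (activation frequency on some reference data) is an assumption made for the example, not the procedure described by Ai2.

```python
# Illustrative expert pruning: keep 32 of 128 experts and route only among them.
# The frequency-based selection rule below is an assumption, not Ai2's method.
import torch

def most_used_experts(routing_trace, num_experts=128, keep=32):
    """Rank experts by how often they appear in a routing trace."""
    counts = torch.bincount(routing_trace.flatten(), minlength=num_experts)
    return counts.topk(keep).indices

def prune_logits(router_logits, kept_experts):
    """Mask router scores so only the kept experts can be selected."""
    keep_mask = torch.zeros(router_logits.size(-1), dtype=torch.bool)
    keep_mask[kept_experts] = True
    return router_logits.masked_fill(~keep_mask, float("-inf"))

trace = torch.randint(0, 128, (1000, 8))   # toy trace: 1,000 tokens x 8 experts each
kept = most_used_experts(trace)            # hypothetical choice of 32 experts to keep
idx = prune_logits(torch.randn(10, 128), kept).topk(8, dim=-1).indices
assert bool(torch.isin(idx, kept).all())   # routing never leaves the kept subset
```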

Into which domains do experts cluster?

Visualization of expert activation shows clusters corresponding to semantic domains: “Health, Medical & Wellness”, “News Reporting”, “US Politics & Elections”, “Film & Music”. Standard MoE instead groups tokens by surface syntax — prepositions, definite articles and punctuation are scattered across clusters.

What is publicly available?

Ai2 released the full EMO model and a comparable standard MoE baseline on Hugging Face, training code on GitHub, and an interactive visualizer (emovisualization.netlify.app) that allows real-time exploration of expert activation by domain.

Frequently Asked Questions

What is EMO and how does it differ from standard MoE models?
EMO is a sparse Mixture-of-Experts language model that develops semantic modularity without manual labels — experts cluster around domains such as medicine or politics, while standard MoE models group tokens by surface syntax.
How many parameters and experts does EMO have?
The model has 1 billion active and 14 billion total parameters, with 128 experts of which 8 are active per token. It was trained on 1 trillion tokens.
What has been released publicly?
Ai2 released the full EMO model on Hugging Face, a comparable standard MoE baseline, training code on GitHub, and an interactive visualizer at emovisualization.netlify.app.