Models
Mixture of Experts (MoE)
A neural network architecture that activates only a subset of its parameters for each input, providing the capability of a much larger model at a fraction of the inference cost.
A Mixture of Experts (MoE) model contains a large number of “expert” sub-networks plus a learned router that selects which experts process each token. For any given input, only a few experts (typically 2 of 8, 8 of 64, or similar) activate; the rest stay dormant. The result is a model with the total parameter count and capacity of a huge dense model but the per-token compute cost of a much smaller one.
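Below is a minimal sketch of top-k routing in PyTorch-style pseudocode. The class name, expert shape, and token-dispatch loop are illustrative assumptions, not any particular model's implementation; real systems use batched dispatch kernels rather than a Python loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative MoE feed-forward layer: a learned router picks k of n experts per token."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert for every token.
        logits = self.router(x)                          # (tokens, n_experts)
        weights, indices = logits.topk(self.k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen experts

        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e             # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts ever run, so compute per token scales with k rather than with the total number of experts, while total parameters scale with all of them.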
Why it matters: scaling dense transformers hits a wall — bigger models become prohibitively expensive to run. MoE breaks the link between total parameters and active parameters per token, letting you keep growing capacity without proportional inference cost.
Notable MoE models:
- Mixtral 8x7B and 8x22B (Mistral AI, open-weight)
- DeepSeek-V3 / DeepSeek-R1 (671B total, ~37B active)
- Llama 4 family (Meta, MoE adopted in 2025)
- GPT-4 / GPT-5 (OpenAI; widely rumored to use MoE, not officially confirmed)
- Qwen MoE series (Alibaba)
Trade-offs: MoE training is more complex (it needs load balancing to prevent expert collapse, where the router funnels most tokens to a few experts), inference servers need higher peak memory (all experts must fit even though only a few run per token), and not every workload benefits equally. By 2026, MoE has become the default architecture for top open-weight and closed frontier models.
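The load-balancing problem is commonly addressed with an auxiliary loss that rewards spreading tokens evenly across experts. The sketch below follows the general form popularized by the Switch Transformer; the function name and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        top_k_indices: torch.Tensor,
                        n_experts: int) -> torch.Tensor:
    # Fraction of tokens actually dispatched to each expert (hard top-k assignments).
    one_hot = F.one_hot(top_k_indices, n_experts).float()   # (tokens, k, n_experts)
    tokens_per_expert = one_hot.sum(dim=(0, 1)) / one_hot.sum()

    # Average router probability mass assigned to each expert (soft scores).
    router_probs = F.softmax(router_logits, dim=-1)          # (tokens, n_experts)
    prob_per_expert = router_probs.mean(dim=0)

    # Minimized when both distributions are uniform, pushing the router to
    # spread tokens evenly and avoid expert collapse.
    return n_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

In practice this auxiliary term is added to the main training loss with a small coefficient, trading a little routing freedom for even expert utilization.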