YAN: Mixture-of-Experts Flow Matching Achieves 40× Speedup Over Autoregressive LMs with Just 3 Sampling Steps
Why it matters
YAN is a new generative language model that combines Transformer and Mamba architectures with a Mixture-of-Experts Flow Matching approach — achieving quality comparable to autoregressive models in just 3 sampling steps, delivering a 40× speedup over AR baselines and up to 1000× over diffusion language models. The model decomposes global transport geometries into locally specialized vector fields.
What Is YAN and What Does It Do?
YAN is a new language model introduced in the paper “Towards Faster Language Model Inference via MoE Flow Matching,” combining two architectures: Transformer (standard attention-based) and Mamba (state-space model with linear scaling). This hybrid drives a Mixture-of-Experts (MoE) Flow Matching framework — a generative modeling approach where the model does not generate tokens one by one autoregressively, but instead learns a transport vector field that converts noise into coherent text in parallel.
The key innovation is decomposition: rather than a single global flow field, YAN learns multiple locally specialized vector fields via the MoE mechanism. Each expert covers a narrower geometric region of the latent space, addressing the problem of anisotropic (direction-dependent) and multimodal distributions that plague standard flow matching models for language.
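The decomposition idea can be sketched as a gated mixture of per-expert vector fields: a gating function routes each latent to the experts whose local region it falls in, and their velocities are blended. This is a minimal illustration, not the paper's implementation; the affine expert parameterization and all names here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16  # latent dimension
K = 4   # number of experts

# Each expert parameterizes its own locally specialized vector field.
# A simple affine map v_k(x, t) = A_k x + t * b_k stands in for a
# learned network (hypothetical parameterization).
A = rng.normal(scale=0.1, size=(K, D, D))
b = rng.normal(scale=0.1, size=(K, D))

# Gating weights: a softmax over expert logits lets each expert
# cover a narrower geometric region of the latent space.
W_gate = rng.normal(scale=0.1, size=(K, D))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_vector_field(x, t):
    """Blend locally specialized expert fields into one velocity."""
    gates = softmax(x @ W_gate.T)                       # (batch, K)
    expert_v = np.einsum('kij,bj->bki', A, x) + t * b   # (batch, K, D)
    return np.einsum('bk,bki->bi', gates, expert_v)     # (batch, D)

x = rng.normal(size=(2, D))   # noisy latents
v = moe_vector_field(x, 0.5)  # blended velocity, shape (2, D)
```

The point of the gating is that no single global field has to fit an anisotropic, multimodal transport geometry; each expert only needs to be accurate in its own region.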
How Much Is Actually Saved?
The numbers are dramatic:
- 40× speedup over autoregressive (AR) baselines of the same size
- Up to 1000× speedup over diffusion language models
- Only 3 sampling steps instead of hundreds in diffusion LMs
- Quality comparable to AR models (per the author’s evaluation)
For context, a standard autoregressive LLM produces one token per forward pass through the entire model. YAN generates entire sequences in 3 parallel steps, which in principle lets batch sizes grow substantially without a proportional increase in latency.
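The contrast can be sketched with a toy few-step sampler: flow matching integrates a learned vector field over a fixed, small number of Euler steps, updating every position in parallel, while an AR decoder needs one sequential forward pass per token. The vector field below is an assumed stand-in, not YAN's trained model:

```python
import numpy as np

def toy_vector_field(x, t):
    # Stand-in for a trained flow model: pushes latents toward a
    # fixed "data" point. A real model would be a neural network.
    target = np.ones_like(x)
    return target - x

def flow_sample(x0, num_steps=3):
    """Few-step Euler integration from noise x0 toward data."""
    x = x0.copy()
    for i in range(num_steps):
        t = i / num_steps
        x = x + (1.0 / num_steps) * toy_vector_field(x, t)
    return x

rng = np.random.default_rng(0)
seq_len, dim = 128, 8
noise = rng.normal(size=(seq_len, dim))

# All 128 token latents are updated together in only 3 steps,
# whereas an AR model would need 128 sequential forward passes.
sample = flow_sample(noise, num_steps=3)
```

Each Euler step moves the whole sequence closer to the data distribution at once, which is where the claimed 40× wall-clock advantage over token-by-token decoding comes from.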
Why Could This Be Important?
The autoregressive paradigm has dominated language modeling for the past seven years because, despite slow inference, it is the easiest to train on available GPU clusters. Diffusion LMs (such as Mercury or LLaDA) promise parallelism, but hundreds of sampling steps still make them impractical.
YAN’s approach — flow matching with locally specialized MoE experts — could be a third path that retains the speed advantage of diffusion with fewer steps. If the results replicate at larger scale, the door opens to a generation of models where inference latency is measured in milliseconds per response, not seconds.
What Still Needs to Be Proven?
Author Aihua Li presents the work as an arXiv preprint; no peer-reviewed publication is listed. Key open questions:
- Scaling: Is this a demonstration on smaller models (up to a few billion parameters), or are the results robust at 70B+ scale?
- Task complexity: Does YAN match AR model quality on complex reasoning and long-context tasks, not just shorter sequence generation?
- Open code: an open implementation from the author would let the community answer many of these questions quickly.
For now, YAN is a theoretically intriguing signal that the autoregressive paradigm has serious competition.
This article was generated using artificial intelligence from primary sources.