YAN: Mixture-of-Experts Flow Matching Achieves 40× Speedup Over Autoregressive LMs with Just 3 Sampling Steps
Why it matters
YAN is a new generative language model that combines Transformer and Mamba architectures with a Mixture-of-Experts Flow Matching approach — achieving quality comparable to autoregressive models in just 3 sampling steps, delivering a 40× speedup over AR baselines and up to 1000× over diffusion language models. The model decomposes global transport geometries into locally specialized vector fields.
What Is YAN and What Does It Do?
YAN is a new language model introduced in the paper “Towards Faster Language Model Inference via MoE Flow Matching,” combining two architectures: Transformer (standard attention-based) and Mamba (state-space model with linear scaling). This hybrid drives a Mixture-of-Experts (MoE) Flow Matching framework — a generative modeling approach where the model does not generate tokens one by one autoregressively, but instead learns a transport vector field that converts noise into coherent text in parallel.
The key innovation is decomposition: rather than a single global flow field, YAN learns multiple locally specialized vector fields via the MoE mechanism. Each expert covers a narrower geometric region of the latent space, addressing the problem of anisotropic (direction-dependent) and multimodal distributions that plague standard flow matching models for language.
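The decomposition idea can be sketched as a gated mixture of per-expert vector fields: a gating function routes each latent to the experts whose local region it falls in, and their velocities are blended. This is a minimal illustration, not the paper's implementation; the affine expert parameterization and all names here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16  # latent dimension
K = 4   # number of experts

# Each expert parameterizes its own locally specialized vector field.
# A simple affine map v_k(x, t) = A_k x + t * b_k stands in for a
# learned network (hypothetical parameterization).
A = rng.normal(scale=0.1, size=(K, D, D))
b = rng.normal(scale=0.1, size=(K, D))

# Gating weights: a softmax over expert logits lets each expert
# cover a narrower geometric region of the latent space.
W_gate = rng.normal(scale=0.1, size=(K, D))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_vector_field(x, t):
    """Blend locally specialized expert fields into one velocity."""
    gates = softmax(x @ W_gate.T)                       # (batch, K)
    expert_v = np.einsum('kij,bj->bki', A, x) + t * b   # (batch, K, D)
    return np.einsum('bk,bki->bi', gates, expert_v)     # (batch, D)

x = rng.normal(size=(2, D))   # noisy latents
v = moe_vector_field(x, 0.5)  # blended velocity, shape (2, D)
```

The point of the gating is that no single global field has to fit an anisotropic, multimodal transport geometry; each expert only needs to be accurate in its own region.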
How Much Is Actually Saved?
The numbers are dramatic:
- 40× speedup over autoregressive (AR) baselines of the same size
- Up to 1000× speedup over diffusion language models
- Only 3 sampling steps instead of hundreds in diffusion LMs
- Quality comparable to AR models (per the author’s evaluation)
For context, a standard autoregressive LLM produces one token per forward pass through the entire model. YAN generates entire sequences in 3 parallel steps, which in principle lets batch sizes grow substantially without a proportional increase in latency.
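The contrast can be sketched with a toy few-step sampler: flow matching integrates a learned vector field over a fixed, small number of Euler steps, updating every position in parallel, while an AR decoder needs one sequential forward pass per token. The vector field below is an assumed stand-in, not YAN's trained model:

```python
import numpy as np

def toy_vector_field(x, t):
    # Stand-in for a trained flow model: pushes latents toward a
    # fixed "data" point. A real model would be a neural network.
    target = np.ones_like(x)
    return target - x

def flow_sample(x0, num_steps=3):
    """Few-step Euler integration from noise x0 toward data."""
    x = x0.copy()
    for i in range(num_steps):
        t = i / num_steps
        x = x + (1.0 / num_steps) * toy_vector_field(x, t)
    return x

rng = np.random.default_rng(0)
seq_len, dim = 128, 8
noise = rng.normal(size=(seq_len, dim))

# All 128 token latents are updated together in only 3 steps,
# whereas an AR model would need 128 sequential forward passes.
sample = flow_sample(noise, num_steps=3)
```

Each Euler step moves the whole sequence closer to the data distribution at once, which is where the claimed 40× wall-clock advantage over token-by-token decoding comes from.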
Why Could This Be Important?
The autoregressive paradigm has dominated language modeling for the past seven years because, despite slow inference, it is the easiest to train on available GPU clusters. Diffusion LMs (such as Mercury or LLaDA) promise parallelism, but hundreds of sampling steps still make them impractical.
YAN’s approach — flow matching with locally specialized MoE experts — could be a third path that retains the speed advantage of diffusion with fewer steps. If the results replicate at larger scale, the door opens to a generation of models where inference latency is measured in milliseconds per response, not seconds.
What Still Needs to Be Proven?
Author Aihua Li presents the work as an arXiv preprint; no peer-reviewed publication is listed. Key open questions:
- Scaling: Is this a demonstration on smaller models (up to a few billion parameters), or are the results robust at 70B+ scale?
- Task complexity: Does YAN match AR model quality on complex reasoning and long-context tasks, not just shorter sequence generation?
- Open code: an open implementation from the author would let the community answer many of these questions quickly.
For now, YAN is a theoretically intriguing signal that the autoregressive paradigm has serious competition.
This article was generated using artificial intelligence from primary sources.