🟡 📦 Open Source Tuesday, April 21, 2026 · 3 min read

Allen Institute BAR: Modular Post-Training with Mixture-of-Experts Delivers +7.8 Points for Math on OLMo 2 7B

[Illustration: a modular MoE system with a router component delegating queries to different experts]

Why it matters

BAR (Branch-Adapt-Route) is a new modular approach to post-training from the Allen Institute for AI that enables independent training of domain experts (math, code, tool use, safety) and their combination into a unified mixture-of-experts model. Results on OLMo 2 7B: a 49.1 average score, with +7.8 points on math and +4.7 on code over the monolithic retraining baseline.

What is BAR and how does it work?

The Allen Institute for AI published BAR (Branch-Adapt-Route) on April 20, 2026 — a new modular approach to post-training language models. Instead of the classic monolithic approach — where a single model goes through one large post-training pipeline — BAR enables independent training of multiple specialized experts:

  • Math
  • Code
  • Tool use (calling external tools)
  • Safety

Each expert is trained separately on its own domain, then the experts are merged through a routing mechanism into a single unified mixture-of-experts (MoE) model. In an MoE architecture, the model contains multiple specialized sub-networks, and a router decides which expert handles each incoming query.
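The routing idea can be illustrated with a toy sketch. This is not the BAR implementation (the paper's router is a learned gate over hidden states inside the network); here a hypothetical keyword-based gate stands in for the learned router, and plain functions stand in for the expert sub-networks:

```python
# Toy illustration of MoE-style routing (not the actual BAR code).
# Each "expert" is a stand-in function; in a real MoE model these are
# specialized transformer sub-networks selected by a learned gate.

EXPERTS = {
    "math": lambda q: f"[math expert] {q}",
    "code": lambda q: f"[code expert] {q}",
    "tools": lambda q: f"[tool-use expert] {q}",
    "safety": lambda q: f"[safety expert] {q}",
}

# Hypothetical keyword table standing in for a trained router network.
KEYWORDS = {
    "math": ("equation", "integral", "prove"),
    "code": ("function", "bug", "compile"),
    "tools": ("search", "api", "browse"),
}

def route(query: str) -> str:
    """Dispatch a query to the matching expert; default to safety review."""
    q = query.lower()
    for expert, words in KEYWORDS.items():
        if any(w in q for w in words):
            return EXPERTS[expert](query)
    return EXPERTS["safety"](query)
```

The key property the sketch preserves is that the experts never call each other: only the router decides, which is what makes them independently trainable and swappable.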

How much does BAR improve performance?

Results on OLMo 2 7B, Allen Institute’s open model, measured across 19 benchmarks:

  • 49.1 average score (vs. 47.8 for the monolithic retraining baseline)
  • +7.8 points for math
  • +4.7 points for code

A difference of 1.3 points on average may sound modest, but in domain-specific areas like math and code, an improvement of 5–8 points is significant — especially because it is achieved without degradation in other areas.

Why is modularity more important than the benchmark score?

The real breakthrough of BAR is not the benchmark score, but the possibility of incremental improvement. In the classical approach, every major improvement requires full retraining — restarting the expensive post-training process. With BAR, individual experts can be swapped or upgraded without disrupting the rest of the system:

  • Replacing the code expert with a new, better one: +16.5 points for code
  • Adding reinforcement learning (RL) for the math expert: +13 points for math

This approach resembles how software is developed — modular services upgraded independently — rather than monolithic rebuilds of the entire system.
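This upgrade pattern can be sketched in a few lines. All names and checkpoint files below are illustrative, not from the BAR release; the point is that swapping one expert leaves every other component untouched:

```python
# Sketch of modular upgrades: replacing one expert checkpoint in a
# composed model without retraining the rest. Checkpoint names are
# hypothetical placeholders, not actual BAR artifacts.

model = {
    "router": "router-v1.pt",
    "experts": {
        "math": "math-v1.pt",
        "code": "code-v1.pt",
        "tools": "tools-v1.pt",
        "safety": "safety-v1.pt",
    },
}

def upgrade_expert(model: dict, domain: str, checkpoint: str) -> dict:
    """Swap a single expert; all other components stay identical."""
    upgraded = {**model, "experts": {**model["experts"]}}
    upgraded["experts"][domain] = checkpoint
    return upgraded

# Dropping in a better code expert, as in the +16.5-point example above:
v2 = upgrade_expert(model, "code", "code-v2.pt")
```

This mirrors the software analogy in the text: the composed model is a manifest of independently versioned parts, so an upgrade is a one-entry change rather than a full rebuild.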

How does BAR solve the catastrophic forgetting problem?

One of the biggest problems in AI research is catastrophic forgetting: new knowledge “erases” the old. If you fine-tune a model for math, there is a real chance of worsening its capabilities in other domains (e.g., poetry, dialogue, code). This makes incremental improvement risky.

BAR sidesteps this through expert isolation: while each expert trains on its domain, the weights of the other experts are never touched. The router only learns when to use which expert. This way, specialization can be added without fear of regression elsewhere.

Implications for the open-source community

For open models, BAR opens a very important possibility — distributed development. Different research teams can contribute different experts, which are then merged into a shared model. This approach could dramatically accelerate the evolution of open-source models.

In practice, the BAR authors suggest a pattern where the “base” model remains stable for a long time, and improvements come through publishing new experts. This could change how the open-source AI community collaborates — less “who has the best 7B model,” more “whose math expert is currently the best.”

The Allen Institute has thus confirmed its position as one of the most important players in open AI research, with the added advantage of publishing the entire methodology and the expert weights.

🤖

This article was generated using artificial intelligence from primary sources.