Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

Jacob Morrison; Sanjay Adhikesaven; Akshita Bhagia; Matei Zaharia; Noah A. Smith; Sewon Min

✨ TL;DR

BAR (Branch-Adapt-Route) trains domain experts independently and combines them in a Mixture-of-Experts model, so a language model can be updated one domain at a time without retraining everything or degrading existing capabilities. It performs comparably to monolithic retraining, while training cost grows linearly rather than quadratically as domains are added.

01 · Problem

Extending a post-trained language model with new capabilities involves a fundamental trade-off. Retraining from scratch on all domains together is computationally expensive and scales poorly, with cost growing quadratically as domains are added. Continued training on only the new domains is cheaper but often causes catastrophic forgetting and degrades existing capabilities. Monolithic training also requires reprocessing all data whenever any single domain is updated, which makes iterative development impractical at scale.

02 · Approach

BAR trains independent domain experts, where each expert undergoes its own complete post-training pipeline including mid-training, supervised finetuning, and reinforcement learning. These separately trained experts are then composed using a Mixture-of-Experts (MoE) architecture with lightweight router training to direct inputs to appropriate experts. The modular design allows individual experts to be updated or added independently without affecting other domains. The authors evaluate this approach at 7B scale with four domain experts: math, code, tool use, and safety, comparing against monolithic retraining baselines both with and without mid-training.
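The composition step can be pictured with a small routing sketch. This is an illustration only: it assumes a linear router with softmax gating and top-k expert selection, which is a common MoE pattern but is not confirmed here as the paper's exact design. The names `route`, `router_weights`, `top_k`, and the toy experts are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(hidden, router_weights, experts, top_k=2):
    """Mix the outputs of the top_k experts picked by a linear router.

    Assumption: top-k softmax gating; the summary above does not specify
    the routing rule used in BAR.
    """
    # One logit per expert: dot product of the input with that expert's router row.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)
    # Indices of the top_k experts, highest gate probability first.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(hidden)
    for i in chosen:
        gate = probs[i] / norm  # renormalize gates over the selected experts
        expert_out = experts[i](hidden)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

# Toy stand-ins for independently trained experts (hypothetical).
experts = [
    lambda h: [2.0 * x for x in h],  # stand-in for a "math" expert
    lambda h: [-1.0 * x for x in h], # stand-in for a "code" expert
]
# Only these router rows would be trained in a "lightweight router" setup.
router_weights = [[1.0, 0.0], [0.0, 1.0]]

mixed, picked = route([3.0, 0.0], router_weights, experts, top_k=1)
print(picked, mixed)
```

In this sketch the expert functions stay fixed and only `router_weights` would be learned, which mirrors the idea that experts can be swapped or added without touching the others.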

03 · Key insights

What the paper shows.

01 Modular expert training scales linearly with new domains, while monolithic retraining scales quadratically, providing significant cost advantages as models expand to more domains.

02 Isolating domains into separate experts prevents the catastrophic forgetting that occurs when late-stage reinforcement learning degrades capabilities acquired in earlier training stages.

03 Individual experts can be updated independently without requiring reprocessing of other domains or degrading their performance, enabling efficient iterative development.

04 Lightweight router training is sufficient to effectively compose independently trained experts, achieving performance comparable to full monolithic retraining.
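The linear-versus-quadratic claim in the first insight can be illustrated with a toy cost model. The model is an assumption made for illustration, not the paper's accounting: every domain costs the same to train, monolithic retraining reprocesses all domains each time one is added (so the k-th addition costs about k units and n additions cost n(n+1)/2 in total), and modular training pays for one new expert plus an optional router refresh. The function names and the `router_cost` parameter are hypothetical.

```python
def monolithic_cumulative_cost(num_domains, cost_per_domain=1.0):
    """Total cost if each added domain triggers a retrain on all domains so far.

    Assumption: adding the k-th domain costs k * cost_per_domain.
    """
    return sum(k * cost_per_domain for k in range(1, num_domains + 1))

def modular_cumulative_cost(num_domains, cost_per_domain=1.0, router_cost=0.0):
    """Total cost if each added domain trains one expert plus a router refresh.

    Assumption: router_cost is a fixed, small per-addition cost.
    """
    return num_domains * (cost_per_domain + router_cost)

# Compare cumulative cost as the number of domains grows.
for n in (1, 2, 4, 8, 16):
    mono = monolithic_cumulative_cost(n)
    modular = modular_cumulative_cost(n, router_cost=0.1)
    print(f"domains={n:2d}  monolithic={mono:6.1f}  modular={modular:6.1f}")
```

Under these assumptions the gap widens with every added domain, which is the structural advantage the insight describes. Real costs will depend on dataset sizes and training stages per domain.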

04 · Results

At the 7B parameter scale with four domain experts (math, code, tool use, and safety), BAR achieved an overall score of 49.1, averaged across 7 evaluation categories. This exceeds the monolithic retraining baseline without mid-training (47.8) and falls 1.4 points short of the baseline with mid-training (50.5). The results suggest that modular expert-based training can remain competitive with monolithic retraining while offering structural advantages in cost scaling and avoiding catastrophic forgetting during updates.

05 · Limitations

The paper evaluates BAR only at the 7B parameter scale with four specific domains, leaving open how the approach scales to larger models and more domains. It requires maintaining multiple expert models rather than a single monolithic model, which may increase deployment complexity and memory requirements. The effectiveness of lightweight router training may depend on how separable the domains are, and performance on tasks requiring cross-domain reasoning is not extensively explored. The paper does not provide a detailed analysis of router behavior or of failure modes when domain boundaries are ambiguous.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
