✨ TL;DR
BAR (Branch-Adapt-Route) trains separate domain experts independently and combines them via Mixture-of-Experts, enabling modular updates to language models without retraining everything or degrading existing capabilities. This approach matches monolithic retraining performance while scaling linearly instead of quadratically when adding new domains.
Extending post-trained language models with new capabilities faces a fundamental trade-off: retraining from scratch on all domains together is computationally expensive and scales poorly (cost grows quadratically with each new domain), while continued training on new domains often causes catastrophic forgetting and degrades existing capabilities. Monolithic training paradigms require full reprocessing of all data whenever any domain is updated, making iterative development impractical at scale. This creates a significant barrier to efficiently maintaining and extending large language models as new domain requirements emerge.
BAR trains independent domain experts, where each expert undergoes its own complete post-training pipeline including mid-training, supervised finetuning, and reinforcement learning. These separately trained experts are then composed using a Mixture-of-Experts (MoE) architecture with lightweight router training to direct inputs to appropriate experts. The modular design allows individual experts to be updated or added independently without affecting other domains. The authors evaluate this approach at 7B scale with four domain experts: math, code, tool use, and safety, comparing against monolithic retraining baselines both with and without mid-training.
What the paper shows.
At the 7B parameter scale with four domain experts (math, code, tool use, and safety), BAR achieved an overall score of 49.1 averaged across 7 evaluation categories. This performance matched or exceeded monolithic retraining baselines, which scored 47.8 without mid-training and 50.5 with mid-training. The results demonstrate that modular expert-based training can achieve competitive performance while providing structural advantages in cost scaling and avoiding catastrophic forgetting during updates.
The paper evaluates BAR only at the 7B parameter scale with four specific domains, leaving questions about scalability to larger models and more numerous domains. The approach requires maintaining multiple expert models rather than a single monolithic model, which may increase deployment complexity and memory requirements. The effectiveness of the lightweight router training may depend on domain separability, and performance on tasks requiring cross-domain reasoning is not extensively explored. The paper does not provide detailed analysis of router behavior or failure modes when domain boundaries are ambiguous.
✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.