When Can LLMs Learn to Reason with Weak Supervision?

Salman Rahman; Jingyan Shen; Anna Mordvina; Hamid Palangi; Saadia Gabriel; Pavel Izmailov

✨ TL;DR

This paper investigates when reinforcement learning with verifiable rewards (RLVR) enables large language models to generalize under weak supervision (scarce data, noisy rewards, or self-supervised signals). The key finding is that models generalize when they show a prolonged pre-saturation phase during training, and that this behavior is predicted by reasoning faithfulness: the degree to which intermediate reasoning steps logically support the final answer.

01 · Problem

Large language models have improved at reasoning through reinforcement learning with verifiable rewards, but creating high-quality reward signals becomes harder as models advance. Understanding when RLVR succeeds with weaker supervision is critical for scaling these methods. The paper addresses three weak supervision scenarios: limited training data, noisy reward signals, and self-supervised proxy rewards that may not align perfectly with the true task objective. Without a clear understanding of what enables generalization in these settings, practitioners risk training models that memorize training patterns rather than learning generalizable reasoning strategies.
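To make the three scenarios concrete, here is a minimal sketch of how each kind of weak signal could be simulated. All function names, the exact-match verifier, the label-flipping noise model, and the majority-vote proxy are illustrative assumptions; this summary does not state how the paper implements them.

```python
import random
from collections import Counter


def exact_match_reward(answer: str, reference: str) -> float:
    """Clean verifiable reward: 1.0 if the answer matches the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def noisy_reward(answer: str, reference: str, flip_prob: float,
                 rng: random.Random) -> float:
    """Noisy reward (assumed noise model): flip the clean reward
    with probability flip_prob."""
    clean = exact_match_reward(answer, reference)
    return 1.0 - clean if rng.random() < flip_prob else clean


def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Self-supervised proxy (assumed): reward samples that agree with
    the most common answer. No reference answer is used."""
    normalized = [a.strip() for a in answers]
    majority, _ = Counter(normalized).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in normalized]


def scarce_subset(prompts: list[str], k: int, rng: random.Random) -> list[str]:
    """Scarce data: train on at most k randomly chosen prompts."""
    return rng.sample(prompts, min(k, len(prompts)))
```

Note that the majority-vote proxy rewards agreement rather than correctness, which is one way a self-supervised signal can diverge from the true objective.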

02 · Approach

The authors conduct a systematic empirical study across multiple model families and reasoning domains under three weak supervision conditions. They analyze training dynamics by tracking how training reward saturation relates to downstream generalization performance. The study examines pre-RL model properties, specifically reasoning faithfulness (whether intermediate steps logically support final answers) and output diversity, to identify predictors of generalization success. They then disentangle the effects of continual pre-training on domain data versus supervised fine-tuning on explicit reasoning traces. Finally, they validate their findings by applying identified interventions to Llama3.2-3B-Base to transform a non-generalizing model into one that succeeds across all weak supervision settings.
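As a rough illustration of tracking reward saturation, the sketch below finds the first training step at which a trailing moving average of the training reward crosses a threshold. The function name, the window size, and the 0.95 threshold are hypothetical choices for illustration, not the paper's definition of saturation.

```python
from typing import Optional, Sequence


def saturation_step(rewards: Sequence[float], window: int = 3,
                    threshold: float = 0.95) -> Optional[int]:
    """Return the first step index whose trailing moving average of
    training reward is >= threshold, or None if it never saturates."""
    if window < 1:
        raise ValueError("window must be at least 1")
    for i in range(window - 1, len(rewards)):
        avg = sum(rewards[i - window + 1 : i + 1]) / window
        if avg >= threshold:
            return i
    return None


# A rapidly saturating run versus one with a long pre-saturation phase.
fast_run = [0.5, 0.9, 1.0, 1.0, 1.0, 1.0]
slow_run = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(saturation_step(fast_run))  # 3
print(saturation_step(slow_run))  # None
```

Under this assumed definition, a later (or absent) saturation step corresponds to a longer pre-saturation phase, which could then be compared against downstream accuracy measured at the same checkpoints.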

03 · Key insights

What the paper shows.

01. Generalization under weak supervision is governed by training reward saturation dynamics: models that generalize show prolonged pre-saturation phases where training reward and downstream performance improve together, while rapidly saturating models memorize instead of learning.

02. Reasoning faithfulness (the extent to which intermediate reasoning steps logically support final answers) is the critical pre-RL property that predicts whether a model will generalize, while output diversity alone provides no predictive signal.

03. Supervised fine-tuning on explicit reasoning traces is necessary for generalization under weak supervision, establishing the logical structure needed for learning from imperfect signals.

04. Continual pre-training on domain-specific data amplifies the benefits of reasoning-trace SFT, and combining both interventions enables generalization where neither alone suffices.

04 · Results

The study demonstrates that models exhibiting prolonged pre-saturation training dynamics successfully generalize across weak supervision settings, while rapidly saturating models fail to generalize. Reasoning faithfulness emerges as a reliable predictor of this behavior across diverse model families and domains. When applied to Llama3.2-3B-Base, the combination of continual pre-training and supervised fine-tuning on reasoning traces enabled the model to generalize successfully across all three weak supervision settings (scarce data, noisy rewards, and self-supervised proxy rewards) where the base model had previously failed. The interventions worked synergistically, with SFT providing necessary reasoning structure and continual pre-training amplifying the effect.

05 · Limitations

The paper does not specify quantitative thresholds for reasoning faithfulness that guarantee generalization, making it difficult to predict success for new models without empirical testing. The study focuses on verifiable reasoning domains where reward signals can be constructed, which may limit applicability to open-ended tasks without clear correctness criteria. The interventions require additional compute for continual pre-training and additional effort to curate reasoning-trace data for supervised fine-tuning. The paper does not extensively explore how the findings scale to very large models or whether the identified dynamics hold in reasoning domains beyond those tested.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
