Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

Zhenwen Liang; Yujun Zhou; Sidi Lu; Xiangliang Zhang; Haitao Mi; Dong Yu

✨ TL;DR

This paper addresses a critical problem in reinforcement learning for large language models: when base models are already very accurate on training benchmarks, standard RL methods fail because there aren't enough errors to learn from, causing models to collapse into repetitive solutions. The authors propose CUTS, a novel sampling strategy that maintains solution diversity even when models are highly accurate, improving generalization on challenging out-of-domain math problems by up to 15.1%.

01 · Problem

As large language models become stronger, they increasingly saturate standard reasoning benchmarks like MATH, producing correct but nearly identical solutions. This creates a paradox for reinforcement learning: group-relative algorithms like GRPO compute advantage signals by comparing better and worse solutions sampled for the same prompt. When the model is already highly accurate, nearly every sampled solution earns the same reward, so there are few failure cases to contrast against and the advantage signal vanishes. The result is mode collapse: the policy degenerates into homogeneous, repetitive solutions rather than exploring diverse reasoning paths. RL recipes designed for weaker models therefore break down on already-capable base models, which blocks further improvement and limits out-of-domain generalization.
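The vanishing signal can be seen in a small numeric sketch. The code below standardizes rewards within one group of rollouts, which is the usual form of a group-relative advantage; the function name, the `eps` constant, and the reward values are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    eps is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical 0/1 correctness rewards for four rollouts of one prompt.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])      # some failures
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # all correct

print(mixed)      # positive for correct rollouts, negative for incorrect ones
print(saturated)  # every advantage is zero, so there is no gradient signal
```

When every rollout in the group is correct, each reward equals the group mean, so every advantage is exactly zero and the prompt contributes nothing to the policy update.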

02 · Approach

The authors propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy that enforces exploration while maintaining solution quality. Unlike standard sampling methods that follow the model's probability distribution (and thus its biases), CUTS flattens the local optimization landscape by sampling uniformly from a constrained set of high-confidence candidate tokens. This preserves structural diversity in generated solutions without sacrificing correctness. They integrate CUTS into Mixed-CUTS, a training framework that combines exploitative rollouts (using standard sampling) with exploratory rollouts (using CUTS). This synergistic approach amplifies intra-group advantage variance, providing meaningful learning signals even when most solutions are correct. The method maintains diversity within the semantic manifold of valid solutions, allowing the model to explore different reasoning paths while staying within the space of correct answers.
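A minimal sketch of the sampling idea, under stated assumptions: this summary does not specify how the candidate set is constrained, so the code below assumes the candidates are the top `k` tokens whose probability is at least `min_ratio` times the top token's probability. The names `constrained_uniform_top_k`, `k`, and `min_ratio` are hypothetical and are not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def constrained_uniform_top_k(logits, k=3, min_ratio=0.1, rng=random):
    """Pick a token id uniformly from a constrained top-k candidate set.

    Assumed constraint (not from the paper): a token is a candidate if it
    is among the k most probable tokens AND its probability is at least
    min_ratio times the probability of the most probable token.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    p_max = probs[ranked[0]]
    candidates = [i for i in ranked[:k] if probs[i] >= min_ratio * p_max]
    # Uniform choice ignores the relative probabilities inside the set.
    return rng.choice(candidates), candidates


rng = random.Random(0)
# Several plausible tokens: all three top tokens are candidates.
token, cands = constrained_uniform_top_k([5.0, 4.5, 4.0, -2.0, -3.0], rng=rng)
# One dominant token: the set shrinks to a single candidate.
forced, forced_cands = constrained_uniform_top_k([10.0, 0.0, 0.0], rng=rng)
```

In this reading, positions where the model is confident stay effectively deterministic, while positions with several plausible continuations are explored evenly instead of in proportion to the model's existing preferences.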

03 · Key insights

What the paper shows.

01 Strong base models that saturate benchmarks create a paradox where traditional RL fails due to vanishing advantage signals from lack of failure cases

02 Uniform sampling from constrained high-confidence candidates preserves solution diversity without sacrificing correctness, preventing mode collapse

03 Mixing exploitative and exploratory rollouts amplifies intra-group variance, providing meaningful learning signals even in high-accuracy regimes

04 Maintaining diversity within the semantic manifold of correct solutions is critical for out-of-domain generalization in rigorous reasoning tasks
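The variance claim in the third insight can be illustrated with a toy calculation. This is one reading of the mechanism, not the paper's procedure: it assumes that exploitative and exploratory rollouts for a prompt are pooled into a single group before advantages are computed, and the reward values are made up for illustration.

```python
from statistics import pvariance


def mixed_group_rewards(exploit_rewards, explore_rewards):
    """Pool both rollout types into one group for a single prompt."""
    return list(exploit_rewards) + list(explore_rewards)


# Hypothetical 0/1 correctness rewards, invented for this example only.
exploit = [1.0, 1.0, 1.0, 1.0]  # standard sampling on a saturated prompt
explore = [1.0, 0.0, 1.0, 1.0]  # exploratory rollouts, one wrong answer

exploit_only_var = pvariance(exploit)
mixed_var = pvariance(mixed_group_rewards(exploit, explore))

print(exploit_only_var)  # zero variance: no learning signal from this group
print(mixed_var)         # nonzero variance: the pooled group is informative
```

Under these assumed rewards, the exploit-only group has zero reward variance, while the pooled group does not, so a group-relative method has something to contrast.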

04 · Results

Experiments on Qwen3 models demonstrate that Mixed-CUTS significantly outperforms standard GRPO, particularly on out-of-domain generalization. On the challenging AIME25 benchmark, the approach achieves up to 15.1% improvement in Pass@1 accuracy over standard GRPO. The method successfully prevents policy degeneration that occurs with vanilla group-relative algorithms when training on saturated data. The results validate that the proposed exploration strategy maintains meaningful diversity in solution generation, enabling continued learning even when base models are already highly accurate on training benchmarks. The improvements are most pronounced on difficult out-of-domain problems, demonstrating that diversity-preserving exploration enhances the model's ability to generalize beyond its training distribution.

05 · Limitations

The paper does not explicitly discuss computational overhead introduced by the dual rollout strategy in Mixed-CUTS, which requires generating both exploitative and exploratory solutions during training. While the method is described as parameter-free, the practical choice of K in top-K sampling and the mixing ratio between exploitative and exploratory rollouts may require tuning for different model scales or domains. The evaluation focuses primarily on mathematical reasoning benchmarks, leaving open questions about generalization to other reasoning domains or task types. The paper does not address potential failure modes when the constrained candidate set itself becomes too narrow, or how the method performs when base models are not yet saturated on the training distribution. Additionally, the long-term effects of enforced uniform sampling on model calibration and confidence estimation are not thoroughly explored.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
