✨ TL;DR
This paper investigates how well small language models can learn reasoning tasks through reinforcement learning when training data and compute are limited. The study finds that mixing easy and hard problems during training provides up to 5x better sample efficiency than training on easy problems alone.
Fine-tuning large language models typically requires massive amounts of high-quality annotated data and substantial computational resources, especially when using Reinforcement Learning with Verifiable Rewards (RLVR) to improve reasoning capabilities. While prior research has shown benefits from scaling both data and compute for RLVR, these approaches are impractical in many real-world scenarios where organizations face constraints on available annotated data and accessible compute resources. There is a critical need to understand how models can be effectively trained with limited resources, yet systematic studies examining RLVR performance in low-data regimes are lacking.
The researchers conducted a comprehensive empirical study using open-source Small Language Models (SLMs) trained with RLVR across three procedurally generated datasets: number counting problems, graph reasoning tasks, and spatial reasoning challenges. They systematically varied dataset size, diversity, and complexity to characterize how these factors affect model performance in low-data settings. Procedural generation allowed precise control over task difficulty and dataset composition, enabling fine-grained analysis of how models trained on tasks of varying complexity generalize to new problems. The study compared single-complexity training with mixed-complexity training to identify how best to use limited data.
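To make the setup above concrete, here is a minimal sketch of a procedurally generated counting task with a controllable complexity knob, an exact-match verifiable reward, and easy-only versus mixed-complexity dataset construction. Everything in it is an illustrative assumption rather than the paper's actual generator: the task format, the function names (`make_counting_problem`, `verifiable_reward`, `build_dataset`), and the complexity levels are all hypothetical.

```python
import random

def make_counting_problem(rng, complexity):
    """Hypothetical counting task: `complexity` is the number of items to scan."""
    items = [rng.choice("abc") for _ in range(complexity)]
    target = rng.choice("abc")
    prompt = f"How many times does '{target}' appear in: {' '.join(items)}?"
    return {"prompt": prompt, "answer": items.count(target), "complexity": complexity}

def verifiable_reward(problem, model_output):
    """Return 1.0 if the output parses to the exact integer answer, else 0.0."""
    try:
        return 1.0 if int(model_output.strip()) == problem["answer"] else 0.0
    except ValueError:
        # Output was not a plain integer, so it cannot be verified as correct.
        return 0.0

def build_dataset(size, complexities, seed=0):
    """Sample `size` problems, drawing each problem's complexity from `complexities`."""
    rng = random.Random(seed)
    return [make_counting_problem(rng, rng.choice(complexities)) for _ in range(size)]

# Assumed complexity levels, chosen only for illustration.
easy_only = build_dataset(100, complexities=[5])
mixed = build_dataset(100, complexities=[5, 10, 20])
```

Both datasets have the same size, so any difference in downstream performance would come from the complexity mix rather than the number of training examples.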
What the paper shows.
The experiments showed that mixed-complexity training substantially outperforms single-complexity training in low-data scenarios, providing up to 5x gains in sample efficiency over easy-task-only training. Models transferred knowledge from simpler to more complex tasks across all three reasoning domains (counting, graph reasoning, and spatial reasoning). The procedurally generated datasets proved effective for both evaluation and training, and their controllable properties enabled precise measurement of how dataset characteristics affect learning outcomes. These results held consistently across the reasoning task types, suggesting the findings may generalize to other domains amenable to procedural generation.
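One common way to read a sample-efficiency claim is as a ratio: how many training samples the baseline needs to reach a target accuracy, divided by how many the alternative needs. This summary does not state how the paper computes the figure, so the definition below is an assumption, and the two learning curves are made-up placeholders, not results from the paper.

```python
def samples_to_reach(curve, target):
    """Smallest sample count whose accuracy meets `target`, or None if never reached.

    `curve` is a list of (num_samples, accuracy) pairs sorted by num_samples.
    """
    for num_samples, accuracy in curve:
        if accuracy >= target:
            return num_samples
    return None

def efficiency_gain(baseline_curve, candidate_curve, target):
    """Ratio of baseline samples to candidate samples needed to hit `target`."""
    base = samples_to_reach(baseline_curve, target)
    cand = samples_to_reach(candidate_curve, target)
    if base is None or cand is None:
        return None  # the ratio is undefined if either run never reaches the target
    return base / cand

# Placeholder curves for illustration only (NOT numbers from the paper).
toy_easy_only = [(100, 0.30), (200, 0.45), (400, 0.60)]
toy_mixed = [(100, 0.40), (200, 0.60), (400, 0.75)]

gain = efficiency_gain(toy_easy_only, toy_mixed, target=0.60)
```

With these toy curves the easy-only run needs 400 samples and the mixed run needs 200 to reach 0.60, so `gain` is 2.0. The ratio depends on the chosen target accuracy, which is one reason such claims are usually stated as "up to".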
The study focuses exclusively on small language models and three specific reasoning domains (counting, graphs, spatial reasoning), which may limit generalizability to larger models or other task types. The research relies on procedurally generated datasets with verifiable rewards, which may not capture the full complexity of real-world problems where ground truth is ambiguous or difficult to define. The paper does not extensively explore the upper limits of task complexity or the point at which the advantages of mixed-complexity training diminish. Additionally, while the study examines low-data regimes, the specific thresholds and definitions of 'low data' may vary across different applications and model architectures.
✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.