✨ TL;DR
This paper introduces GRPO-VPS, which enhances Group Relative Policy Optimization by adding verifiable process supervision that tracks the model's confidence in correct answers throughout reasoning steps. This enables more targeted credit assignment and sample-efficient policy updates without requiring critic models or auxiliary supervision.
Group Relative Policy Optimization (GRPO) trains LLMs to reason without a critic model, but it assigns credit indiscriminately at the trajectory level. The model therefore cannot distinguish which intermediate reasoning steps were actually effective, which leads to poor identification of good reasoning strategies and to overthinking: reasoning length grows without a matching gain in accuracy.
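To make the credit-assignment issue concrete, here is a minimal sketch of the standard group-relative advantage that GRPO is known for: each rollout's reward is normalized against the other rollouts sampled for the same prompt, and the resulting scalar is shared by every token in that rollout. The function names, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: population standard deviation; some implementations
    use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Trajectory-level credit: every token inherits the same advantage."""
    return [advantage] * num_tokens


# Four rollouts for one prompt, scored 1.0 if the final answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# All five tokens of the first rollout receive an identical signal,
# whether or not the step they belong to helped reach the answer.
token_advantages = broadcast_to_tokens(advantages[0], num_tokens=5)
```

Because the signal is a single number per rollout, a helpful step and a redundant one in the same correct trajectory are reinforced equally, which is the gap the summary describes.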
The method introduces model-free verifiable process supervision by probing the model's belief in the correct answer at each step of the reasoning trajectory. The generation is segmented into discrete steps, and at each segment boundary, the conditional probability of the correct answer is computed. These segment-wise progress measurements are then used to refine GRPO's trajectory-level feedback, enabling more targeted policy updates without requiring Monte Carlo rollouts or auxiliary models.
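The probing idea above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes the per-token log-probabilities of the correct answer at each segment boundary have already been obtained from the policy, and the additive combination rule with weight `beta` is a hypothetical choice, since the summary does not specify how the progress measurements refine the trajectory-level feedback. All function names are made up for illustration.

```python
import math


def answer_probability(answer_token_logprobs):
    """Probability of the full answer given the reasoning prefix.

    For an autoregressive model this is the product of the per-token
    conditional probabilities, i.e. exp of the summed log-probabilities.
    """
    return math.exp(sum(answer_token_logprobs))


def segment_progress(answer_probs):
    """Change in the probed answer probability across each segment.

    answer_probs[0] is the probe before any reasoning; answer_probs[k]
    is the probe after segment k.
    """
    return [after - before for before, after in zip(answer_probs, answer_probs[1:])]


def refine_advantages(trajectory_advantage, answer_probs, beta=0.5):
    """Hypothetical rule: shift the shared advantage by per-segment progress."""
    return [trajectory_advantage + beta * d for d in segment_progress(answer_probs)]


# Probes at the start and after each of three reasoning segments.
probs = [0.10, 0.15, 0.60, 0.55]
progress = segment_progress(probs)
segment_advantages = refine_advantages(1.0, probs, beta=0.5)
```

In this toy case the second segment, where the probed probability jumps from 0.15 to 0.60, receives the largest share of credit, while the third segment, which slightly lowers the probability, receives less than the unrefined trajectory-level advantage. Each probe reuses the policy itself, so no rollouts or auxiliary models are involved.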
What the paper shows.
Experiments demonstrate consistent improvements over GRPO across multiple benchmarks. On mathematical reasoning tasks, the method achieves up to 2.6-point accuracy improvements and 13.7% reductions in reasoning length. On general-domain tasks, improvements reach up to 2.4 points in accuracy with 4% reasoning-length reduction. Results generalize across diverse model sizes and architectures.
The paper does not explicitly discuss limitations, but implicit constraints include: the approach relies on the model's ability to meaningfully express confidence through conditional probabilities, which may vary across different model architectures; the method's effectiveness may depend on the quality of step segmentation; and evaluation is limited to specific benchmark domains without exploration of performance on more diverse or adversarial reasoning tasks.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.