GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

Jingyi Wang; Lei Zhu; Tengjin Weng; Song-Li Wu; Haochen Tan; Jierun Chen; Chaofan Tao; Haoli Bai; Lu Hou; Lifeng Shang; Xiao-Ping Zhang

✨ TL;DR

This paper introduces GRPO-VPS, which enhances Group Relative Policy Optimization by adding verifiable process supervision that tracks the model's confidence in correct answers throughout reasoning steps. This enables more targeted credit assignment and sample-efficient policy updates without requiring critic models or auxiliary supervision.

01 · Problem

Group Relative Policy Optimization (GRPO) trains LLM reasoning without a critic model, but it assigns credit indiscriminately at the trajectory level: every step in a response receives the same outcome-based signal. The model therefore cannot distinguish which intermediate reasoning steps were actually effective. This makes good reasoning strategies hard to identify and encourages overthinking, where reasoning length grows without improving accuracy.
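To make the credit-assignment problem concrete, here is a minimal sketch of the standard group-relative advantage, assuming binary outcome rewards and a population standard deviation for normalization. The function names `grpo_advantages` and `broadcast_to_tokens` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its sampling group.

    Assumption: population standard deviation; implementations differ
    on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Copy one trajectory-level advantage onto every token of a response."""
    return [advantage] * num_tokens


# Four sampled responses to one prompt: two correct (1.0), two incorrect (0.0).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(group_rewards)

# Every token of the first response gets the same value, whether it
# belonged to a useful step or a redundant one.
token_advantages = broadcast_to_tokens(advantages[0], num_tokens=5)
```

Because the advantage is constant within a response, a step that moved the solution forward and a step that merely added length are reinforced equally.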

02 · Approach

The method introduces model-free verifiable process supervision by probing the model's belief in the correct answer at each step of the reasoning trajectory. The generation is segmented into discrete steps, and at each segment boundary, the conditional probability of the correct answer is computed. These segment-wise progress measurements are then used to refine GRPO's trajectory-level feedback, enabling more targeted policy updates without requiring Monte Carlo rollouts or auxiliary models.
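The segment-wise measurement described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the answer probability is taken as the exponentiated sum of the answer tokens' log-probabilities, progress is taken as the change in that probability between consecutive boundaries, and the additive combination with weight `beta` is a hypothetical choice. Consult the PDF for the actual formulation. All function names are illustrative.

```python
import math


def answer_probability(answer_token_logprobs):
    """Probability of the full correct answer given the reasoning prefix.

    Assumption: the per-token log-probabilities of the answer tokens
    come from one forward pass of the policy model at a boundary.
    """
    return math.exp(sum(answer_token_logprobs))


def segment_progress(boundary_probs):
    """Change in answer probability across each segment.

    boundary_probs[0] is measured before any reasoning; boundary_probs[k]
    is measured after segment k.
    """
    return [after - before
            for before, after in zip(boundary_probs, boundary_probs[1:])]


def refine_advantages(trajectory_advantage, progress, beta=1.0):
    """Hypothetical refinement: shift the shared trajectory-level
    advantage by each segment's progress, scaled by beta."""
    return [trajectory_advantage + beta * delta for delta in progress]


# Illustrative probabilities at four boundaries of one response.
probs = [0.10, 0.30, 0.25, 0.80]
progress = segment_progress(probs)
segment_advantages = refine_advantages(1.0, progress, beta=0.5)
```

In this toy trace the second segment lowers the model's belief in the correct answer, so it ends up with a smaller advantage than its neighbours even though the response as a whole was rewarded.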

03 · Key insights

What the paper shows.

01 Process-level supervision can be derived from the model's own conditional probabilities rather than external reward models or auxiliary supervision

02 Tracking confidence in correct answers at segment boundaries provides interpretable and efficient credit assignment signals

03 The approach is model-free and verifiable, avoiding the computational overhead of critic models or complex rollout procedures

04 Fine-grained step-level feedback enables both accuracy improvements and reasoning-length reductions simultaneously

04 · Results

Experiments demonstrate consistent improvements over GRPO across multiple benchmarks. On mathematical reasoning tasks, the method improves accuracy by up to 2.6 points and reduces reasoning length by up to 13.7%. On general-domain tasks, accuracy improves by up to 2.4 points with a 4% reduction in reasoning length. Results generalize across diverse model sizes and architectures.

05 · Limitations

The paper does not explicitly discuss limitations, but implicit constraints include: the approach relies on the model's ability to meaningfully express confidence through conditional probabilities, which may vary across different model architectures; the method's effectiveness may depend on the quality of step segmentation; and evaluation is limited to specific benchmark domains without exploration of performance on more diverse or adversarial reasoning tasks.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.