✨ TL;DR
Near-Future Policy Optimization (NPO) improves reinforcement learning with verifiable rewards by learning from a policy's own near-future checkpoints, which are stronger than the current policy yet closer to it than any external source. On the vision-language model Qwen3-VL-8B-Instruct, the automated variant (AutoNPO) raises average performance from 57.88 to 63.15.
Reinforcement learning with verifiable rewards (RLVR) can mix off-policy trajectories with on-policy exploration to accelerate convergence and improve performance. However, existing mixed-policy approaches face a fundamental trade-off: external teacher trajectories are high quality but distributionally distant from the current policy, while replayed past trajectories are close but capped in quality. Neither source satisfies both conditions needed to maximize the effective learning signal: being strong enough to provide new knowledge and close enough to be readily absorbed.
NPO uses a policy's own near-future checkpoint as the source of auxiliary trajectories. This naturally balances trajectory quality against variance cost: a later checkpoint from the same training run is inherently stronger than the current policy while remaining closer to it than any external source. The paper validates NPO through two manual interventions: early-stage bootstrapping and late-stage plateau breakthrough. AutoNPO extends this by triggering interventions automatically from online training signals and selecting the guide checkpoint that maximizes the learning signal S = Q/V, the ratio of trajectory quality (Q) to variance cost (V).
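As a rough illustration of the selection rule S = Q/V, the sketch below picks, among candidate near-future checkpoints, the one with the largest ratio of quality to variance cost. It is not the paper's implementation: the `Candidate` fields, the way Q and V are estimated, and the `plateaued` trigger (a simple windowed reward comparison) are all hypothetical stand-ins, since the summary does not specify how AutoNPO computes these quantities or which online signals it monitors.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate guide checkpoint; all fields are hypothetical estimates."""
    step: int        # training step of the near-future checkpoint
    quality: float   # Q: estimated quality gain over the current policy
    variance: float  # V: estimated variance cost of learning from it


def learning_signal(c: Candidate, eps: float = 1e-8) -> float:
    """S = Q / V, with a small floor on V to avoid division by zero."""
    return c.quality / max(c.variance, eps)


def select_guide(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the largest S, or None if none adds quality."""
    useful = [c for c in candidates if c.quality > 0]
    if not useful:
        return None
    return max(useful, key=learning_signal)


def plateaued(rewards: Sequence[float], window: int = 3,
              min_gain: float = 0.01) -> bool:
    """Assumed trigger: the mean reward of the last window improved by
    less than min_gain over the window before it."""
    if len(rewards) < 2 * window:
        return False
    recent = sum(rewards[-window:]) / window
    previous = sum(rewards[-2 * window:-window]) / window
    return recent - previous < min_gain


# Usage: a farther checkpoint has higher Q but pays a larger variance cost.
candidates = [
    Candidate(step=100, quality=0.02, variance=0.5),
    Candidate(step=200, quality=0.06, variance=1.0),
    Candidate(step=400, quality=0.08, variance=4.0),
]
if plateaued([0.50, 0.50, 0.50, 0.50, 0.50, 0.50]):
    guide = select_guide(candidates)
```

In this toy setup the middle checkpoint wins: the farthest one is the strongest but its variance cost grows faster than its quality, which mirrors the trade-off described above.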
What the paper shows.
On Qwen3-VL-8B-Instruct trained with GRPO, NPO improves average performance from 57.88 to 62.84 (a 4.96-point gain), and AutoNPO pushes it further to 63.15 (a 5.27-point gain over the baseline). The method raises the final performance ceiling while also accelerating convergence during training.
The paper does not explicitly discuss limitations, but implicit constraints include: evaluation is demonstrated primarily on a single model architecture (Qwen3-VL-8B-Instruct), the computational cost of maintaining and evaluating future checkpoints is not analyzed, the method's applicability to other RLVR settings beyond vision-language models is unclear, and the sensitivity to hyperparameters such as checkpoint selection intervals is not thoroughly explored.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.