Near-Future Policy Optimization

Chuanyu Qin; Chenxu Yang; Qingyi Si; Naibin Gu; Dingyu Yao; Zheng Lin; Peng Fu; Nan Duan; Jiaqi Wang

✨ TL;DR

Near-Future Policy Optimization (NPO) improves reinforcement learning with verifiable rewards by learning from a policy's own future checkpoints, which are stronger than the current policy yet closer to it than any external source. On Qwen3-VL-8B-Instruct trained with GRPO, the automated variant (AutoNPO) raises the average score from 57.88 to 63.15.

01 · Problem

Reinforcement learning with verifiable rewards (RLVR) can converge faster and reach higher performance when off-policy trajectories are mixed into on-policy exploration. However, existing mixed-policy approaches face a fundamental trade-off: external teacher trajectories are high quality but distributionally distant from the current policy, while replayed past trajectories are close but capped in quality. Neither source satisfies both conditions for a strong effective learning signal: being strong enough to provide new knowledge and close enough to be readily absorbed.

02 · Approach

NPO uses a policy's own near-future checkpoint as the source of auxiliary trajectories. This naturally balances trajectory quality against variance cost: a later checkpoint from the same training run is inherently stronger than the current policy while remaining closer to it than any external source. The paper validates NPO through two manual interventions: early-stage bootstrapping and late-stage plateau breakthrough. AutoNPO extends this by triggering interventions automatically from online training signals and selecting the guide checkpoint that maximizes the effective learning signal S = Q/V (trajectory quality Q divided by variance cost V).
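The checkpoint-selection rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Candidate` class, the `learning_signal` and `select_guide` helpers, and all numbers are hypothetical, and how Q and V are actually estimated is not specified in this summary.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate guide checkpoint (hypothetical container, illustrative fields)."""
    step: int        # training step at which the checkpoint was saved
    quality: float   # Q: assumed to be some estimate of trajectory quality
    variance: float  # V: assumed to be some estimate of the variance cost


def learning_signal(c: Candidate, eps: float = 1e-8) -> float:
    """Effective learning signal S = Q / V (eps guards against V == 0)."""
    return c.quality / (c.variance + eps)


def select_guide(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the largest S = Q / V."""
    if not candidates:
        raise ValueError("need at least one candidate checkpoint")
    return max(candidates, key=learning_signal)


# Made-up numbers: checkpoints further in the future are assumed to be
# stronger (higher Q) but also further from the current policy (higher V).
candidates = [
    Candidate(step=100, quality=0.02, variance=0.10),
    Candidate(step=200, quality=0.06, variance=0.15),
    Candidate(step=400, quality=0.08, variance=0.40),
]
best = select_guide(candidates)
print(best.step)
```

With these invented values the middle checkpoint wins: the nearest one adds too little quality, and the farthest one pays too much variance cost for its extra quality.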

03 · Key insights

What the paper shows.

01 Near-future checkpoints from the same training run provide an optimal middle ground between external teachers and past trajectories in the quality-distance trade-off

02 The effective learning signal S = Q/V captures the balance between trajectory quality (Q) and variance cost (V), providing a principled metric for selecting auxiliary trajectories

03 Automatic detection of intervention opportunities (early-stage bootstrapping and late-stage plateau breakthrough) enables adaptive application of NPO without manual tuning

04 The method is compatible with existing RLVR frameworks like GRPO and can be applied to large vision-language models
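To make the idea of automatically detecting a late-stage plateau concrete, here is one possible trigger written in Python. It is only a sketch: this summary does not say which online signals AutoNPO monitors, so the `PlateauDetector` class, the windowed mean-reward comparison, and the `window` and `min_gain` parameters are all assumptions chosen for illustration.

```python
from collections import deque


class PlateauDetector:
    """Hypothetical trigger: flag a plateau when the mean reward of the latest
    window fails to beat the mean of the previous window by at least min_gain."""

    def __init__(self, window: int = 5, min_gain: float = 0.005):
        self.window = window
        self.min_gain = min_gain
        # Keep exactly two windows of history: "older" followed by "recent".
        self.history = deque(maxlen=2 * window)

    def update(self, mean_reward: float) -> bool:
        """Record one training-step reward; return True if a plateau is detected."""
        self.history.append(mean_reward)
        if len(self.history) < 2 * self.window:
            return False  # not enough data to compare two windows yet
        values = list(self.history)
        older = sum(values[: self.window]) / self.window
        recent = sum(values[self.window:]) / self.window
        return (recent - older) < self.min_gain


# Made-up reward curve: steady improvement followed by a flat region.
rewards = [0.30, 0.35, 0.40, 0.45, 0.50, 0.52,
           0.53, 0.53, 0.53, 0.53, 0.53, 0.53]
detector = PlateauDetector(window=3, min_gain=0.005)
triggered = [i for i, r in enumerate(rewards) if detector.update(r)]
print(triggered)
```

In a training loop, a True return value would be the point at which a guide checkpoint is chosen and its trajectories are mixed in. An early-stage bootstrapping trigger would need a different condition, which is not attempted here.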

04 · Results

On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84 (a 4.96-point improvement), and AutoNPO pushes it further to 63.15 (a 5.27-point improvement over the baseline). The method raises the final performance ceiling while accelerating convergence during training.
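The reported gains can be checked with simple arithmetic. The short Python snippet below uses only the three scores quoted above; the relative percentages it prints are derived here and are not figures from the paper.

```python
baseline = 57.88  # GRPO average score reported above
scores = {"NPO": 62.84, "AutoNPO": 63.15}

# Absolute gain over the GRPO baseline, in points.
gains = {name: round(score - baseline, 2) for name, score in scores.items()}

for name, gain in gains.items():
    relative = 100 * gain / baseline  # derived, not reported in the paper
    print(f"{name}: +{gain:.2f} points ({relative:.1f}% relative)")

# Extra margin that the automatic variant adds on top of manual NPO.
auto_margin = round(scores["AutoNPO"] - scores["NPO"], 2)
print(f"AutoNPO over NPO: +{auto_margin:.2f} points")
```

Most of the improvement comes from NPO itself; the automatic triggering and checkpoint selection add a smaller margin on top.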

05 · Limitations

The paper does not explicitly discuss limitations, but implicit constraints include: evaluation is demonstrated primarily on a single model architecture (Qwen3-VL-8B-Instruct), the computational cost of maintaining and evaluating future checkpoints is not analyzed, the method's applicability to other RLVR settings beyond vision-language models is unclear, and the sensitivity to hyperparameters such as checkpoint selection intervals is not thoroughly explored.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.

