V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

Yubo Jiang; Yitong An; Xin Yang; Abudukelimu Wuerkaixi; Xuxin Cheng; Fengying Xie; Zhiguo Jiang; Cao Liu; Ke Zeng; Haopeng Zhang

✨ TL;DR

V-tableR1 is a reinforcement learning framework that trains multimodal language models to perform rigorous, step-by-step reasoning on tables rather than relying on pattern matching. It uses a critic model to provide feedback on visual reasoning chains and a novel optimization algorithm (PGPO) to improve performance.

01 · Problem

Current multimodal large language models trained on final outcomes alone treat visual reasoning as a black box, relying on superficial pattern matching rather than genuine multi-step inference. Reinforcement learning with verifiable rewards could enforce transparent reasoning, but extending it to visual domains is challenging because abstract logic is hard to ground verifiably in continuous pixel space.

02 · Approach

V-tableR1 leverages tables as an ideal visual testbed due to their deterministic grid structure. The system uses a specialized critic VLM to provide dense, step-level feedback on explicit visual chain-of-thought reasoning generated by a policy VLM. The authors propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm that integrates process rewards, decoupled policy constraints, and length-aware dynamic sampling to optimize the policy model.
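The summary names PGPO's three ingredients but gives no formulas, so the following is only a minimal sketch of how such pieces could fit together, not the authors' algorithm. It assumes a weighted blend of outcome and critic step scores, GRPO-style group-normalized advantages, a PPO-style surrogate with separate lower and upper clip ranges as one reading of "decoupled policy constraints", and a group filter as one reading of "length-aware dynamic sampling". Every function name, weight, and threshold here is hypothetical.

```python
from statistics import mean, pstdev


def combined_reward(outcome_correct, step_scores, process_weight=0.5):
    """Blend a binary outcome reward with the mean critic score over steps.

    process_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    outcome = 1.0 if outcome_correct else 0.0
    process = mean(step_scores) if step_scores else 0.0
    return (1.0 - process_weight) * outcome + process_weight * process


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within the group of samples drawn for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio, advantage, clip_low=0.2, clip_high=0.28):
    """PPO-style surrogate with separate lower and upper clip ranges (assumed)."""
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


def keep_group(rewards, lengths, max_len):
    """Assumed sampling filter: keep a group only if its rewards differ
    and at least one sample finished below the length cap."""
    has_signal = max(rewards) - min(rewards) > 1e-9
    all_truncated = all(n >= max_len for n in lengths)
    return has_signal and not all_truncated


if __name__ == "__main__":
    # Two sampled chains for one prompt, with critic scores per reasoning step.
    rewards = [
        combined_reward(True, [1.0, 1.0, 0.5]),
        combined_reward(False, [1.0, 0.0, 0.0]),
    ]
    if keep_group(rewards, lengths=[120, 90], max_len=512):
        for adv in group_advantages(rewards):
            print(clipped_objective(ratio=1.1, advantage=adv))
```

In this sketch a chain that reaches the right answer through poorly rated steps earns less than one with well-rated steps, which is the general effect process-level rewards are meant to have.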

03 · Key insights

What the paper shows.

01 Tables' deterministic grid structure provides an ideal domain for grounding abstract logic into verifiable visual reasoning.

02 Process-level supervision through a critic VLM can effectively penalize visual hallucinations and shortcut guessing.

03 Decoupling policy constraints and using length-aware dynamic sampling improves RL optimization in the multimodal setting.

04 Shifting from black-box pattern matching to verifiable logical derivation enables smaller models to outperform much larger competitors.
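To illustrate why a grid makes reasoning steps checkable, here is a small sketch that compares cell references made in a reasoning chain against a known table. It is a purely symbolic toy: the paper's critic is a VLM operating on table images, and the `CellClaim` type and `verify_claims` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class CellClaim:
    """A reasoning step's assertion that table[row][col] holds `value`."""
    row: int
    col: int
    value: str


def verify_claims(table, claims):
    """Return one bool per claim: True when the claimed value matches the cell."""
    results = []
    for c in claims:
        in_bounds = 0 <= c.row < len(table) and 0 <= c.col < len(table[c.row])
        results.append(in_bounds and table[c.row][c.col].strip() == c.value.strip())
    return results


if __name__ == "__main__":
    table = [
        ["Year", "Revenue"],
        ["2021", "10"],
        ["2022", "14"],
    ]
    claims = [
        CellClaim(1, 1, "10"),  # correct lookup
        CellClaim(2, 1, "12"),  # misread value
        CellClaim(5, 0, "x"),   # cell that does not exist
    ]
    print(verify_claims(table, claims))  # [True, False, False]
```

Because every cell has a unique row and column address, a misread value or a reference to a nonexistent cell can be flagged at the step where it occurs rather than only at the final answer.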

04 · Results

V-tableR1 4B achieves state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18 times its size and showing significant improvements over its supervised fine-tuning baseline. The framework explicitly penalizes visual hallucinations and shortcut guessing through process-level feedback.

05 · Limitations

The approach is designed and evaluated only for table reasoning; generalization to visual domains with less structured layouts is not addressed. The paper also does not discuss the computational cost of training with the critic VLM.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.

