Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling

Yidi Yuan

✨ TL;DR

This paper shows that applying Semantic Tube Prediction (STP) at reasoning step boundaries instead of random token positions dramatically improves multi-step latent prediction in LLMs: a 168x improvement over a frozen baseline, versus 4x for random-token STP. The result indicates that sampling position is critical for geometric regularization of reasoning trajectories.

01 · Problem

Semantic Tube Prediction (STP) is a technique that regularizes LLM hidden-state trajectories toward locally linear geodesics during fine-tuning to improve data efficiency. The original STP approach samples random token sub-spans, but it remains unclear whether the choice of sampling position affects the semantic structure and geometric properties of multi-step reasoning trajectories. Understanding how sampling strategy impacts the geometric regularization of reasoning paths is important for optimizing LLM training and improving reasoning capabilities.

02 · Approach

The researchers modified the STP approach by applying it at consecutive semantic reasoning step boundaries rather than random token positions. They evaluated this step-boundary STP against frozen baselines and random-token STP using ProcessBench (3,400 samples), measuring multi-step latent prediction accuracy. To probe the geometric properties of the resulting trajectories, they used both linear extrapolation and a learned 3-layer MLP predictor to analyze the latent manifold structure. They also investigated the tradeoff between language modeling loss and geometric purity by training models with and without the language modeling objective.
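The difference between the two sampling strategies can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-step token counts are made up, the helper names are hypothetical, and grouping anchors into overlapping triples is only an assumption about how a regularizer might consume consecutive boundaries.

```python
import random


def step_boundary_positions(step_token_counts):
    """Return the index of the last token of each reasoning step."""
    positions = []
    end = 0
    for count in step_token_counts:
        end += count
        positions.append(end - 1)
    return positions


def random_token_positions(total_tokens, k, seed=0):
    """Return k distinct token indices drawn uniformly at random, sorted."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_tokens), k))


def consecutive_triples(positions):
    """Group anchors into overlapping triples of consecutive positions.

    Assumption: a geometric regularizer would compare the hidden states
    at three consecutive anchors; the source does not specify this.
    """
    return [tuple(positions[i:i + 3]) for i in range(len(positions) - 2)]


# Hypothetical solution with four reasoning steps of these token lengths.
step_token_counts = [12, 7, 20, 9]

boundaries = step_boundary_positions(step_token_counts)
triples = consecutive_triples(boundaries)
baseline = random_token_positions(sum(step_token_counts), k=len(boundaries))

print(boundaries)  # [11, 18, 38, 47]
print(triples)     # [(11, 18, 38), (18, 38, 47)]
```

Step-boundary anchors always fall where one reasoning step ends, whereas the random baseline can land anywhere inside a step, which is the variable the comparison isolates.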

03 · Key insights

What the paper shows.

01 · Sampling position is the critical variable in geometric regularization: step-boundary STP achieves a 168x improvement in multi-step latent prediction over the frozen baseline, compared with only 4x for random-token STP.

02 · STP-shaped trajectories form smooth curves rather than straight lines in latent space, as evidenced by 3-layer MLPs reducing prediction error 3-12x beyond linear extrapolation.

03 · There is a tradeoff between generation quality and geometric purity: removing the language modeling loss yields trajectories that are 2x more predictable by MLPs.

04 · Multi-step latent prediction MSE serves as an effective evaluation metric for geometric regularization methods in reasoning trajectories.
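The metric and the linear-extrapolation probe named in the insights above can be sketched in a few lines. This is an illustrative sketch under assumptions: the function names are hypothetical, the toy 2-D trajectories stand in for hidden states taken at step boundaries, and the learned MLP predictor is not reproduced here.

```python
def linear_extrapolate(prev, curr, k):
    """Predict the state k steps ahead by repeating the last displacement."""
    return [c + k * (c - p) for p, c in zip(prev, curr)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_step_mse(states, k):
    """Average MSE of k-step-ahead linear extrapolation along a trajectory."""
    errors = []
    for t in range(1, len(states) - k):
        pred = linear_extrapolate(states[t - 1], states[t], k)
        errors.append(mse(pred, states[t + k]))
    if not errors:
        raise ValueError("trajectory too short for the requested horizon k")
    return sum(errors) / len(errors)


# Toy trajectories (made up): one straight line, one smooth parabola.
straight = [[float(t), 2.0 * t] for t in range(6)]
curved = [[float(t), float(t * t)] for t in range(6)]

print(multi_step_mse(straight, k=2))  # 0.0
print(multi_step_mse(curved, k=2))    # 18.0
```

A straight trajectory is predicted exactly, while a smooth but curved one accumulates error with the horizon. That gap is what a nonlinear predictor such as an MLP can close, which is how the comparison separates "smooth curve" from "straight line".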

04 · Results

On ProcessBench with 3,400 samples, step-boundary STP achieved 168x more accurate multi-step latent prediction than frozen baselines, vastly outperforming the 4x improvement of random-token STP. When trajectory geometry was probed with a learned 3-layer MLP predictor, the predictor reached 3-12x lower prediction error than linear extrapolation, indicating that the trajectories are smooth curves rather than straight lines. Models trained without the language modeling loss produced trajectories that were 2x more MLP-predictable than those trained with the combined loss, though this came at the cost of generation quality.
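One plausible reading of the "Nx" figures, assuming they are ratios of multi-step latent prediction MSE, is written out below. The notation is an assumption introduced here, not taken from the paper: $h_t$ is the hidden state at step boundary $t$, $\hat{h}_{t+k}$ is the $k$-step-ahead prediction, $d$ is the hidden dimension, and $\mathcal{T}$ is the set of evaluated positions.

```latex
\mathrm{MSE}_k
  = \frac{1}{|\mathcal{T}|\, d} \sum_{t \in \mathcal{T}}
    \bigl\lVert \hat{h}_{t+k} - h_{t+k} \bigr\rVert_2^2,
\qquad
\text{improvement}
  = \frac{\mathrm{MSE}_k^{\text{frozen}}}{\mathrm{MSE}_k^{\text{method}}}
```

Under this reading, "168x" would mean that the step-boundary model's error is 1/168 of the frozen baseline's error at the same horizon; the PDF should be consulted for the exact definition.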

05 · Limitations

The paper does not explicitly discuss limitations, but several are implicit in the work. The tradeoff between geometric purity and generation quality suggests that optimizing for trajectory smoothness may compromise language generation performance. The evaluation is limited to ProcessBench with 3,400 samples, so generalization to other reasoning tasks and larger datasets is unclear. The paper focuses on prediction accuracy as the primary metric but does not thoroughly evaluate downstream task performance or reasoning accuracy. Additionally, the computational overhead of step-boundary sampling compared to random-token sampling is not discussed, nor is the sensitivity to how reasoning step boundaries are defined or detected.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
