Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees

Xueyan Li; Johannes Zenn; Ekaterina Fadeeva; Guinan Su; Mrinmaya Sachan; Jonas Geiping

✨ TL;DR

Distinct Leaf Enumeration (DLE) is a deterministic decoding method that enumerates distinct reasoning paths in a truncated decoding tree instead of sampling with replacement, improving inference efficiency and performance on math, coding, and general reasoning tasks.

01 · Problem

Self-consistency sampling improves inference performance by generating multiple reasoning traces and voting on answers. However, this approach is computationally inefficient in constrained domains like mathematics and code because it repeatedly samples the same high-probability prefixes and generates duplicate completions, wasting compute budget on redundant exploration.
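To make the redundancy concrete, here is a minimal Python sketch. The toy next-token table `TOY_MODEL`, the helper names, and the sample count are hypothetical stand-ins, not taken from the paper. The point is only that when the truncated tree has fewer leaves than the sampling budget, repeated completions are unavoidable.

```python
import random
from collections import Counter

# Hypothetical toy "model": next-token distributions (already truncated),
# keyed by the prefix generated so far. Prefixes missing from the table are leaves.
TOY_MODEL = {
    (): {"A": 0.7, "B": 0.3},
    ("A",): {"x": 0.9, "y": 0.1},
    ("B",): {"x": 0.5, "y": 0.5},
}

def sample_trace(rng):
    """Sample one complete trace token by token from the toy model."""
    prefix = ()
    while prefix in TOY_MODEL:
        tokens, probs = zip(*TOY_MODEL[prefix].items())
        prefix += (rng.choices(tokens, weights=probs, k=1)[0],)
    return prefix

def self_consistency_samples(n, seed=0):
    """Draw n independent traces, i.e. sampling with replacement."""
    rng = random.Random(seed)
    return [sample_trace(rng) for _ in range(n)]

samples = self_consistency_samples(16)
counts = Counter(samples)
distinct = len(counts)
wasted = len(samples) - distinct

# The toy tree has only 4 leaves, so at least 12 of the 16 samples are repeats.
print(f"{distinct} distinct traces out of {len(samples)} samples; {wasted} duplicates")
```

In this toy setting every duplicate costs a full generation while adding no new candidate to the vote.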

02 · Approach

Distinct Leaf Enumeration (DLE) treats truncated sampling as traversal of a pruned decoding tree and deterministically enumerates distinct leaf nodes rather than sampling with replacement. The method improves efficiency in two ways. Algorithmically, it explores previously unvisited high-probability branches, which increases coverage of the search space. At the systems level, it reuses shared prefixes, which reduces redundant token generation.
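The idea of enumerating distinct leaves can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the best-first visiting order, the top-p truncation rule, the toy table `TOY_MODEL`, and all function names are choices made here for the example, and path probabilities are left unnormalized after truncation.

```python
import heapq
import math

# Hypothetical toy next-token table; prefixes missing from it are leaves.
TOY_MODEL = {
    (): {"A": 0.7, "B": 0.3},
    ("A",): {"x": 0.9, "y": 0.1},
    ("B",): {"x": 0.5, "y": 0.5},
}

def truncate_top_p(dist, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    kept, mass = [], 0.0
    for tok, p in sorted(dist.items(), key=lambda kv: -kv[1]):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return kept

def enumerate_distinct_leaves(model, budget, top_p=0.9):
    """Return up to `budget` distinct leaves, most probable paths first.

    Assumption: best-first order by cumulative log-probability. Each leaf is
    reached by exactly one path, so no completion is ever produced twice.
    """
    heap = [(0.0, ())]  # entries are (negative log-probability, prefix)
    leaves = []
    while heap and len(leaves) < budget:
        neg_logp, prefix = heapq.heappop(heap)
        if prefix not in model:  # no continuation in the table: a leaf
            leaves.append((prefix, math.exp(-neg_logp)))
            continue
        for tok, p in truncate_top_p(model[prefix], top_p):
            heapq.heappush(heap, (neg_logp - math.log(p), prefix + (tok,)))
    return leaves

leaves = enumerate_distinct_leaves(TOY_MODEL, budget=16)
for path, prob in leaves:
    print(path, round(prob, 3))
```

With `top_p=0.9` the branch ("A", "y") is pruned, so the truncated tree has three leaves and the enumeration stops once all of them have been visited, even though the budget allows 16. A real language model would replace the table lookup with a forward pass.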

03 · Key insights

What the paper shows.

01 Sampling with replacement in self-consistency wastes compute by repeatedly visiting identical high-probability prefixes and generating duplicate completions.

02 Deterministic enumeration of distinct leaves in a truncated decoding tree provides better coverage of the search space under fixed computational budgets.

03 Prefix reuse in tree-based enumeration significantly reduces redundant token generation compared to independent sampling.

04 Systematic exploration of high-probability branches yields higher-quality reasoning traces than stochastic sampling approaches.
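The prefix-reuse saving in insight 03 can be illustrated with simple token counting. The sketch below is an assumption-laden toy: the example traces and both function names are invented here, and real savings depend on how a serving stack caches shared prefixes.

```python
def tokens_without_reuse(traces):
    """Independent sampling: every trace is generated from scratch."""
    return sum(len(t) for t in traces)

def tokens_with_prefix_reuse(traces):
    """Tree-based generation: each distinct non-empty prefix is produced once.

    The count equals the number of edges in the trie built from the traces.
    """
    edges = {t[:i] for t in traces for i in range(1, len(t) + 1)}
    return len(edges)

# Hypothetical tokenized traces sharing leading tokens.
traces = [
    ("Let", "x", "=", "2"),
    ("Let", "x", "=", "3"),
    ("Let", "y", "=", "2"),
]

independent = tokens_without_reuse(traces)   # 12
shared = tokens_with_prefix_reuse(traces)    # 8
print(f"{independent} tokens generated independently vs {shared} with prefix reuse")
```

The gap between the two counts grows with the length of the shared prefixes, which is why traces that diverge late benefit most.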

04 · Results

DLE outperforms stochastic self-consistency on math, coding, and general reasoning tasks by exploring higher-quality reasoning traces. The method achieves better performance while maintaining or improving computational efficiency through reduced redundant token generation and more systematic coverage of the truncated search space.

05 · Limitations

The paper does not explicitly discuss limitations, though the approach appears tailored to constrained domains with structured outputs (math, code) where duplicate completions are common. Applicability to open-ended generation tasks or domains with high output diversity is not addressed. The specific computational savings and performance gains are not quantified with detailed metrics or comparisons across different budget constraints.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.
