FUSE: Ensembling Verifiers with Zero Labeled Data

Joonhyuk Lee; Virginia Ma; Sarah Zhao; Yash Nair; Asher Spector; Regev Cohen; Emmanuel J. Candès

✨ TL;DR

FUSE combines multiple imperfect AI verifiers to judge model outputs more accurately without any labeled training data. It applies spectral ensembling algorithms while controlling conditional dependencies between verifiers, and it typically matches or improves on semi-supervised baselines across diverse benchmarks.

01 · Problem

Verifying whether large language model outputs are correct is critical for both training and deployment, but obtaining ground truth labels is expensive and time-consuming. In practice, people use imperfect LLM judges and reward models as verifiers, but these individual verifiers are unreliable. While ensembling multiple verifiers could improve performance, existing ensemble methods typically require labeled data to learn how to combine verifiers effectively. This creates a chicken-and-egg problem: you need labels to build better verifiers, but the whole point of automated verification is to avoid manual labeling.

02 · Approach

FUSE (Fully Unsupervised Score Ensembling) combines multiple verifiers without any ground truth labels by leveraging spectral algorithms from the ensembling literature. The key innovation is controlling conditional dependencies between verifiers to improve unsupervised ensemble performance. Rather than learning weights from labeled data, FUSE uses the statistical properties of how verifiers agree and disagree with each other to infer which combinations are most reliable. The method works with diverse types of verifiers including LLM judges and reward models, and can be applied at test time to improve verification quality for any generator model.
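To make the idea of inferring reliability from agreement alone concrete, here is a minimal Python sketch of a classical spectral ensemble. It is not FUSE itself: it assumes binary votes in {-1, +1}, verifiers whose errors are conditionally independent given the true label, and a majority of verifiers that are better than chance. Under those assumptions the off-diagonal entries of the vote covariance matrix are approximately rank one, and the leading eigenvector ranks verifiers by reliability. The function names (simulate_votes, spectral_weights, fuse) are hypothetical and chosen for illustration only.

```python
import numpy as np


def simulate_votes(n_items, accuracies, seed=0):
    """Toy data (an assumption, not from the paper): balanced labels in
    {-1, +1}, and each verifier errs independently of the others."""
    rng = np.random.default_rng(seed)
    labels = rng.choice([-1, 1], size=n_items)
    correct = rng.random((n_items, len(accuracies))) < np.asarray(accuracies)
    votes = np.where(correct, labels[:, None], -labels[:, None])
    return votes, labels


def spectral_weights(votes, n_iter=50):
    """Estimate per-verifier reliabilities from agreement statistics alone.

    votes: (n_items, n_verifiers) array with entries in {-1, +1}.
    No ground-truth labels are used anywhere in this function.
    """
    cov = np.cov(votes, rowvar=False)  # (n_verifiers, n_verifiers)
    fitted = cov.copy()
    for _ in range(n_iter):
        # eigh returns eigenvalues in ascending order, so the last column
        # of eigvecs is the leading eigenvector.
        eigvals, eigvecs = np.linalg.eigh(fitted)
        lead = eigvecs[:, -1] * np.sqrt(max(eigvals[-1], 0.0))
        # Only the off-diagonal entries follow the rank-one model, so
        # overwrite the diagonal with the current rank-one fit and repeat.
        np.fill_diagonal(fitted, lead ** 2)
    # Eigenvectors are defined up to sign; resolve it by assuming that
    # most verifiers are better than chance.
    if lead.sum() < 0:
        lead = -lead
    return lead


def fuse(votes, weights):
    """Weighted vote: a positive fused score means 'judged correct'."""
    return votes @ weights
```

On the toy data, calling spectral_weights(votes) followed by fuse(votes, weights) should give the most accurate simulated verifier the largest weight. When verifier errors are strongly dependent, the rank-one assumption breaks down and a sketch like this can be misled, which is the kind of conditional dependence the summary says FUSE is designed to control.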

03 · Key insights

What the paper shows.

01 Verification quality can be improved through unsupervised ensembling by exploiting the structure of conditional dependencies between imperfect verifiers.

02 Zero-shot ensemble methods can match or exceed semi-supervised alternatives that have access to labeled training data.

03 The approach generalizes across different types of verifiers (LLM judges, reward models), generator models, and problem domains.

04 Spectral algorithms from the ensembling literature can be adapted for the verification setting by carefully controlling how verifier outputs relate to each other.

04 · Results

FUSE demonstrates strong performance across multiple benchmarks including GPQA Diamond, Humanity's Last Exam, and IMO Shortlist questions. The method typically matches or improves upon semi-supervised baselines despite using zero labeled data. The approach proves effective in test-time scaling experiments with diverse combinations of generator models and verifiers, showing that unsupervised ensembling can be a practical alternative to methods requiring ground truth labels. Performance holds across both conventional academic benchmarks and frontier, unsaturated benchmarks where problems remain challenging even for state-of-the-art models.
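As an illustration of how an ensemble score could be used in a test-time scaling setup, the following Python sketch performs best-of-n selection: for each prompt, the candidate with the highest fused score is kept. The array layout and the function names (select_best_of_n, selection_accuracy) are assumptions made for this example, not details from the paper, and the accuracy helper needs ground-truth labels, so it is only for offline evaluation.

```python
import numpy as np


def select_best_of_n(fused_scores):
    """Pick one candidate per prompt.

    fused_scores: (n_prompts, n_candidates) array of ensemble scores,
    where higher means 'more likely correct' (assumed convention).
    Returns the index of the selected candidate for each prompt.
    """
    fused_scores = np.asarray(fused_scores, dtype=float)
    return fused_scores.argmax(axis=1)


def selection_accuracy(chosen, correct):
    """Fraction of prompts whose selected candidate is actually correct.

    correct: (n_prompts, n_candidates) boolean array of ground truth,
    used for evaluation only, never for selection.
    """
    correct = np.asarray(correct, dtype=bool)
    rows = np.arange(correct.shape[0])
    return float(correct[rows, chosen].mean())
```

Selection itself uses no labels, so the same routine can run at deployment time with any generator model that produces multiple candidates per prompt.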

05 · Limitations

The paper does not explicitly detail failure modes or conditions under which FUSE might underperform. The reliance on spectral algorithms and conditional dependency assumptions may limit effectiveness when verifiers have certain correlation structures. The method's performance likely depends on having access to multiple diverse verifiers, which may not always be available. Additionally, while the paper tests on several benchmarks, generalization to domains with fundamentally different verification characteristics remains unclear. The computational cost of the spectral algorithms and their scalability to very large numbers of verifiers are not discussed.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
