✨ TL;DR
FUSE combines multiple imperfect AI verifiers to better judge model outputs without needing any labeled training data. It typically matches or beats semi-supervised methods across diverse benchmarks by applying spectral ensembling algorithms while controlling how verifiers depend on each other.
Verifying whether large language model outputs are correct is critical for both training and deployment, but obtaining ground truth labels is expensive and time-consuming. In practice, people use imperfect LLM judges and reward models as verifiers, but these individual verifiers are unreliable. While ensembling multiple verifiers could improve performance, existing ensemble methods typically require labeled data to learn how to combine verifiers effectively. This creates a chicken-and-egg problem: you need labels to build better verifiers, but the whole point of automated verification is to avoid manual labeling.
FUSE (Fully Unsupervised Score Ensembling) combines multiple verifiers without any ground truth labels by leveraging spectral algorithms from the ensembling literature. The key innovation is controlling conditional dependencies between verifiers to improve unsupervised ensemble performance. Rather than learning weights from labeled data, FUSE uses the statistical properties of how verifiers agree and disagree with each other to infer which combinations are most reliable. The method works with diverse types of verifiers including LLM judges and reward models, and can be applied at test time to improve verification quality for any generator model.
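To make the idea of label-free weighting concrete, here is a minimal sketch of a classical spectral ensembling estimate. It is not the paper's FUSE algorithm. It assumes binary verdicts in {-1, +1}, verifiers whose errors are conditionally independent given the true label, and a majority of verifiers that are better than chance. Under those assumptions the off-diagonal part of the verifiers' covariance matrix is approximately rank one, and its leading eigenvector ranks verifiers by reliability. The names `spectral_weights`, `combine`, and `simulate_votes` are hypothetical and exist only for this illustration; the sketch does nothing to control conditional dependencies, which the summary above describes as FUSE's key addition.

```python
import numpy as np


def spectral_weights(votes, n_iter=50):
    """Estimate per-verifier weights from agreement statistics alone.

    votes: (m, n) array with entries in {-1, +1}; m verifiers, n outputs.
    Assumes conditional independence given the (unknown) true label.
    """
    r = np.cov(votes).astype(float)  # (m, m) covariance; rows are verifiers
    np.fill_diagonal(r, 0.0)         # the diagonal does not follow the rank-one form
    for _ in range(n_iter):
        eigvals, eigvecs = np.linalg.eigh(r)     # eigenvalues in ascending order
        lam, v = eigvals[-1], eigvecs[:, -1]     # leading eigenpair
        np.fill_diagonal(r, lam * v**2)          # impute a rank-one diagonal
    # Resolve the eigenvector's sign ambiguity: assume most verifiers
    # are better than chance, so the weights should sum to a positive value.
    if v.sum() < 0:
        v = -v
    return v


def combine(votes, weights):
    """Weighted vote over verifiers; returns an (n,) array in {-1, +1}."""
    return np.where(weights @ votes >= 0, 1, -1)


def simulate_votes(accuracies, n, rng):
    """Toy data: each verifier is independently correct with its own accuracy."""
    y = rng.choice([-1, 1], size=n)
    correct = rng.random((len(accuracies), n)) < np.asarray(accuracies)[:, None]
    return y, np.where(correct, y, -y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y, votes = simulate_votes([0.85, 0.7, 0.65, 0.6, 0.55], 5000, rng)
    w = spectral_weights(votes)
    print("weights:", np.round(w, 3))
    print("ensemble accuracy:", (combine(votes, w) == y).mean())
```

The true labels `y` are used only to score the result in the demo, never to fit the weights. When verifiers share failure modes, as LLM judges built on related models plausibly do, the rank-one assumption breaks and these weights can be misleading, which is the gap that controlling conditional dependencies is meant to address.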
What the paper shows.
FUSE demonstrates strong performance across multiple benchmarks including GPQA Diamond, Humanity's Last Exam, and IMO Shortlist questions. The method typically matches or improves upon semi-supervised baselines despite using zero labeled data. The approach proves effective in test-time scaling experiments with diverse combinations of generator models and verifiers, showing that unsupervised ensembling can be a practical alternative to methods requiring ground truth labels. Performance holds across both conventional academic benchmarks and frontier, unsaturated benchmarks where problems remain challenging even for state-of-the-art models.
The paper does not explicitly detail failure modes or conditions under which FUSE might underperform. Its reliance on spectral algorithms and on assumptions about conditional dependencies may limit effectiveness when verifiers have correlation structures that violate those assumptions. The method's performance likely depends on having access to multiple diverse verifiers, which may not always be available. Additionally, although the paper tests on several benchmarks, generalization to domains with fundamentally different verification characteristics remains unclear. The computational cost of the spectral algorithms and their scalability to very large numbers of verifiers are not discussed.
✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.