Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

Andrew Klearman; Radu Revutchi; Rohin Garg; Rishav Chakravarti; Samuel Marc Denton; Yuan Xue

✨ TL;DR

This paper proposes semantic stratification, a structured evaluation framework for retrieval systems that organizes documents into entity-based clusters and systematically generates queries to ensure comprehensive coverage. It addresses hidden biases in current heuristic evaluation approaches and provides formal guarantees for more trustworthy retrieval assessment.

01 · Problem

Current retrieval evaluation methods rely on heuristically constructed query sets that introduce hidden intrinsic biases, making it difficult to reliably assess retrieval quality, which is a critical bottleneck for retrieval-augmented generation (RAG) systems. The paper formalizes this as a statistical estimation problem, showing that metric reliability is fundamentally limited by how evaluation sets are constructed. This limitation undermines confidence in retrieval system comparisons and decision-making.

02 · Approach

The authors introduce semantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters. The method systematically generates queries for missing strata to ensure comprehensive coverage. This approach provides formal semantic coverage guarantees across different retrieval regimes and offers interpretable visibility into retrieval failure modes, moving beyond aggregate metrics to structured, coverage-based evaluation.
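As a rough illustration of the coverage idea (not the authors' implementation), the sketch below assumes every document has already been assigned to an entity-based cluster and that every evaluation query lists its relevant documents. The function name `find_missing_strata`, the input dictionaries, and the toy entity labels are all hypothetical. The summary does not describe how the paper builds clusters or generates queries for missing strata, so both steps are left out.

```python
def find_missing_strata(doc_cluster, query_relevant):
    """Return (coverage fraction, sorted list of strata with no query).

    doc_cluster: dict mapping doc id -> cluster (stratum) label.
    query_relevant: dict mapping query id -> iterable of relevant doc ids.
    Both input shapes are assumptions made for this sketch.
    """
    all_strata = set(doc_cluster.values())
    covered = set()
    for docs in query_relevant.values():
        for doc_id in docs:
            # Ignore relevance labels that point outside the clustered corpus.
            if doc_id in doc_cluster:
                covered.add(doc_cluster[doc_id])
    missing = sorted(all_strata - covered)
    coverage = len(covered) / len(all_strata) if all_strata else 0.0
    return coverage, missing


# Toy corpus: four documents in three entity clusters (hypothetical labels).
doc_cluster = {"d1": "aspirin", "d2": "aspirin", "d3": "ibuprofen", "d4": "warfarin"}
# Toy query set: no query has a relevant document in the "warfarin" cluster.
query_relevant = {"q1": ["d1"], "q2": ["d2", "d3"]}

coverage, missing = find_missing_strata(doc_cluster, query_relevant)
print(f"coverage={coverage:.2f}, missing={missing}")
```

In this toy setup the query set touches two of three strata, so any aggregate metric computed from it says nothing about retrieval quality for the uncovered cluster. That cluster is where new queries would need to be generated.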

03 · Key insights

What the paper shows.

01 Metric reliability in retrieval evaluation is fundamentally constrained by evaluation-set construction methodology, not just sample size

02 Organizing documents into entity-based semantic clusters enables systematic identification of coverage gaps in evaluation

03 Stratified evaluation reveals structural signals that explain variance in retrieval performance across different methods

04 Coverage-based evaluation provides more stable and transparent assessments than aggregate metrics for trustworthy system comparison
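To make the contrast between aggregate and stratified reporting concrete, here is a minimal sketch under stated assumptions: each query has a score (for example recall@k) tagged with a stratum label, and the number of documents per stratum is known. The size-weighted mean is the textbook stratified-sampling estimator, used here as a stand-in because the summary does not give the paper's estimator. The function `stratified_report` and the toy numbers are hypothetical.

```python
from collections import defaultdict


def stratified_report(scores, stratum_sizes):
    """Compare a plain average with a size-weighted stratified average.

    scores: list of (stratum label, per-query score) pairs.
    stratum_sizes: dict mapping stratum label -> number of documents in it.
    Both input shapes are assumptions made for this sketch.
    """
    by_stratum = defaultdict(list)
    for stratum, score in scores:
        by_stratum[stratum].append(score)

    # Mean score within each stratum that has at least one query.
    per_stratum = {s: sum(v) / len(v) for s, v in by_stratum.items()}

    # Plain average: every query counts equally, whatever its stratum.
    plain_mean = sum(score for _, score in scores) / len(scores)

    # Weighted average: each evaluated stratum counts by its corpus size.
    evaluated = sum(stratum_sizes[s] for s in per_stratum)
    weighted_mean = sum(per_stratum[s] * stratum_sizes[s] for s in per_stratum) / evaluated

    # Strata with no queries cannot be estimated at all; report them.
    uncovered = sorted(set(stratum_sizes) - set(per_stratum))
    return plain_mean, weighted_mean, per_stratum, uncovered


# Toy data: queries over-sample the small, easy stratum "A".
scores = [("A", 1.0), ("A", 1.0), ("A", 1.0), ("B", 0.0)]
stratum_sizes = {"A": 10, "B": 30, "C": 10}

plain_mean, weighted_mean, per_stratum, uncovered = stratified_report(scores, stratum_sizes)
print(plain_mean, weighted_mean, per_stratum, uncovered)
```

With these toy numbers the plain average is 0.75, while the size-weighted average over the evaluated strata is 0.25, and stratum "C" is flagged as uncovered. The gap comes entirely from how the query set was constructed, which is the kind of hidden bias the insights above describe.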

04 · Results

Experiments across multiple benchmarks and retrieval methods validate the semantic stratification framework. The results expose systematic coverage gaps in existing evaluations, identify structural signals explaining retrieval performance variance, and demonstrate that stratified evaluation yields more stable and transparent assessments. The framework supports more trustworthy decision-making compared to traditional aggregate metrics.

05 · Limitations

The paper does not explicitly discuss computational costs of semantic stratification or scalability to very large corpora. The reliance on entity-based clustering may have varying effectiveness across different domains or corpus types. The paper does not provide detailed analysis of how the approach performs with emerging retrieval methods or in cross-lingual or multilingual settings.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.
