Overcoming Selection Bias in Statistical Studies With Amortized Bayesian Inference

Jonas Arruda; Sophie Chervet; Paula Staudt; Andreas Wieser; Michael Hoelscher; Isabelle Sermet-Gaudelus; Nadine Binder; Lulla Opatowski; Jan Hasenauer

✨ TL;DR

This paper develops a simulation-based Bayesian inference method that corrects for selection bias by embedding the selection mechanism directly into the generative model, enabling accurate parameter estimation in complex models where traditional likelihood-based approaches fail. The approach allows both debiased estimation and explicit testing for bias presence without requiring tractable likelihoods.

01 · Problem

Selection bias occurs when the probability that an observation enters a dataset depends on variables related to the quantities being studied, which systematically distorts parameter estimates and uncertainty quantification. It is common in epidemiological studies and surveys, where individuals with certain characteristics are more likely to be sampled. Classical corrections such as inverse-probability weighting or explicit likelihood-based selection models require tractable likelihoods, which severely limits their use in complex models with latent dynamics or high-dimensional structure. Existing simulation-based inference methods enable Bayesian analysis without tractable likelihoods but typically assume data are missing at random, so they fail when selection depends on unobserved outcomes or covariates. The result is a critical gap: the complex stochastic models that would benefit most from simulation-based inference are precisely the settings where selection bias is hardest to address with traditional methods, leaving researchers without practical tools for obtaining unbiased estimates.
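To make the distortion concrete, here is a minimal sketch of outcome-dependent selection. It is not taken from the paper: the Normal(0, 1) population, the logistic inclusion rule, and the names `inclusion_probability`, `population`, and `observed` are all illustrative assumptions.

```python
import math
import random
import statistics

def inclusion_probability(y, slope=2.0):
    """Hypothetical selection rule: larger outcomes are more likely to be sampled."""
    return 1.0 / (1.0 + math.exp(-slope * y))

rng = random.Random(0)

# Full population: outcomes y ~ Normal(mu = 0, sd = 1) (assumed toy model).
population = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Observed dataset: each individual enters with a probability that depends on y.
observed = [y for y in population if rng.random() < inclusion_probability(y)]

population_mean = statistics.fmean(population)
observed_mean = statistics.fmean(observed)

print(f"population mean: {population_mean:.3f}")
print(f"observed mean:   {observed_mean:.3f}  (n = {len(observed)})")
```

Under these assumptions the mean of the observed subset sits well above the population mean, even though every individual was drawn from the same distribution. An analysis that treats `observed` as a random sample would therefore overestimate the mean.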

02 · Approach

The authors develop a bias-aware simulation-based inference framework that explicitly incorporates the selection mechanism into neural posterior estimation. The key innovation is embedding the selection process directly into the generative simulator rather than treating it as a separate correction step. By simulating both the underlying data generation process and the selection mechanism that determines which observations enter the dataset, the approach recasts selection bias correction as part of the simulation problem itself. This framework enables amortized Bayesian inference without requiring tractable likelihoods by training neural networks to approximate posterior distributions using simulated data that reflects the selection process. The method integrates diagnostic tools to detect discrepancies between simulated and observed data distributions and to assess posterior calibration. Importantly, the framework allows researchers to explicitly test for the presence of selection bias by comparing models with and without selection mechanisms, providing both debiased parameter estimates and evidence about whether bias correction is necessary.
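The idea of putting the selection step inside the simulator can be sketched on a toy problem. This sketch makes several assumptions that are not from the paper: the Normal(mu, 1) outcome model, the logistic selection rule, the Uniform(-1, 2) prior, and the helper names `simulate` and `rejection_posterior` are all hypothetical. Most importantly, it swaps the paper's amortized neural posterior estimator for plain rejection ABC on the sample mean, purely to keep the example dependency-free. Only the structure matters here: the same inference routine is run once with a simulator that includes selection and once with one that ignores it.

```python
import math
import random
import statistics

def simulate(mu, n, rng, with_selection):
    """Return n observed outcomes; y ~ Normal(mu, 1), optionally thinned by selection."""
    kept = []
    while len(kept) < n:
        y = rng.gauss(mu, 1.0)
        # Hypothetical selection rule: inclusion probability sigmoid(2 * y).
        p_include = 1.0 / (1.0 + math.exp(-2.0 * y)) if with_selection else 1.0
        if rng.random() < p_include:
            kept.append(y)
    return kept

def rejection_posterior(observed, with_selection, rng, n_draws=2000, eps=0.1):
    """Rejection ABC on the sample mean (a stand-in for a neural posterior estimator)."""
    target = statistics.fmean(observed)
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-1.0, 2.0)  # assumed Uniform(-1, 2) prior on mu
        fake = simulate(mu, len(observed), rng, with_selection)
        if abs(statistics.fmean(fake) - target) < eps:
            accepted.append(mu)
    return accepted

rng = random.Random(1)
true_mu = 0.5
observed = simulate(true_mu, 100, rng, with_selection=True)

# Same data, two simulators: one embeds the selection step, one ignores it.
aware = rejection_posterior(observed, with_selection=True, rng=rng)
naive = rejection_posterior(observed, with_selection=False, rng=rng)

print(f"true mu:              {true_mu}")
print(f"bias-aware posterior: {statistics.fmean(aware):.2f}")
print(f"naive posterior:      {statistics.fmean(naive):.2f}")
```

In this toy setting the naive posterior concentrates near the inflated mean of the selected sample, while the bias-aware posterior is pulled back toward lower values of `mu`. An amortized approach would replace the rejection loop with a network trained once on `(mu, simulated data)` pairs from the bias-aware simulator.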

03 · Key insights

What the paper shows.

01. Selection bias correction can be reframed as a simulation problem by embedding the selection mechanism directly into the generative model, making it compatible with simulation-based inference methods.

02. Amortized neural posterior estimation can handle selection bias without tractable likelihoods, extending Bayesian inference to complex models where traditional correction methods fail.

03. The framework enables explicit testing for selection bias presence, not just correction, allowing researchers to assess whether bias-aware modeling is necessary for their data.

04. Integrating calibration diagnostics into the inference pipeline ensures that posterior distributions remain well-calibrated even when correcting for complex selection mechanisms.
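The third insight, testing whether selection is present at all, can be pictured as a comparison between a simulator with a selection step and one without. This summary does not say how the paper carries out that comparison, so the sketch below uses rejection-ABC model choice as an assumed stand-in. The hard-threshold selection rule (only positive outcomes are recorded), the priors, and the names `simulate` and `summaries` are hypothetical.

```python
import random
import statistics

def simulate(mu, n, rng, with_selection):
    """Return n observed outcomes from y ~ Normal(mu, 1)."""
    kept = []
    while len(kept) < n:
        y = rng.gauss(mu, 1.0)
        # Hypothetical selection rule: only positive outcomes are ever recorded.
        if not with_selection or y > 0.0:
            kept.append(y)
    return kept

def summaries(sample):
    """Mean and standard deviation; this selection rule shrinks the spread below 1."""
    return statistics.fmean(sample), statistics.stdev(sample)

rng = random.Random(2)
observed = simulate(0.0, 200, rng, with_selection=True)
obs_mean, obs_sd = summaries(observed)

accepted_models = []
for _ in range(2000):
    with_selection = rng.random() < 0.5   # equal prior weight on both models
    mu = rng.uniform(-1.0, 2.0)           # same assumed prior on mu under both models
    sim_mean, sim_sd = summaries(simulate(mu, 200, rng, with_selection))
    if abs(sim_mean - obs_mean) < 0.1 and abs(sim_sd - obs_sd) < 0.1:
        accepted_models.append(with_selection)

# Share of accepted simulations that came from the selection model.
prob_selection = sum(accepted_models) / len(accepted_models)
print(f"accepted simulations: {len(accepted_models)}")
print(f"approx. P(selection model | data): {prob_selection:.2f}")
```

The mean alone cannot separate the two models, because a shift in `mu` mimics the effect of selection. The standard deviation can, since the no-selection simulator cannot reproduce the reduced spread of the truncated sample. This is why the choice of summaries (or a learned summary network) matters when testing for bias.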

04 · Results

The method recovered well-calibrated posterior distributions across three statistical applications with different selection mechanisms. It estimated parameters accurately in settings where likelihood-based approaches produced biased estimates. The integrated diagnostics detected discrepancies between simulated and observed data when selection bias was present and confirmed proper posterior calibration after correction. The simulation-based approach also handled selection mechanisms that depend on unobserved variables, a scenario where traditional methods typically fail, while maintaining computational efficiency through amortization.
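One common way to check posterior calibration is a rank-based test in the style of simulation-based calibration. This summary does not state which diagnostic the paper uses, so the following is only an assumed illustration. It uses a conjugate Normal model, where the exact posterior is known, and a deliberately shifted posterior as a stand-in for an estimator that ignores a bias. The function `sbc_ranks` and its parameters are hypothetical.

```python
import math
import random
import statistics

def sbc_ranks(posterior_shift, rng, n_rounds=1000, n_obs=10, n_post=19):
    """Rank of the true mu among posterior draws, repeated over simulated datasets."""
    ranks = []
    for _ in range(n_rounds):
        mu_true = rng.gauss(0.0, 1.0)  # prior: mu ~ Normal(0, 1)
        data = [rng.gauss(mu_true, 1.0) for _ in range(n_obs)]
        # Exact conjugate posterior: Normal(sum(data) / (n + 1), 1 / (n + 1)).
        post_mean = sum(data) / (n_obs + 1) + posterior_shift
        post_sd = math.sqrt(1.0 / (n_obs + 1))
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_post)]
        ranks.append(sum(d < mu_true for d in draws))  # integer in 0..n_post
    return ranks

rng = random.Random(3)
calibrated_ranks = sbc_ranks(posterior_shift=0.0, rng=rng)
shifted_ranks = sbc_ranks(posterior_shift=0.5, rng=rng)

# A calibrated posterior gives ranks that are uniform on 0..19 (mean 9.5).
print(f"mean rank, exact posterior:   {statistics.fmean(calibrated_ranks):.2f}")
print(f"mean rank, shifted posterior: {statistics.fmean(shifted_ranks):.2f}")
```

With the exact posterior the ranks spread evenly across 0 to 19. With the shifted posterior they pile up near zero, which is the kind of signature a calibration diagnostic would flag. In practice one would inspect the full rank histogram and not only its mean.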

05 · Limitations

The paper does not explicitly detail computational costs or scalability limits of training neural posterior estimators for high-dimensional parameter spaces or extremely complex selection mechanisms. The framework requires correctly specifying the selection mechanism in the simulator, and misspecification of this process could lead to incorrect inference, though the diagnostic tools may help detect such issues. The paper does not extensively discuss how the method performs when the selection mechanism itself has unknown parameters that must be inferred jointly with the parameters of interest. Additionally, while three applications are presented, the range of selection bias scenarios tested may not cover all possible real-world complexity, and the method's performance in settings with multiple interacting selection mechanisms or time-varying selection processes is not fully characterized.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
