ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification

Florian Kittler; Sheethal Bhat; Andreas Maier

✨ TL;DR

ProtoCLIP refines CLIP-style vision-language models for chest X-ray classification by using curated training data and prototype-aligned distillation to reduce co-occurrence bias and improve zero-shot performance. The method achieves 2-10 percentage point AUC improvements over baseline CLIP on unseen chest X-ray datasets without large-scale retraining.

01 · Problem

Zero-shot vision-language models like CLIP show promise for chest X-ray classification but suffer from three key limitations: confounding label co-occurrence (where certain pathologies frequently appear together, causing the model to confuse them), long-tail class imbalance (rare pathologies are underrepresented), and transfer instability under domain shift (performance degrades when applied to new datasets from different sources). These issues are particularly problematic in medical imaging where accurate discrimination between co-occurring pathologies is clinically critical, and models must generalize reliably to new hospital systems and imaging protocols.
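To make the co-occurrence problem concrete, the sketch below computes how often one finding appears given that another is present. The label matrix and the finding names are made-up toy values, not data from the paper, and `cooccurrence_rates` is a hypothetical helper name.

```python
import numpy as np

def cooccurrence_rates(labels: np.ndarray) -> np.ndarray:
    """Return a matrix R where R[i, j] estimates P(finding j | finding i).

    `labels` is a binary (num_studies, num_findings) multi-label matrix.
    """
    labels = labels.astype(float)
    joint = labels.T @ labels      # joint[i, j] = studies showing both i and j
    counts = np.diag(joint)        # counts[i]   = studies showing finding i
    # Guard against division by zero for findings that never occur.
    return joint / np.maximum(counts[:, None], 1.0)

# Toy labels (assumption): rows are studies, columns are three findings.
findings = ["finding_a", "finding_b", "finding_c"]
toy_labels = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])

rates = cooccurrence_rates(toy_labels)
for i, name_i in enumerate(findings):
    for j, name_j in enumerate(findings):
        if i != j:
            print(f"P({name_j} | {name_i}) = {rates[i, j]:.2f}")
```

A high off-diagonal rate marks a pair of findings that a zero-shot model could plausibly conflate, which is the kind of pair a refinement step would need to separate.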

02 · Approach

ProtoCLIP introduces a refinement strategy with two main components. First, it constructs pathology-focused training subsets with carefully curated negative samples to reduce co-occurrence bias, ensuring the model learns to distinguish between frequently co-occurring conditions. Second, it employs a representation-preserving distillation objective that uses prototype anchors to guide the adaptation process. This distillation approach stabilizes the model during fine-tuning while maintaining the semantic structure learned during pre-training and improving discrimination of clinically relevant co-occurring pathologies. The method is designed to work without requiring large-scale retraining of the base model.
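The first component can be illustrated with a minimal sketch of subset construction. The summary does not give the paper's actual selection rules, so everything here is an assumption: the record layout, the hypothetical function `build_pathology_subset`, and the rule of filling the negative set first with studies that show a frequently co-occurring finding but not the target.

```python
import random

def build_pathology_subset(records, target, confounders, n_neg, seed=0):
    """Split records into positives and curated negatives for one target finding.

    records:     list of dicts like {"id": str, "labels": set of finding names}
    target:      finding the subset focuses on
    confounders: findings that often co-occur with the target (assumed known)
    n_neg:       number of negatives to keep
    """
    rng = random.Random(seed)
    positives = [r for r in records if target in r["labels"]]
    negatives = [r for r in records if target not in r["labels"]]

    # "Hard" negatives show a confounder without the target, so the model
    # cannot treat the confounder as a shortcut for the target.
    hard = [r for r in negatives if r["labels"] & confounders]
    easy = [r for r in negatives if not (r["labels"] & confounders)]
    rng.shuffle(hard)
    rng.shuffle(easy)

    # Assumed policy: take hard negatives first, then pad with easy ones.
    return positives, (hard + easy)[:n_neg]

# Toy records (assumption), with placeholder finding names.
records = [
    {"id": "s1", "labels": {"finding_a", "finding_b"}},
    {"id": "s2", "labels": {"finding_a"}},
    {"id": "s3", "labels": {"finding_b"}},
    {"id": "s4", "labels": set()},
    {"id": "s5", "labels": {"finding_c"}},
]

pos, neg = build_pathology_subset(records, "finding_a", {"finding_b"}, n_neg=2)
print([r["id"] for r in pos], [r["id"] for r in neg])
```

Prioritizing confounder-only negatives is one plausible way to force discrimination between co-occurring findings; the paper may use a different curation scheme.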

03 · Key insights

What the paper shows.

01 · Curated negative sampling in training subsets can effectively reduce the co-occurrence bias that plagues medical image classification.

02 · Prototype-aligned distillation preserves semantic structure from pre-training while enabling targeted refinement for specific pathologies.

03 · Anchor-guided refinement provides a computationally efficient alternative to full-scale retraining for improving zero-shot medical VLM performance.

04 · Controlled adaptation through distillation objectives can stabilize transfer learning and prevent catastrophic forgetting of pre-trained knowledge.
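The distillation insights above can be sketched as a loss that keeps the refined model's view of a set of prototype anchors close to the frozen pre-trained model's view. This is an illustrative stand-in, not the paper's objective: the KL formulation, the temperature value, and the names `prototype_distillation_loss`, `student`, `teacher`, and `prototypes` are all assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def prototype_distillation_loss(student, teacher, prototypes, temperature=0.1):
    """Mean KL(teacher || student) over similarity distributions to prototypes.

    student, teacher: (batch, dim) image embeddings from the refined and the
                      frozen pre-trained encoder, respectively
    prototypes:       (num_prototypes, dim) fixed anchor embeddings
    """
    anchors = l2_normalize(prototypes)
    log_p_student = log_softmax(l2_normalize(student) @ anchors.T / temperature)
    log_p_teacher = log_softmax(l2_normalize(teacher) @ anchors.T / temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl.mean())

# Toy embeddings (assumption): random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
teacher_emb = rng.normal(size=(4, 8))
prototype_emb = rng.normal(size=(3, 8))
student_same = teacher_emb.copy()
student_drifted = teacher_emb + rng.normal(scale=0.5, size=teacher_emb.shape)

loss_same = prototype_distillation_loss(student_same, teacher_emb, prototype_emb)
loss_drifted = prototype_distillation_loss(student_drifted, teacher_emb, prototype_emb)
print(loss_same, loss_drifted)
```

The loss is zero when the student reproduces the teacher's similarity structure and grows as the student drifts, so adding it to a task loss penalizes forgetting while still allowing targeted changes.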

04 · Results

ProtoCLIP was evaluated on VinDr-CXR, an unseen chest X-ray dataset not used during training. The method achieved 2-10 percentage point improvements in AUC over a strong CLIP-based baseline across multiple pathological findings. For pneumothorax detection specifically, ProtoCLIP reached a state-of-the-art AUC of 0.94. These improvements demonstrate that the approach successfully addresses zero-shot transfer failures without requiring expensive large-scale retraining, showing particular strength in handling the challenging co-occurrence patterns common in chest radiography.
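For readers less familiar with the metric, the sketch below computes AUC for a single finding and expresses the difference between two models in percentage points. The labels and scores are made-up toy values, not results from the paper, and `roc_auc` is a small hypothetical helper rather than a library function.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (assumption): five studies scored by two models for one finding.
labels = [1, 0, 1, 0, 0]
baseline_scores = [0.6, 0.7, 0.8, 0.2, 0.3]
refined_scores = [0.9, 0.4, 0.8, 0.2, 0.3]

auc_baseline = roc_auc(labels, baseline_scores)
auc_refined = roc_auc(labels, refined_scores)
gain_points = 100 * (auc_refined - auc_baseline)
print(f"baseline={auc_baseline:.3f} refined={auc_refined:.3f} gain={gain_points:.1f} points")
```

A gain reported in percentage points is an absolute difference on the 0 to 1 AUC scale multiplied by 100, so moving from 0.80 to 0.85 is a 5-point improvement.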

05 · Limitations

The abstract does not state limitations explicitly. Implicit limitations include the following: the method was evaluated on a single unseen dataset (VinDr-CXR), so generalization to other medical imaging domains or modalities remains unclear; constructing the curated training subsets likely requires domain expertise; and although the method avoids large-scale retraining, its refinement phase still adds computational cost relative to pure zero-shot inference. How far the improvements extend to chest X-ray findings beyond those tested is also not characterized.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
