✨ TL;DR
This paper proposes CmIR, a causal inference framework that separates multimodal data into stable causal features and spurious environment-specific features to improve robustness in affective computing. The method achieves state-of-the-art performance, especially on out-of-distribution and noisy data.
Current multimodal affective computing models, which predict human sentiment, emotion, and intention from language, acoustic, and visual inputs, tend to learn spurious correlations. These correlations hurt generalization under distribution shift or when a modality is noisy, because the models fail to distinguish stable causal relationships that transfer across environments from unreliable environment-specific patterns.
The paper introduces CmIR (Causal modality-Invariant Representation), a framework that disentangles each modality into two components: causal invariant representations that maintain stable predictive relationships with labels across environments, and environment-specific spurious representations. The method employs three key constraints: an invariance constraint to ensure stability across environments, a mutual information constraint to preserve relevant information, and a reconstruction constraint to retain sufficient information from raw inputs. This disentanglement is grounded in causal inference theory.
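One way to read the three constraints is as penalty terms added to the usual prediction loss. The sketch below is illustrative only: the summary does not give the paper's actual objective, so the additive form, the weights \lambda_1, \lambda_2, \lambda_3, the encoders E_m^c and E_m^s, the decoder D_m, and the squared-error reconstruction term are all assumptions introduced here.

```latex
% Assumed notation (not from the paper):
%   x_m   : raw input of modality m (language, acoustic, visual)
%   z_m^c : causal invariant representation, z_m^c = E_m^c(x_m)
%   z_m^s : environment-specific spurious representation, z_m^s = E_m^s(x_m)
%   D_m   : decoder that rebuilds x_m from both parts
\[
\mathcal{L}
  = \mathcal{L}_{\text{task}}\bigl(y, \hat{y}\bigr)
  + \lambda_1 \, \mathcal{L}_{\text{inv}}
  + \lambda_2 \, \mathcal{L}_{\text{MI}}
  + \lambda_3 \, \mathcal{L}_{\text{rec}},
\qquad
\mathcal{L}_{\text{rec}}
  = \sum_{m} \bigl\lVert x_m - D_m\bigl(z_m^c, z_m^s\bigr) \bigr\rVert_2^2 .
\]
```

Under this reading, \mathcal{L}_{\text{inv}} penalizes changes across environments in how z_m^c relates to the label, \mathcal{L}_{\text{MI}} stands for the mutual information constraint, and \mathcal{L}_{\text{rec}} keeps the two parts jointly sufficient to recover the raw input. The exact form of each term should be taken from the PDF.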
What the paper shows.
CmIR reports state-of-the-art performance on multiple multimodal affective computing benchmarks. Its advantage is most pronounced on out-of-distribution and noisy data, which the authors take as evidence of better robustness and generalization than existing approaches. These are the conditions under which spurious correlations typically cause failures, so the results are consistent with the causal disentanglement working as intended.
The paper does not explicitly discuss limitations in the abstract. Potential implicit limitations include the computational overhead of learning disentangled representations with multiple constraints, the requirement for environment labels or proxies to learn invariant features, and possible challenges in defining what constitutes different environments in real-world deployment scenarios. The effectiveness may also depend on the quality of the disentanglement and whether true causal features can be reliably separated from spurious ones in practice.
✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.