Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

Sijie Mai; Shiqin Han

✨ TL;DR

This paper proposes CmIR, a causal inference framework that separates multimodal data into stable causal features and spurious environment-specific features to improve robustness in affective computing. The method achieves state-of-the-art performance, especially on out-of-distribution and noisy data.

01 · Problem

Current multimodal affective computing models, which predict human sentiment, emotion, and intention from language, acoustic, and visual inputs, tend to learn spurious correlations. These correlations hurt generalization under distribution shifts or noisy modalities, because the models fail to distinguish stable causal relationships that transfer across environments from environment-specific patterns that do not.
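To make the failure mode concrete, here is a toy sketch that is not taken from the paper. It assumes two hypothetical binary features: a "causal" one that agrees with the label 80% of the time in every environment, and a "spurious" one whose agreement rate changes between environments. All names and rates (`make_env`, `spurious_agreement`, 0.95, 0.10) are illustrative assumptions.

```python
import random

def make_env(n, spurious_agreement, seed):
    """Sample (causal, spurious, label) triples for one hypothetical environment.

    The causal feature matches the label 80% of the time in every environment.
    The spurious feature matches it with probability `spurious_agreement`,
    which is allowed to differ between environments.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        causal = label if rng.random() < 0.8 else 1 - label
        spurious = label if rng.random() < spurious_agreement else 1 - label
        data.append((causal, spurious, label))
    return data

def accuracy(data, feature_index):
    """Accuracy of the trivial classifier that outputs the chosen feature."""
    hits = sum(1 for row in data if row[feature_index] == row[2])
    return hits / len(data)

# Assumed setup: the shortcut is very reliable in training, then reverses.
train_env = make_env(5000, spurious_agreement=0.95, seed=0)
shifted_env = make_env(5000, spurious_agreement=0.10, seed=1)

for name, env in [("train", train_env), ("shifted", shifted_env)]:
    print(name,
          "causal:", round(accuracy(env, 0), 3),
          "spurious:", round(accuracy(env, 1), 3))
```

In this toy setup the spurious feature looks better than the causal one in the training environment, so a model that minimizes training error alone has an incentive to rely on it, and that reliance is what breaks once the environment shifts.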

02 · Approach

The paper introduces CmIR (Causal modality-Invariant Representation), a framework that disentangles each modality into two components: causal invariant representations that maintain stable predictive relationships with labels across environments, and environment-specific spurious representations. The method employs three key constraints: an invariance constraint to ensure stability across environments, a mutual information constraint to preserve relevant information, and a reconstruction constraint to retain sufficient information from raw inputs. This disentanglement is grounded in causal inference theory.
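The summary does not give the exact form of the three constraints, so the following is only a sketch of how such terms might be combined into one objective. Every choice here is an assumption rather than the paper's formulation: the invariance term is stood in for by the variance of per-environment task risks, the mutual information term by a plug-in estimate between discretized invariant codes and labels (maximized, hence the minus sign), and the reconstruction term by mean squared error. The function names and weights `w_inv`, `w_mi`, `w_rec` are hypothetical.

```python
import math
from collections import Counter

def risk_variance(env_risks):
    """Invariance stand-in: variance of the task risk across environments."""
    mean = sum(env_risks) / len(env_risks)
    return sum((r - mean) ** 2 for r in env_risks) / len(env_risks)

def discrete_mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def mse(original, reconstructed):
    """Reconstruction stand-in: mean squared error between two sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def total_objective(env_risks, codes, labels, original, reconstructed,
                    w_inv=1.0, w_mi=1.0, w_rec=1.0):
    """Assumed combination: task risk + invariance - MI + reconstruction."""
    task = sum(env_risks) / len(env_risks)
    return (task
            + w_inv * risk_variance(env_risks)
            - w_mi * discrete_mutual_information(codes, labels)
            + w_rec * mse(original, reconstructed))

# Hypothetical numbers: two environments, four samples, a two-dimensional input.
value = total_objective(
    env_risks=[0.30, 0.34],
    codes=[0, 0, 1, 1],
    labels=[0, 0, 1, 1],
    original=[1.0, 2.0],
    reconstructed=[1.1, 1.9],
)
print(value)
```

The plug-in estimator only works on discrete values and is not differentiable, so a trainable model would need a differentiable surrogate for that term. The sketch is meant to show how the three pressures trade off against the task loss, not how CmIR implements them.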

03 · Key insights

What the paper shows.

01 Spurious correlations in multimodal learning can be addressed by explicitly separating causal invariant features from environment-specific spurious features using causal inference principles.

02 Invariant representations should maintain stable predictive relationships with labels across different environments while preserving sufficient information from original inputs.

03 The combination of invariance, mutual information, and reconstruction constraints enables effective disentanglement without losing critical information.

04 Robustness to distribution shifts and noisy modalities can be achieved by focusing predictions on causal invariant representations rather than spurious correlations.

04 · Results

CmIR is reported to achieve state-of-the-art performance on multiple multimodal affective computing benchmarks. Its performance is particularly strong on out-of-distribution and noisy data, which the authors take as evidence of better robustness and generalizability than existing approaches. These results support the claim that causal disentanglement helps in the conditions where spurious correlations would typically cause failures.

05 · Limitations

The abstract does not explicitly discuss limitations. Likely implicit ones include the computational overhead of learning disentangled representations under multiple constraints, the need for environment labels or proxies to learn invariant features, and the difficulty of defining what counts as a distinct environment in real-world deployment. Effectiveness may also depend on the quality of the disentanglement, that is, on whether true causal features can be reliably separated from spurious ones in practice.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
