Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

Samuel Salfati

✨ TL;DR

This paper systematically studies transformer compression across GPT-2 and Mistral 7B and identifies five structural properties that explain why certain compression techniques fail. In particular, it shows that high activation variance does not indicate predictive importance and that block-level approximations suffer from distribution shift. The findings suggest fundamental limits to static compression and motivate adaptive, per-token approaches instead.

01 · Problem

Transformer models are increasingly large and expensive to deploy, motivating compression techniques. However, existing compression methods often fail in practice, and it remains unclear why certain structural properties of transformers make them resistant to compression. Understanding these fundamental properties is necessary to develop more effective compression strategies.

02 · Approach

The authors conduct over 40 systematic experiments on GPT-2 (124M) and Mistral 7B (7.24B) examining multiple compression techniques: spectral compression, block-level function replacement, rotation-based quantization, activation geometry analysis, and adaptive early exit. They measure properties using canonical correlation analysis (CCA) to relate variance to predictive importance, linear regression (R²) to assess block linearity, and KL divergence to identify computationally easy tokens.
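As an illustration of the R² measurement described above, the following is a minimal sketch on synthetic data, not the authors' code. The function name `linear_fit_r2`, the toy "blocks", and the in-sample evaluation are all assumptions made for the example; the paper may fit and evaluate on different splits of real activations.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Fit y ~ x @ W + b by least squares and return the pooled R^2 of the fit."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    residual = y - x_aug @ coef
    ss_res = np.sum(residual ** 2)                    # unexplained variation
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2)        # total variation
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 16))        # stand-in for a block's input activations
w = rng.normal(size=(16, 16))          # hypothetical weights of a toy "block"

y_linear = x @ w                       # a perfectly linear toy block
y_nonlinear = np.tanh(3.0 * (x @ w))   # a strongly saturating toy block

r2_linear = linear_fit_r2(x, y_linear)
r2_nonlinear = linear_fit_r2(x, y_nonlinear)
print(f"linear block R^2:    {r2_linear:.3f}")
print(f"nonlinear block R^2: {r2_nonlinear:.3f}")
```

In a real measurement, `x` and `y` would be the recorded inputs and outputs of one transformer block over a set of tokens, and a high R² would indicate that a single affine map reproduces most of what the block computes on that input distribution.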

03 · Key insights

What the paper shows.

01 High-variance activation directions are approximately 96% uncorrelated with predictive directions, and projecting onto high-variance subspaces preserves 90% of variance while significantly degrading perplexity, demonstrating that variance is not a proxy for importance.

02 Transformer blocks exhibit high linearity (R² ~0.95 for GPT-2, 0.93 for Mistral block 31) only under the correct upstream distribution; modifying earlier blocks causes distribution shift that breaks downstream linear approximations.

03 Weight factorization and quantization approaches amplify errors through cross-terms, making direct quantization strictly superior to reconstruction-based methods.

04 Linearity increases substantially with depth in Mistral 7B (R² from 0.17 at block 0 to 0.93 at block 31), suggesting a division between nonlinear early feature construction and linear later refinement.
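The first insight, that a subspace can retain most of the variance and still lose the predictive signal, can be reproduced on a toy problem. The sketch below is an assumption-laden illustration, not the paper's experiment: the dimensions, the scales, and the choice of a target that depends on a single low-variance latent direction are all hypothetical, and PCA plus a linear probe stands in for the paper's CCA and perplexity measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 32, 4                  # samples, hidden size, subspace size (all hypothetical)

# Latent activations: k directions with large scale, the rest with unit scale.
scales = np.ones(d)
scales[:k] = 10.0
latent = rng.normal(size=(n, d)) * scales
target = latent[:, -1]                 # the "prediction" depends only on a low-variance direction

# Rotate so the high-variance directions are not axis-aligned.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
h = latent @ q

# PCA via SVD of the centered activations.
h_centered = h - h.mean(axis=0)
_, s, vt = np.linalg.svd(h_centered, full_matrices=False)
variance_kept = (s[:k] ** 2).sum() / (s ** 2).sum()
coords = h_centered @ vt[:k].T         # coordinates in the top-k variance subspace

def r2(features, y):
    """R^2 of a least-squares linear probe from features to y."""
    a = np.hstack([features, np.ones((len(features), 1))])
    coef, *_ = np.linalg.lstsq(a, y, rcond=None)
    res = y - a @ coef
    centered = y - y.mean()
    return 1.0 - (res @ res) / (centered @ centered)

r2_full = r2(h_centered, target)       # probe on all directions
r2_proj = r2(coords, target)           # probe on the high-variance subspace only
print(f"variance kept: {variance_kept:.2%}")
print(f"probe R^2, full space:      {r2_full:.3f}")
print(f"probe R^2, top-k subspace:  {r2_proj:.3f}")
```

Here the top-k subspace keeps more than 90% of the variance while the probe on it recovers almost none of the target, which is the qualitative pattern the insight describes.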

04 · Results

Single-block linear replacement of Mistral 7B's final block achieves 34x compression with a perplexity increase of only 1.71. However, multi-block replacement fails due to residual error accumulation and distribution shift. The analysis also finds that approximately 30% of tokens are computationally easy (confirmed via exit heads and KL divergence sensitivity), suggesting potential for adaptive computation.
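One way to make "computationally easy" concrete is to compare the next-token distribution from an early exit head with the distribution from the full model and flag tokens where the KL divergence is small. The sketch below assumes that reading; the function names, the threshold of 0.05, and the toy logits are hypothetical, and the paper's exact criterion may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def easy_token_mask(full_logits, early_logits, threshold=0.05):
    """Mark a token as easy when the early-exit prediction is close to the full model's."""
    return [
        kl_divergence(softmax(full), softmax(early)) < threshold
        for full, early in zip(full_logits, early_logits)
    ]

# Toy logits over a 3-word vocabulary for three token positions (hypothetical values).
full_logits = [[5.0, 0.0, 0.0], [2.0, 1.5, 0.0], [0.0, 4.0, 0.0]]
early_logits = [[4.8, 0.1, 0.0], [0.0, 2.5, 1.0], [0.1, 3.9, 0.0]]

mask = easy_token_mask(full_logits, early_logits)
print(mask)  # [True, False, True]
```

The fraction of `True` entries over a large evaluation set would then estimate the share of easy tokens; that share depends directly on the chosen threshold.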

05 · Limitations

The study is limited to two model families (GPT-2 and Mistral) and does not explore dynamic or training-time compression methods. The linear approximation analysis assumes specific functional forms and may not capture all nonlinear behaviors. The findings on block linearity are specific to the tested architectures and may not generalize to other transformer variants or attention mechanisms. The paper does not provide practical deployment strategies for the proposed adaptive per-token computation approach.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.
