✨ TL;DR
This paper systematically studies transformer compression on GPT-2 and Mistral 7B and identifies five structural properties that explain why certain compression techniques fail. In particular, it shows that high-variance activations are not predictively important and that block-level approximations suffer from distribution shift. The findings suggest fundamental limits to static compression and motivate adaptive, per-token approaches instead.
Transformer models are increasingly large and expensive to deploy, motivating compression techniques. However, existing compression methods often fail in practice, and it remains unclear why certain structural properties of transformers make them resistant to compression. Understanding these fundamental properties is necessary to develop more effective compression strategies.
The authors conduct over 40 systematic experiments on GPT-2 (124M) and Mistral 7B (7.24B) examining multiple compression techniques: spectral compression, block-level function replacement, rotation-based quantization, activation geometry analysis, and adaptive early exit. They measure properties using canonical correlation analysis (CCA) to relate variance to predictive importance, linear regression (R²) to assess block linearity, and KL divergence to identify computationally easy tokens.
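As a rough illustration of the R² linearity measurement mentioned above, the sketch below fits an affine map from a block's inputs to its outputs by least squares and reports the fraction of output variance it explains. The function name `block_linearity_r2`, the synthetic data, and the in-sample evaluation are assumptions made for this example, not the paper's actual protocol. A real measurement would use hidden states captured from the model and most likely a held-out split.

```python
import numpy as np

def block_linearity_r2(x, y):
    """Fit y ~ x @ W + b by least squares and return the pooled R^2.

    x: (n_tokens, d_in) block inputs; y: (n_tokens, d_out) block outputs.
    Hypothetical helper for illustration only.
    """
    # Append a column of ones so the fit includes a bias term.
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    w, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    residual = y - x_aug @ w
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins for block inputs and outputs (not real activations).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))
w_true = rng.normal(size=(16, 16))
y_linear = x @ w_true + 0.5                # exactly affine "block"
y_nonlinear = np.tanh(3.0 * (x @ w_true))  # strongly saturating "block"

r2_linear = block_linearity_r2(x, y_linear)
r2_nonlinear = block_linearity_r2(x, y_nonlinear)
```

An R² close to 1 suggests the block behaves almost linearly on the sampled inputs. A clearly lower value indicates nonlinear behavior that a single linear replacement would miss.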
What the paper shows.
Single-block linear replacement of Mistral 7B's final block achieves 34x compression at the cost of a perplexity increase of only 1.71. However, multi-block replacement fails because residual errors accumulate across blocks and each replaced block receives inputs that have shifted away from the distribution it was fitted on. The analysis also reveals that approximately 30% of tokens are computationally easy (confirmed via exit heads and KL divergence sensitivity), suggesting potential for adaptive computation.
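One way to make the notion of a "computationally easy" token concrete is to compare the next-token distribution from an intermediate exit head with the full model's distribution and flag tokens where the KL divergence is small. The sketch below does this. The helper names, the direction of the KL divergence, and the threshold of 0.1 nats are assumptions chosen for illustration, and the paper's exact criterion may differ.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_per_token(final_logits, early_logits):
    """KL(final || early) for each token position, in nats."""
    log_p = log_softmax(final_logits)
    log_q = log_softmax(early_logits)
    return np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)

def easy_token_mask(final_logits, early_logits, threshold=0.1):
    # The threshold is a hypothetical value, not taken from the paper.
    return kl_per_token(final_logits, early_logits) < threshold

# Toy logits over a 3-word vocabulary for two token positions.
final_logits = np.array([[2.0, 0.0, -1.0],
                         [2.0, 0.0, -1.0]])
early_logits = np.array([[2.0, 0.0, -1.0],   # early exit agrees with the full model
                         [-1.0, 0.0, 2.0]])  # early exit disagrees

kl = kl_per_token(final_logits, early_logits)
mask = easy_token_mask(final_logits, early_logits)
easy_fraction = mask.mean()
```

On real data, `easy_fraction` computed over a corpus is the quantity that would be compared against the roughly 30% figure reported above.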
The study is limited to two model families (GPT-2 and Mistral) and does not explore dynamic or training-time compression methods. The linear approximation analysis assumes specific functional forms and may not capture all nonlinear behaviors. The findings on block linearity are specific to the tested architectures and may not generalize to other transformer variants or attention mechanisms. The paper does not provide practical deployment strategies for the proposed adaptive per-token computation approach.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.