Random Matrix Theory of Early-Stopped Gradient Flow: A Transient BBP Scenario

Florentin Coeurdoux; Grégoire Ferré; Jean-Philippe Bouchaud

✨ TL;DR

This paper develops a random matrix theory model that explains why neural networks exhibit a transient learning window where signal is detectable before overfitting occurs. The key mechanism is that anisotropy in input data creates fast and slow learning directions, causing a learnable eigenvalue to temporarily separate from noise before being reabsorbed.

01 · Problem

Empirical observations of neural network training reveal a puzzling transient regime: there is a finite time window during gradient descent in which the model captures the signal, but the signal later disappears as overfitting takes over. In practice this is handled by early stopping, yet a theoretical account has been lacking. The challenge is to explain why and when this transient learning window occurs, and what controls its existence and duration. Doing so requires analyzing the time-dependent spectral properties of the learned weight matrices in the presence of noise and structured input data.

02 · Approach

The authors construct an analytically tractable random matrix model of gradient flow in a linear teacher-student setting with anisotropic input covariance. They model the input covariance as a two-block structure that creates fast and slow learning directions. The analysis focuses on the time-dependent bulk spectrum of the symmetrized weight matrix, which they derive through a 2×2 Dyson equation. For a rank-one teacher signal, they obtain an explicit outlier condition using a rank-two determinant formula. This framework allows them to track when an isolated eigenvalue (representing learned signal) separates from the noisy bulk spectrum and when it gets reabsorbed, characterizing a time-dependent Baik-Ben Arous-Péché (BBP) phase transition.
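The setting described above can be explored numerically with a toy simulation. The sketch below is not the authors' exact model: the squared loss, the zero initialization, the noise scaling, the symmetric rank-one teacher, and all names and parameter values (`gradient_flow_weights`, `simulate`, `theta`, `var_fast`, `var_slow`) are assumptions made for illustration. It uses the closed-form solution of linear gradient flow, W(t) = C^{-1}(I - e^{-Ct}) X^T Y / n with C = X^T X / n, and reports the two largest eigenvalues of the symmetrized weight matrix at several times.

```python
import numpy as np


def gradient_flow_weights(X, Y, t):
    """Gradient-flow solution W(t) for the loss ||Y - X W||_F^2 / (2n), W(0) = 0."""
    n = X.shape[0]
    C = X.T @ X / n                      # empirical input covariance
    evals, evecs = np.linalg.eigh(C)
    # Spectral filter (1 - exp(-lambda t)) / lambda, with limit t as lambda -> 0.
    safe = np.maximum(evals, 1e-12)
    filt = np.where(evals > 1e-12, -np.expm1(-evals * t) / safe, t)
    return (evecs * filt) @ evecs.T @ (X.T @ Y / n)


def simulate(d=200, n=400, theta=2.0, var_fast=4.0, var_slow=0.25,
             noise=1.0, times=(0.1, 1.0, 10.0, 100.0), seed=0):
    """Return {t: (largest, second-largest eigenvalue)} of (W(t) + W(t)^T) / 2."""
    rng = np.random.default_rng(seed)
    # Two-block diagonal covariance (assumed form): a fast and a slow half.
    variances = np.concatenate([np.full(d // 2, var_fast),
                                np.full(d - d // 2, var_slow)])
    X = rng.standard_normal((n, d)) * np.sqrt(variances)
    # Rank-one teacher (assumed symmetric for simplicity).
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    W_star = theta * np.outer(u, u)
    Y = X @ W_star + noise * rng.standard_normal((n, d))
    result = {}
    for t in times:
        W = gradient_flow_weights(X, Y, t)
        eigs = np.linalg.eigvalsh((W + W.T) / 2)   # ascending order
        result[t] = (eigs[-1], eigs[-2])
    return result


if __name__ == "__main__":
    for t, (top, second) in simulate().items():
        print(f"t={t:g}  top={top:.3f}  second={second:.3f}")
```

The gap between the two reported eigenvalues is only a crude finite-size proxy for an outlier detaching from the bulk; whether a transient window appears depends on the chosen parameters.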

03 · Key insights

What the paper shows.

01 Anisotropy in input covariance is the key ingredient that enables transient learning by creating fast and slow directions in gradient dynamics.

02 The learning signal manifests as an isolated eigenvalue that can temporarily separate from the noisy bulk spectrum before being reabsorbed in the overfitting regime.

03 Three distinct phases emerge depending on signal strength and covariance anisotropy: no emergence, persistent emergence, or transient emergence of the teacher spike.

04 Early stopping can be understood as a spectral phenomenon where optimal generalization occurs during the transient window when signal eigenvalues are maximally separated from noise.
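For orientation, the static textbook BBP transition for a rank-one perturbation of a Wigner matrix is recalled below. This is the standard time-independent result, not the paper's time-dependent outlier condition; the symbols $M$, $\theta$, $\sigma$, and $N$ are notation introduced here for illustration.

```latex
% Spiked Wigner model: W is an N x N symmetric Gaussian matrix whose
% off-diagonal entries have variance \sigma^2 / N, so its bulk spectrum
% fills [-2\sigma, 2\sigma] as N \to \infty; u is a unit vector.
\[
  M = \theta\, u u^{\top} + W, \qquad \theta > 0 .
\]
% The largest eigenvalue detaches from the bulk only above the threshold
% \theta = \sigma:
\[
  \lambda_{\max}(M) \xrightarrow[N \to \infty]{}
  \begin{cases}
    \theta + \dfrac{\sigma^{2}}{\theta}, & \theta > \sigma, \\[6pt]
    2\sigma, & \theta \le \sigma .
  \end{cases}
\]
```

In the transient scenario summarized above, the effective signal and noise levels both evolve during training, so the analogue of the condition $\theta > \sigma$ may hold only during a finite time window.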

04 · Results

The theory produces complete phase diagrams mapping the three regimes (no spike, persistent spike, transient spike) as functions of signal strength and covariance anisotropy. The 2×2 Dyson equation predicts the full time evolution of the bulk spectrum, and the rank-two determinant formula identifies when outlier eigenvalues emerge and disappear. Finite-size numerical simulations agree with these predictions, supporting the random matrix description of the transient BBP transition. The model reproduces the empirically observed early-stopping window as a consequence of the interplay between anisotropic learning dynamics and noise.

05 · Limitations

The analysis is restricted to a linear teacher-student setting with gradient flow (continuous-time limit), which may not fully capture the complexity of discrete gradient descent in nonlinear deep networks. The two-block covariance model, while analytically tractable, represents a simplified form of anisotropy compared to real-world data distributions. The framework assumes a specific random matrix ensemble and may not generalize to all forms of structured data or network architectures. Additionally, the study focuses on the symmetrized weight matrix spectrum, which may not directly correspond to all relevant learning metrics in practical settings. The paper does not address how these insights extend to stochastic gradient descent with mini-batches or adaptive learning rates commonly used in practice.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
