✨ TL;DR
This paper uses a novel functional called edge coupling to explain why full-batch gradient descent on neural networks drives the largest Hessian eigenvalue toward the threshold 2/η, where η is the learning rate (the Edge of Stability phenomenon). The work provides a unified theoretical explanation for this self-regulating behavior, which has so far lacked a complete account.
The Edge of Stability is an empirically observed phenomenon where full-batch gradient descent on neural networks causes the largest Hessian eigenvalue to converge to 2/η, where η is the learning rate. While previous work has established that the system exhibits self-regulation near this edge, there has been no unified explanation for why the trajectory is forced toward this specific threshold from arbitrary initialization. This gap in understanding limits our ability to predict and control neural network training dynamics.
The authors introduce edge coupling, a functional defined on consecutive iterate pairs whose coefficient is uniquely determined by the gradient descent update rule. By analyzing the criticality conditions of this functional, they derive a step recurrence relation with stability boundary at 2/η and a loss-change formula whose telescoping sum forces curvature toward 2/η. The key insight is using the mean value theorem to localize the different Hessian averages appearing in these formulas to the true Hessian at interior points, eliminating gaps in the forcing argument. The framework also classifies fixed points and period-two orbits by setting both gradients of the edge coupling to zero.
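To see why 2/η is the natural boundary for a step recurrence, it may help to look at the simplest case, where the Hessian is constant. The sketch below is not the paper's edge-coupling construction. It assumes a 1-D quadratic loss L(x) = λx²/2, and the function name `gd_on_quadratic` is hypothetical. For this loss the step d_t = x_{t+1} - x_t satisfies d_{t+1} = (1 - ηλ) d_t, and the per-step loss change is ΔL = -η g² (1 - ηλ/2), so steps shrink and the loss falls only when λ < 2/η.

```python
def gd_on_quadratic(lam, eta, x0=1.0, n_steps=20):
    """Run GD on the assumed toy loss L(x) = lam * x**2 / 2.

    Returns the list of iterates and the list of per-step loss changes.
    """
    def loss(x):
        return 0.5 * lam * x * x

    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(x - eta * lam * x)  # gradient of the toy loss is lam * x
    deltas = [loss(b) - loss(a) for a, b in zip(xs, xs[1:])]
    return xs, deltas


if __name__ == "__main__":
    eta = 0.1  # stability threshold 2 / eta = 20
    for lam in (19.0, 21.0):  # curvature just below and just above the threshold
        xs, deltas = gd_on_quadratic(lam, eta)
        steps = [b - a for a, b in zip(xs, xs[1:])]
        # Ratio of consecutive steps equals 1 - eta * lam for a quadratic.
        print(f"lam={lam}: step ratio={steps[1] / steps[0]:.3f}, "
              f"|x_final|={abs(xs[-1]):.3g}, first loss change={deltas[0]:.3g}")
```

Below the threshold the step ratio has magnitude less than one and every loss change is negative; above it the iterates oscillate with growing amplitude and the loss rises. In a neural network the Hessian varies along the trajectory, which is where the Hessian averages and the mean value theorem argument described above come in.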
What the paper shows.
The paper establishes that the step recurrence derived from edge coupling criticality has a stability boundary at exactly 2/η, and the loss-change formula's telescoping sum forces the largest Hessian eigenvalue toward this threshold. The analysis of fixed points and period-two orbits reveals which directions support oscillatory behavior and on which side of the critical learning rate they appear, providing a complete characterization of the Edge of Stability phenomenon.
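A toy example can make the "which side of the critical learning rate" question concrete. The sketch below is an illustration under stated assumptions, not the paper's classification. It uses a hypothetical 1-D loss L(x) = a·x²/2 + b·x⁴/4 and hypothetical helper names. A symmetric period-two orbit x* → -x* → x* of gradient descent requires η(a + b·x*²) = 2, and the derivative of the two-step map at the orbit works out to (2ηa - 5)².

```python
import math


def gd_map(x, eta, a, b):
    """One GD step on the assumed toy loss L(x) = a*x**2/2 + b*x**4/4."""
    return x - eta * (a * x + b * x ** 3)


def period_two_amplitude(eta, a, b):
    """Amplitude of the symmetric orbit x -> -x -> x, or None if none exists.

    The orbit condition eta * (a + b * x**2) == 2 gives x**2 = (2/eta - a) / b.
    """
    x_sq = (2.0 / eta - a) / b
    return math.sqrt(x_sq) if x_sq > 0 else None


def orbit_multiplier(eta, a):
    """Derivative of the two-step map at the orbit; stable if below 1."""
    return (2.0 * eta * a - 5.0) ** 2


if __name__ == "__main__":
    a = 1.0  # curvature at the minimum, so the critical learning rate is 2 / a = 2
    for eta, b in ((2.2, -1.0), (1.8, 1.0)):
        x_star = period_two_amplitude(eta, a, b)
        print(f"eta={eta}, b={b}: x*={x_star:.4f}, "
              f"next iterate={gd_map(x_star, eta, a, b):.4f}, "
              f"multiplier={orbit_multiplier(eta, a):.2f}")
```

In this toy model, when curvature decreases away from the minimum (b < 0) the orbit appears above the critical learning rate 2/a and is stable for 2 < ηa < 3; when curvature increases away from the minimum (b > 0) the orbit appears below 2/a and is unstable.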
The analysis focuses on full-batch gradient descent on neural networks; applicability to stochastic variants or other optimization algorithms is not addressed. The paper does not discuss computational aspects of verifying the theoretical predictions empirically or provide extensive numerical validation. The framework's extension to practical scenarios with regularization, batch normalization, or other common training techniques is not explored.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.