On the Edge of Stability
Exploring the balance between sharpness and stability with modern optimizers to build more efficient, predictable, and high-performing deep learning models.
Sharpness & Stability
When training a deep learning model, you may have noticed that at a certain point, the training loss goes from smoothly decreasing to suddenly exhibiting strange “spikes”. Research has linked these spikes to sharpness, a measure of the curvature of the loss landscape at a given point. Mathematically, sharpness, which we'll denote \(\lambda_{\max}\), is the maximum eigenvalue of the Hessian matrix (the second-order derivative) of the loss function. During training, a process called progressive sharpening occurs, in which the sharpness gradually increases. Progressive sharpening continues until sharpness reaches a threshold of \(2/\eta\), where \(\eta\) is the learning rate, at which point training enters a regime called the edge of stability. This regime is characterized by the end of progressive sharpening and sporadic oscillations in the still-decreasing training loss.
Adjust the sliders to see how Gradient Descent behaves on a simple quadratic loss landscape: \(f(x) = \frac{1}{2} \lambda x^2\). By taking the second derivative, \(\frac{d^2f}{dx^2} = \lambda\), we see that sharpness is defined by \(\lambda\).
Notice that as long as the learning rate \(\eta\) is strictly less than \(2/\lambda\), the optimizer successfully converges to the minimum at the bottom of the curve.
However, if you push the learning rate or the sharpness too high so that \(\eta > 2/\lambda\), the updates become unstable. The optimizer overshoots the minimum by an increasingly large margin at each step, violently diverging up the walls of the loss landscape.
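The interactive demo above can be reproduced in a few lines. Below is a minimal sketch of gradient descent on the same quadratic \(f(x) = \frac{1}{2}\lambda x^2\); the function name and values are our own choices for illustration. Each step multiplies \(x\) by \((1 - \eta\lambda)\), so the iterates shrink exactly when \(|1 - \eta\lambda| < 1\), i.e. \(\eta < 2/\lambda\).

```python
# Gradient descent on f(x) = 0.5 * lam * x**2, whose sharpness is lam.
# Each step maps x -> (1 - eta * lam) * x, so iterates converge
# iff |1 - eta * lam| < 1, i.e. eta < 2 / lam.

def run_gd(x0, lam, eta, steps=50):
    x = x0
    for _ in range(steps):
        x = x - eta * lam * x  # gradient of 0.5*lam*x^2 is lam*x
    return x

stable = run_gd(x0=1.0, lam=4.0, eta=0.4)    # eta < 2/lam = 0.5 -> converges
unstable = run_gd(x0=1.0, lam=4.0, eta=0.6)  # eta > 2/lam -> diverges

print(abs(stable), abs(unstable))
```

With \(\lambda = 4\), the threshold sits at \(\eta = 0.5\): the first run collapses toward the minimum while the second grows by a factor of \(1.4\) per step.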
Conventional optimization theory states that in classic gradient descent, when sharpness exceeds \(2/\eta\), training should diverge, as shown in the interactive diagram above. However, this is not the case for deep learning models using gradient descent. In fact, upon breaching this \(2/\eta\) threshold, sharpness often lingers right at or just above \(2/\eta\). And despite meeting and exceeding the threshold, training loss continues to decrease, though with the aforementioned “spikes.” This means that the algorithm somehow successfully optimizes the training objective without further increasing the sharpness.
This counterintuitive behavior lends meaning to the name edge of stability, where the edge is a thin margin where training is most effective yet barely stable. Below this edge, learning is sluggish; above it, as in the case of an aggressively high learning rate, the optimizer diverges. It is only by navigating this threshold that the neural network actually trains in an efficient manner.
Research on the edge of stability has mainly focused on gradient descent as its primary optimizer. However, gradient descent is seldom used in real-world models. This project therefore attempts to establish the presence of the edge of stability in other contemporary optimizers, including stochastic gradient descent, Adam, Muon, and Shampoo. We especially highlight our analysis of the matrix-based optimizers Muon and Shampoo, for which there is little to no existing research on the edge of stability. We hope this project helps introduce the idea of the edge of stability and contributes to further research on optimization in deep learning.
To empirically investigate these optimization dynamics, we employed a standardized experimental framework. Our models were trained on a 5,000-image subset of the CIFAR-10 dataset, consisting of \(32\times32\) RGB images across ten balanced classes. We utilized a fully connected neural network (MLP) architecture, optimized against a Mean Squared Error (MSE) objective function. To track the local geometry of the loss landscape, we utilized the Power Iteration method to provide a computationally efficient estimate of the Hessian's maximum eigenvalue, which serves as our proxy for sharpness.
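To make the sharpness measurement concrete, here is a minimal sketch of power iteration. For readability it operates on an explicit symmetric matrix; in our actual setting the Hessian is never materialized, and the product \(Hv\) would instead come from autograd Hessian-vector products. The function name and toy matrix are our own.

```python
import numpy as np

def power_iteration(H, iters=200, seed=0):
    """Estimate the largest-magnitude eigenvalue of a symmetric matrix H.
    In practice H is not formed explicitly; the product H @ v is replaced
    by a Hessian-vector product computed with automatic differentiation."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = H @ v            # the only operation the Hessian must support
        v = hv / np.linalg.norm(hv)
    return v @ (H @ v)        # Rayleigh quotient ~= lambda_max

H = np.diag([3.0, 1.0, 0.5])  # toy "Hessian" with lambda_max = 3
print(power_iteration(H))
```

Because only matrix-vector products are needed, the cost per iteration is comparable to one extra backward pass, which is what makes sharpness tracking feasible during training.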
The main hyperparameters are the learning rate \(\eta\) and optimizer choice, though batch size, epochs, and other settings also play a large role in our experiments. The specifics can be found in our codebase. Additionally, our report provides a more theoretical discussion of the findings, including the proofs, procedures, and analysis on additional optimizers.
Let's first set the baseline and understand what the edge of stability looks like using gradient descent. Gradient Descent (GD) is the foundational optimization algorithm used to train deep neural networks. It operates by calculating the gradient of the loss function with respect to the model's parameters across the entire training dataset. The network's parameters are then updated by taking a step in the direction of the steepest descent, scaled by a fixed learning rate, \(\eta\). As we know from above, when sharpness passes \(2/\eta\), GD enters the edge of stability, causing progressive sharpening to stop and loss to decrease with spikes.
Unlike its full-batch counterpart, Stochastic Gradient Descent (SGD) approximates the true gradient by sampling a small subset of the data, or a "mini-batch," at each step. This introduces stochastic noise into the optimization process, defined by the update rule \(\theta_{t+1} = \theta_t - \eta \nabla L_B(\theta_t)\). In the context of the edge of stability, while full-batch GD is constrained by the maximum eigenvalue of the Hessian across the entire dataset, SGD interacts with the local batch sharpness, which measures the expected directional curvature of the mini-batch loss surface along the direction of the step. While the full Hessian sharpness remains bounded just under the \(2/\eta\) threshold, the batch sharpness can exceed this threshold and is more closely linked to the loss spikes. Therefore, it is this batch sharpness that enters a regime called the edge of stochastic stability (EoSS).
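The quantity EoSS tracks can be written down directly: the curvature of the mini-batch loss along the (normalized) update direction, \(g^\top H_B \, g / \|g\|^2\). Here is a small sketch with a toy mini-batch Hessian and gradient; the values are purely illustrative.

```python
import numpy as np

def directional_sharpness(H_batch, g):
    """Curvature of the mini-batch loss along the update direction g:
    g^T H_B g / ||g||^2. This is the 'batch sharpness' that EoSS tracks,
    rather than the full-dataset lambda_max."""
    g = g / np.linalg.norm(g)
    return g @ H_batch @ g

# Toy mini-batch Hessian and gradient (illustrative values only)
H_B = np.array([[5.0, 0.0], [0.0, 1.0]])
g = np.array([1.0, 1.0])

print(directional_sharpness(H_B, g))  # averages the curvatures seen: (5 + 1) / 2
```

Note that this directional quantity can exceed or fall below \(\lambda_{\max}\) of the full-dataset Hessian depending on where the mini-batch gradient points, which is why it, and not \(\lambda_{\max}\), governs the spikes under SGD.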
Sharpness appears to be influenced by the batch size, with larger batch sizes exhibiting sharpness values closer to the stability boundary of \(2/\eta\).
Batch size also appears to have an inverse effect on batch sharpness versus full Hessian sharpness: small batch sizes show higher batch sharpness values but lower full Hessian sharpness values.
Illustrated are 3 different batch sizes \(\mathcal{B}\) used to train the same model with a learning rate of \(\eta = 0.05\), meaning they all have a threshold of \(2 / \eta = 40\).
Adam is an adaptive optimizer that scales each parameter update using a preconditioner built from recent gradients. Unlike plain gradient descent, Adam is not limited by the raw curvature of the loss. It operates at the Adaptive Edge of Stability (AEoS), where the maximum eigenvalue of the preconditioned Hessian, called the preconditioned sharpness \(\lambda_A\), stays near the stability threshold
\[\lambda_A < \frac{2(1+\beta_1)}{(1-\beta_1)\,\eta}\]
Here \(\eta\) is the learning rate and \(\beta_1\) is Adam’s first-moment decay, which controls how strongly recent gradients influence the momentum term. This is what lets Adam keep moving through high-curvature regions without becoming unstable in the same way as non-adaptive methods.
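To see where the preconditioner comes from, here is a minimal single-parameter Adam step; the function name is ours, but the update follows the standard Adam recurrences. The running averages \(m\) and \(v\) are exactly what reshapes the curvature Adam "feels" into the preconditioned Hessian.

```python
import math

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. The running averages m and v
    form the preconditioner: the effective step is eta / (sqrt(v_hat) + eps),
    so stability depends on the preconditioned, not raw, curvature."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias correction for the first moment
    v_hat = v / (1 - beta2**t)   # bias correction for the second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# On the very first step, m_hat / sqrt(v_hat) ~= sign(grad),
# so the parameter moves by roughly eta regardless of gradient scale.
theta, m, v = adam_step(theta=0.0, grad=4.0, m=0.0, v=0.0, t=1, eta=0.01)
print(theta)
```

This scale-invariance of the step is why the raw sharpness \(\lambda_H\) can keep growing while the preconditioned sharpness \(\lambda_A\) is the quantity pinned near the threshold.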
What the plots below show: the first two compare how changing the learning rate \(\eta\) and the momentum parameter \(\beta_1\) affect both training loss and preconditioned sharpness over time. The bottom plot highlights the main point: the solid curves show the raw sharpness \(\lambda_H\) continuing to rise, while the dashed curves show the preconditioned sharpness \(\lambda_A\) staying close to the edge. Adam keeps adapting as curvature grows, which is what makes its dynamics different.
Muon is a matrix-based optimizer designed for application to inner weight matrices of neural networks. Instead of adapting each parameter independently, Muon operates on the entire weight matrix of a layer and modifies the update using matrix structure. In particular, it first forms a momentum update and then orthogonalizes this update before applying it to the parameters.
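The orthogonalization step can be sketched as follows. For clarity we use an exact SVD to map the momentum matrix to the nearest semi-orthogonal matrix \(UV^\top\); Muon in practice approximates this with a few Newton-Schulz iterations to avoid the cost of an SVD. The function name and toy matrix are our own.

```python
import numpy as np

def orthogonalize(update):
    """Replace the momentum update M with the nearest semi-orthogonal
    matrix U V^T from its SVD, so every singular value of the applied
    update is 1. Muon approximates this with Newton-Schulz iterations."""
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt

M = np.array([[3.0, 1.0], [0.0, 2.0]])  # toy momentum buffer
O = orthogonalize(M)
print(O @ O.T)  # ~ identity: all singular values equal 1
```

Because every direction in the update is applied with equal magnitude, the step size no longer scales with the dominant singular value of the gradient, which plausibly contributes to Muon's unusual relationship between learning rate and tolerated sharpness.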
Muon shows some characteristics of the edge-of-stability regime, such as progressive sharpening and destabilization of the loss. However, a major difference is Muon's non-monotonic relationship between the learning rate and the maximum sharpness reached during training. In coordinate-wise optimizers, this relationship is strictly inverse: as the learning rate decreases, the network enters sharper regions of the loss landscape. While this holds for very small Muon learning rates, the opposite holds for most learning rates: as the learning rate increases, the network can tolerate sharper regions of the loss.
Shampoo is a structured second-order optimizer that approximates curvature information while remaining practical for large neural networks. Instead of maintaining a full preconditioning matrix over all parameters, Shampoo factorizes curvature along each tensor dimension. For a weight matrix \(W \in \mathbb{R}^{m \times n}\) with gradient \(G_t\), it accumulates second-moment statistics separately for the row and column directions.
\[ L_t = L_{t-1} + G_t G_t^\top, \qquad R_t = R_{t-1} + G_t^\top G_t \]
These matrices act as left and right preconditioners that rescale the gradient during the update. The resulting update rule \(W_{t+1} = W_t - \eta \, L_t^{-1/4} G_t R_t^{-1/4}\) approximates the inverse square root of a Kronecker-factored curvature matrix \(L_t^{1/2} \otimes R_t^{1/2}\). Because the gradient is rescaled differently along each dimension, Shampoo produces direction-dependent step sizes that help it adapt to sharp and flat directions of the loss landscape.
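The update above can be sketched directly from the formulas. This is a bare-bones illustration, not a production Shampoo implementation: real implementations add a damping term, update the inverse roots only periodically, and handle higher-order tensors. The fractional matrix power uses an eigendecomposition with a small floor on the eigenvalues (our choice) to keep the inverse roots finite.

```python
import numpy as np

def matrix_power(A, p, eps=1e-12):
    """Fractional power of a symmetric PSD matrix via eigendecomposition,
    flooring eigenvalues at eps so negative powers stay finite."""
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.maximum(w, eps) ** p) @ Q.T

def shampoo_step(W, G, L, R, eta=0.1):
    """One Shampoo update for a weight matrix W with gradient G.
    L and R accumulate row- and column-space second-moment statistics."""
    L = L + G @ G.T                 # L_t = L_{t-1} + G G^T
    R = R + G.T @ G                 # R_t = R_{t-1} + G^T G
    precond_G = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - eta * precond_G, L, R

m, n = 3, 2
W = np.zeros((m, n))
G = np.ones((m, n))                 # toy gradient
L, R = np.zeros((m, m)), np.zeros((n, n))
W, L, R = shampoo_step(W, G, L, R)
print(W.shape, L.shape, R.shape)    # (3, 2) (3, 3) (2, 2)
```

Note the cost structure: the preconditioners are \(m \times m\) and \(n \times n\) rather than \(mn \times mn\), which is what makes this Kronecker factorization practical for large layers.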
Under our experimental conditions, the Shampoo optimizer does not exhibit the classical edge-of-stability phenomenon. Unlike SGD and Adam, Shampoo's sharpness decreases throughout training rather than increasing toward the stability threshold. This suggests that Shampoo's preconditioner suppresses updates in high-curvature directions rather than allowing sharpness to accumulate. Whether a different stability threshold governs Shampoo's dynamics remains an open question.
Training deep networks is not just about driving the loss down. The curvature of the loss landscape, or sharpness, keeps changing as we train, and optimizers respond in different ways. We started from the idea that sharpness tends to rise until it hits a stability limit tied to the learning rate. That’s the edge of stability, and in practice you see it as loss curves that bounce around while still trending downward.
Full-batch gradient descent rides that boundary. It pushes sharpness up until it sits at \(2/\eta\), then keeps going without blowing up. SGD does something similar, except the thing that matters is batch sharpness, not the Hessian over the full dataset. Adam is different. It adapts to the landscape, so its preconditioned sharpness stays near a different threshold and it can keep stepping into high-curvature regions while the raw sharpness of the loss keeps growing. The plots we showed let you see how \(\eta\) and \(\beta_1\) shape both the loss and the preconditioned sharpness over time. Muon and Shampoo take these ideas further with other preconditioning strategies.
If you understand how each optimizer hits that edge, it’s easier to see why training looks the way it does and how to tune or design methods that are fast and stable. We hope this page gives you a clear picture of that.