Muon Doesn’t Clearly Grok Faster
June 20, 2025
After a decade of Adam's dominance, promising second-order alternatives are emerging. Among them, Muon strikes a good balance between simplicity and performance. In our prior work [muon], we showed that Muon achieves better compute-time tradeoffs during pre-training. Motivated by recent claims that Muon accelerates grokking [Tveit et al.], we explore grokking as a potential testbed for understanding how different optimizers affect learning dynamics such as memorization and generalization.
Based on our results, we find grokking to be an insufficient testbed for disentangling the learning dynamics of the optimizers we study. Specifically, as [Tveit et al.] observe, Muon does grok faster than AdamW under certain conditions; however, once we broaden the search to other hyperparameters and model sizes, this advantage disappears. The onset and duration of grokking were highly sensitive to factors such as batch size and embedding dimension, which made it difficult to isolate optimizer-specific dynamics. There may be a pattern to the conditions under which one optimizer groks faster than another, but we leave that to future work. Nonetheless, we believe this investigation is a necessary step toward that goal.
While we only offer empirical evidence, we hope our findings help the community understand grokking better and encourage further work in disentangling the learning behaviors of different optimizers.
Goal of the study
Grokking is the phenomenon where a model achieves perfect training accuracy early on but continues to perform poorly on the test set, only to generalize after prolonged overfitting. The figure below, which plots train and validation accuracy, shows an example of grokking from our experiments:
In particular, we aim to explore the following: does Muon achieve better token efficiency than AdamW on an algorithmic grokking task? More specifically:
1. How does the rank of the gradient update affect this?
2. How do hyperparameters such as the model's embedding dimension and the batch size affect these tradeoffs?
Our approach
We evaluate on a modular division dataset with a modulus of 97, using a 50-50 split between training and test data [Power et al.]. We define grokking as the point during training at which the model's validation accuracy first comes within 1% of the maximum validation accuracy attained over the run. For a run to count as exhibiting grokking, the model must reach a validation accuracy of at least 95% at some point during training.
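To make this criterion concrete, here is a minimal sketch of how the grokking step can be extracted from a logged validation-accuracy curve. The function name is ours, and interpreting "1%" as an absolute tolerance of 0.01 is an assumption; the returned index is an evaluation index, which maps to a training step via the logging interval.

```python
import numpy as np

def grokking_step(val_acc: np.ndarray, min_acc: float = 0.95, tol: float = 0.01):
    """Return the index of the first evaluation whose accuracy is within
    `tol` of the best accuracy reached in the run, or None if the run never
    clears `min_acc` (i.e. never qualifies as grokking)."""
    val_acc = np.asarray(val_acc, dtype=float)
    best = float(val_acc.max())
    if best < min_acc:
        return None  # the run never generalizes well enough to count as grokking
    # first evaluation within 1% of the maximum validation accuracy
    return int(np.argmax(val_acc >= best - tol))
```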
All experiments are conducted on a small transformer model with a single layer.
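For concreteness, below is a minimal sketch of the setup: the modular-division data (pairs (a, b) labelled with a·b⁻¹ mod 97, shuffled and split 50-50) and a single-layer transformer. The token layout (two operand tokens, answer read from the last position), embedding dimension, head count, and split seed are illustrative assumptions, not the exact configuration used in our runs.

```python
import torch
import torch.nn as nn

P = 97  # modulus for the modular-division task

def make_division_data(seed: int = 0):
    """All pairs (a, b) with b != 0, labelled with a * b^-1 mod P,
    shuffled and split 50-50 into train and test halves."""
    pairs = [(a, b) for a in range(P) for b in range(1, P)]
    labels = [(a * pow(b, -1, P)) % P for a, b in pairs]
    x = torch.tensor(pairs)   # shape (N, 2): the two operand tokens
    y = torch.tensor(labels)  # shape (N,):   the answer token
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(x), generator=g)
    x, y = x[perm], y[perm]
    half = len(x) // 2
    return (x[:half], y[:half]), (x[half:], y[half:])

class OneLayerTransformer(nn.Module):
    """Single transformer block over the two operand tokens; the answer is
    read out from the final position. Dimensions here are illustrative."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(P, d_model)
        self.pos = nn.Parameter(torch.zeros(2, d_model))  # learned positions
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.head = nn.Linear(d_model, P)

    def forward(self, x):            # x: (batch, 2) integer operands
        h = self.embed(x) + self.pos
        h = self.block(h)
        return self.head(h[:, -1])   # logits over the P possible answers
```

In a full run, this model would be trained to 100% training accuracy with AdamW or Muon and evaluated on the held-out half to detect grokking.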
Measuring grokking start
Grokking start is a metric that is rarely reported, yet we find it varies significantly across experimental settings. We define it as the earlier of the following two steps:
1. The step at which the first-order gradient of the validation-accuracy curve reaches its maximum.
2. The step at which its second-order gradient reaches its maximum.
To mitigate noise, all gradients are smoothed and normalized, with outliers removed. The figure below illustrates the overall process.
The motivation behind this method is that at the onset of grokking, evaluation scores typically rise sharply. We therefore expect the first- and second-order gradients of these scores to be high. Manual visual inspection confirmed this intuition, suggesting the method reliably identifies the grokking onset.
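A minimal sketch of one way to implement this detection is shown below: take discrete first- and second-order gradients of the validation-accuracy curve, smooth and normalize them, clip outliers, and return the earlier of the two peak locations. The smoothing window, normalization scheme, and 3-sigma clipping are illustrative choices that the description above does not pin down.

```python
import numpy as np

def grokking_start(val_acc: np.ndarray, window: int = 5, clip_sigma: float = 3.0):
    """Estimate grokking onset as the earlier of the peaks of the first- and
    second-order gradients of the validation-accuracy curve."""
    d1 = np.gradient(np.asarray(val_acc, dtype=float))  # first-order gradient
    d2 = np.gradient(d1)                                 # second-order gradient

    def peak(signal):
        # smooth with a moving average ...
        kernel = np.ones(window) / window
        s = np.convolve(signal, kernel, mode="same")
        # ... normalize to unit scale ...
        s = s / (np.abs(s).max() + 1e-12)
        # ... and clip values beyond clip_sigma standard deviations (outliers)
        mu, sigma = s.mean(), s.std() + 1e-12
        s = np.clip(s, mu - clip_sigma * sigma, mu + clip_sigma * sigma)
        return int(np.argmax(s))

    return min(peak(d1), peak(d2))
```

Taking the minimum of the two peak locations simply favors the earliest strong signal of the accuracy curve turning upward.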
What we found
1. Increasing the base embedding dimension leads to faster grokking.
2. A higher batch size appears to slow down grokking.
3. Muon did not consistently outperform AdamW; variations in hyperparameters changed the relative performance of the two optimizers. Unlike recent work [Tveit et al.], we could not surface a clear relationship between the token efficiency of Muon and that of AdamW.
Conclusion
Overall, we find that the base embedding dimension and batch size play a critical role in grokking. However, regarding the learning dynamics of optimizers, we were unable to identify a clear relationship between generalization speed and optimizer rank. Our experiments showed that different settings favored different optimizers, with no single optimizer consistently performing best.