Muon Doesn’t Clearly Grok Faster
June 20, 2025
After a decade of Adam's dominance, promising second-order alternatives are emerging. Among them, Muon strikes a good balance between simplicity and performance. In our prior work [muon], we showed that Muon achieves better compute-time tradeoffs during pre-training. Motivated by recent claims that Muon accelerates grokking [Tveit et al.], we explore grokking as a potential testbed for understanding how different optimizers affect learning dynamics such as memorization and generalization.
Based on our results, we find grokking to be an insufficient testbed for disentangling the learning dynamics of the optimizers we study. Specifically, as [Tveit et al.] observe, Muon does grok faster than AdamW under certain conditions; however, once we broaden the search to other hyperparameters and model sizes, this advantage disappears. The onset and duration of grokking were highly sensitive to factors such as batch size and embedding dimension, which made it difficult to isolate optimizer-specific dynamics. There may be a pattern to the conditions under which one optimizer groks faster than another, but we leave that to future work. Nonetheless, we believe this investigation is a necessary step toward that goal.
While we only offer empirical evidence, we hope our findings help the community understand grokking better and encourage further work in disentangling the learning behaviors of different optimizers.
Goal of the study
Grokking is the phenomenon where a model achieves perfect training accuracy early on but continues to perform poorly on the test set, only to generalize after prolonged overfitting. The figure below, which plots train and validation accuracy, shows an example of grokking from our experiments:
In particular, we aim to explore the following: does Muon achieve better token efficiency than AdamW on an algorithmic grokking task? More specifically:
1. How does the rank of the gradient update affect this?
2. How do hyperparameters such as the model's embedding dimension and the batch size affect these tradeoffs?
Our approach
We evaluate on a modular division dataset with a modulus of 97, using a 50-50 split between training and test data [Power et al.]. We define grokking as the point during training at which the model's validation accuracy first comes within 1% of the maximum validation accuracy attained over the run. For a run to count as exhibiting grokking, the model must reach a validation accuracy of at least 95% at some point during training.
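To make this criterion concrete, here is a minimal sketch of how the grokking step can be extracted from a logged validation-accuracy curve. The function name is ours, and interpreting "1%" as an absolute tolerance of 0.01 is an assumption; the returned index is an evaluation index, which maps to a training step via the logging interval.

```python
import numpy as np

def grokking_step(val_acc: np.ndarray, min_acc: float = 0.95, tol: float = 0.01):
    """Return the index of the first evaluation whose accuracy is within
    `tol` of the best accuracy reached in the run, or None if the run never
    clears `min_acc` (i.e. never qualifies as grokking)."""
    val_acc = np.asarray(val_acc, dtype=float)
    best = float(val_acc.max())
    if best < min_acc:
        return None  # the run never generalizes well enough to count as grokking
    # first evaluation within 1% of the maximum validation accuracy
    return int(np.argmax(val_acc >= best - tol))
```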
All experiments are conducted on a small transformer model with a single layer.
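For concreteness, below is a minimal sketch of the setup: the modular-division data (pairs (a, b) labelled with a·b⁻¹ mod 97, shuffled and split 50-50) and a single-layer transformer. The token layout (two operand tokens, answer read from the last position), embedding dimension, head count, and split seed are illustrative assumptions, not the exact configuration used in our runs.

```python
import torch
import torch.nn as nn

P = 97  # modulus for the modular-division task

def make_division_data(seed: int = 0):
    """All pairs (a, b) with b != 0, labelled with a * b^-1 mod P,
    shuffled and split 50-50 into train and test halves."""
    pairs = [(a, b) for a in range(P) for b in range(1, P)]
    labels = [(a * pow(b, -1, P)) % P for a, b in pairs]
    x = torch.tensor(pairs)   # shape (N, 2): the two operand tokens
    y = torch.tensor(labels)  # shape (N,):   the answer token
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(x), generator=g)
    x, y = x[perm], y[perm]
    half = len(x) // 2
    return (x[:half], y[:half]), (x[half:], y[half:])

class OneLayerTransformer(nn.Module):
    """Single transformer block over the two operand tokens; the answer is
    read out from the final position. Dimensions here are illustrative."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(P, d_model)
        self.pos = nn.Parameter(torch.zeros(2, d_model))  # learned positions
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.head = nn.Linear(d_model, P)

    def forward(self, x):            # x: (batch, 2) integer operands
        h = self.embed(x) + self.pos
        h = self.block(h)
        return self.head(h[:, -1])   # logits over the P possible answers
```

In a full run, this model would be trained to 100% training accuracy with AdamW or Muon and evaluated on the held-out half to detect grokking.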
Measuring grokking start
Grokking start is a metric that is rarely reported, yet we find it varies significantly across experimental settings. We define it as the earlier of the following two steps:
1. The step at which the first-order gradient of the validation-accuracy curve reaches its maximum.
2. The step at which its second-order gradient reaches its maximum.
To mitigate noise, all gradients are smoothed and normalized, with outliers removed. The figure below illustrates the overall process.
The motivation behind this method is that at the onset of grokking, evaluation scores typically rise sharply. We therefore expect the first- and second-order gradients of these scores to be high. Manual visual inspection confirmed this intuition, suggesting the method reliably identifies the grokking onset.
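A minimal sketch of one way to implement this detection is shown below: take discrete first- and second-order gradients of the validation-accuracy curve, smooth and normalize them, clip outliers, and return the earlier of the two peak locations. The smoothing window, normalization scheme, and 3-sigma clipping are illustrative choices that the description above does not pin down.

```python
import numpy as np

def grokking_start(val_acc: np.ndarray, window: int = 5, clip_sigma: float = 3.0):
    """Estimate grokking onset as the earlier of the peaks of the first- and
    second-order gradients of the validation-accuracy curve."""
    d1 = np.gradient(np.asarray(val_acc, dtype=float))  # first-order gradient
    d2 = np.gradient(d1)                                 # second-order gradient

    def peak(signal):
        # smooth with a moving average ...
        kernel = np.ones(window) / window
        s = np.convolve(signal, kernel, mode="same")
        # ... normalize to unit scale ...
        s = s / (np.abs(s).max() + 1e-12)
        # ... and clip values beyond clip_sigma standard deviations (outliers)
        mu, sigma = s.mean(), s.std() + 1e-12
        s = np.clip(s, mu - clip_sigma * sigma, mu + clip_sigma * sigma)
        return int(np.argmax(s))

    return min(peak(d1), peak(d2))
```

Taking the minimum of the two peak locations simply favors the earliest strong signal of the accuracy curve turning upward.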
What we found
1. Increasing the base embedding dimension leads to faster grokking.
2. A higher batch size appears to slow down grokking.
3. Muon did not consistently outperform AdamW; variations in hyperparameters changed the relative performance of the two optimizers. Unlike recent work [Tveit et al.], we could not surface a clear relationship between the token efficiency of Muon and that of AdamW.
Conclusion
Overall, we find that the base embedding dimension and batch size play a critical role in grokking. However, regarding the learning dynamics of optimizers, we were unable to identify a clear relationship between generalization speed and optimizer rank. Our experiments showed that different settings favored different optimizers, with no single optimizer consistently performing best.