Bluesky Thread

Optimizers are more important than we thought

Did you use Kimi K2 and think, "this seems different"? Some people posited that the MuonClip optimizer impacts the actual behavior of the model, not just convergence speed.

Indeed, a new paper landed that shows us this.

arxiv.org/abs/2507.12224
Optimizers Qualitatively Alter Solutions And We Should Leverage This
Due to the nonlinear nature of Deep Neural Networks (DNNs), one can not guarantee convergence to a unique global minimum of the loss when using optimizers relying only on local information, such as SG...
The takeaway: we should expect to see a lot more movement in optimizers. Same old models, rebuilt with new ones.

Also: K2 really is different. Yes, it's in the post-training, but it's also in the optimizer.
Side-by-side comparison of optimizer effects on catastrophic forgetting and feature alignment in class-incremental MNIST (minimal code sketches of both panels follow the caption below):

**Left plot (line chart):**

* Y-axis: "Accuracy of final trained parameters on test split of class pair"
* X-axis: label sets (class pairs trained sequentially): (0,1), (2,3), (4,5), (6,7), (8,9)
* Three lines:

  * **Adam (lr=1e-3)** in blue: shows near-zero accuracy until the final class pair (8,9), where accuracy spikes to 1.0
  * **Adam (lr=1e-4)** in green: flatlined at zero accuracy for all class pairs except the final one
  * **Shampoo** in orange: shows steady increase in accuracy across class pairs, avoiding catastrophic forgetting

**Right (heatmaps):**

* Title: "Effect of optimizer on cross-class representation cosine similarity"
* Three square heatmaps show cosine similarities between class representations for:

  * **Shampoo**: shows more variation in similarity values, with lower overall similarity (less degeneracy)
  * **Adam (lr=1e-3)**: higher and more uniform similarities across classes (more degenerate)
  * **Adam (lr=1e-4)**: similar to lr=1e-3, slightly less degenerate but still uniformly high similarity

**Caption:**
Figure 4: Left: catastrophic forgetting in a 2-layer MLP trained on class-incremental MNIST, where the network trains on each pair of classes sequentially. All networks exhibit worse performance on earlier class pairs, but the decline in performance is much sharper for Adam than for Shampoo. This effect is not mitigated by reducing the learning rate on Adam. Right: visualization of the alignment between features of different classes in each network. Features are more degenerate (higher cross-class cosine similarity) when training with Adam than with Shampoo.
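
The left panel's setup is simple enough to sketch. Here's a minimal, hedged reproduction in PyTorch: a 2-layer MLP trained on MNIST class pairs in sequence, with the *final* parameters evaluated on every pair's test split. The hidden width (256), batch size, learning rate, and one epoch per pair are my assumptions, not values from the paper, and the Shampoo line would need a third-party implementation such as the `torch-optimizer` package.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
tfm = transforms.ToTensor()
train_ds = datasets.MNIST("data", train=True, download=True, transform=tfm)
test_ds = datasets.MNIST("data", train=False, download=True, transform=tfm)

def pair_subset(ds, pair):
    # indices of samples whose label falls in this class pair
    idx = [i for i, t in enumerate(ds.targets) if int(t) in pair]
    return Subset(ds, idx)

def train_incrementally(make_opt, epochs_per_pair=1):
    # 2-layer MLP with a 10-way head; 256 hidden units is an assumption
    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                          nn.Linear(256, 10))
    opt = make_opt(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for pair in PAIRS:  # each pair seen once, in order -- no replay buffer
        loader = DataLoader(pair_subset(train_ds, pair),
                            batch_size=128, shuffle=True)
        for _ in range(epochs_per_pair):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
    return model

def accuracy_per_pair(model):
    # evaluate the FINAL parameters on each pair's test split (the y-axis)
    accs = {}
    model.eval()
    with torch.no_grad():
        for pair in PAIRS:
            loader = DataLoader(pair_subset(test_ds, pair), batch_size=512)
            correct = total = 0
            for x, y in loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
            accs[pair] = correct / total
    return accs

adam_model = train_incrementally(lambda p: torch.optim.Adam(p, lr=1e-3))
print(accuracy_per_pair(adam_model))  # expect: early pairs ~0, (8,9) near 1.0

# For the Shampoo line, one option is the third-party torch-optimizer
# package (an assumption -- not necessarily the paper's implementation):
#   import torch_optimizer
#   shampoo_model = train_incrementally(
#       lambda p: torch_optimizer.Shampoo(p, lr=1e-3))
```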
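
For the right-panel heatmaps, one plausible reading of "class representations" is the mean hidden-layer activation per class; the pairwise cosine similarities of those means give a 10×10 matrix like the ones shown. That aggregation choice is my assumption. This continues from the snippet above (it reuses `test_ds` and a trained model):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def cross_class_cosine(model, num_classes=10, hidden_dim=256):
    # "class representation" = mean hidden activation per class (assumption;
    # the paper may aggregate features differently)
    features = model[:-1]  # nn.Sequential slice: drop the final classifier
    sums = torch.zeros(num_classes, hidden_dim)
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for x, y in DataLoader(test_ds, batch_size=512):
            sums.index_add_(0, y, features(x))
            counts += torch.bincount(y, minlength=num_classes).float()
    means = F.normalize(sums / counts.unsqueeze(1), dim=1)
    return means @ means.T  # (10, 10): entry [i, j] = cos(mean_i, mean_j)

sim = cross_class_cosine(adam_model)
# uniformly high off-diagonal entries are the "degenerate features"
# signature; plotting with e.g. plt.imshow(sim) gives a Figure-4-style heatmap
```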