Bluesky Thread

Optimizers are more important than we thought

Did you use Kimi K2 and think, "this seems different"? Some people posited that the MuonClip optimizer impacts the actual behavior of the model, not just convergence speed.

Indeed, a new paper landed that shows us this.

arxiv.org/abs/2507.12224
Optimizers Qualitatively Alter Solutions And We Should Leverage This
Due to the nonlinear nature of Deep Neural Networks (DNNs), one can not guarantee convergence to a unique global minimum of the loss when using optimizers relying only on local information, such as SG...
The takeaway: we should expect to see a lot more movement in optimizers. Same old models, rebuilt with new ones.

Also: K2 really is different. Yes, it's in the post-training, but it's also in the optimizer.
Side-by-side comparison of optimizer effects on catastrophic forgetting and feature alignment in class-incremental MNIST (minimal code sketches of both panels follow the caption below):

**Left plot (line chart):**

* Y-axis: "Accuracy of final trained parameters on test split of class pair"
* X-axis: label sets (class pairs trained sequentially): (0,1), (2,3), (4,5), (6,7), (8,9)
* Three lines:

  * **Adam (lr=1e-3)** in blue: shows near-zero accuracy until the final class pair (8,9), where accuracy spikes to 1.0
  * **Adam (lr=1e-4)** in green: flatlined at zero accuracy for all class pairs except the final one
  * **Shampoo** in orange: shows steady increase in accuracy across class pairs, avoiding catastrophic forgetting

**Right (heatmaps):**

* Title: "Effect of optimizer on cross-class representation cosine similarity"
* Three square heatmaps show cosine similarities between class representations for:

  * **Shampoo**: shows more variation in similarity values, with lower overall similarity (less degeneracy)
  * **Adam (lr=1e-3)**: higher and more uniform similarities across classes (more degenerate)
  * **Adam (lr=1e-4)**: similar to lr=1e-3, slightly less degenerate but still uniformly high similarity

**Caption:**
Figure 4: Left: catastrophic forgetting in a 2-layer MLP trained on class-incremental MNIST, where the network trains on each pair of classes sequentially. All networks exhibit worse performance on earlier class pairs, but the decline in performance is much sharper for Adam than for Shampoo. This effect is not mitigated by reducing the learning rate on Adam. Right: visualization of the alignment between features of different classes in each network. Features are more degenerate (higher cross-class cosine similarity) when training with Adam than with Shampoo.
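
The left panel's setup is simple enough to sketch. Here's a minimal, hedged reproduction in PyTorch: a 2-layer MLP trained on MNIST class pairs in sequence, with the *final* parameters evaluated on every pair's test split. The hidden width (256), batch size, learning rate, and one epoch per pair are my assumptions, not values from the paper, and the Shampoo line would need a third-party implementation such as the `torch-optimizer` package.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
tfm = transforms.ToTensor()
train_ds = datasets.MNIST("data", train=True, download=True, transform=tfm)
test_ds = datasets.MNIST("data", train=False, download=True, transform=tfm)

def pair_subset(ds, pair):
    # indices of samples whose label falls in this class pair
    idx = [i for i, t in enumerate(ds.targets) if int(t) in pair]
    return Subset(ds, idx)

def train_incrementally(make_opt, epochs_per_pair=1):
    # 2-layer MLP with a 10-way head; 256 hidden units is an assumption
    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                          nn.Linear(256, 10))
    opt = make_opt(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for pair in PAIRS:  # each pair seen once, in order -- no replay buffer
        loader = DataLoader(pair_subset(train_ds, pair),
                            batch_size=128, shuffle=True)
        for _ in range(epochs_per_pair):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
    return model

def accuracy_per_pair(model):
    # evaluate the FINAL parameters on each pair's test split (the y-axis)
    accs = {}
    model.eval()
    with torch.no_grad():
        for pair in PAIRS:
            loader = DataLoader(pair_subset(test_ds, pair), batch_size=512)
            correct = total = 0
            for x, y in loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
            accs[pair] = correct / total
    return accs

adam_model = train_incrementally(lambda p: torch.optim.Adam(p, lr=1e-3))
print(accuracy_per_pair(adam_model))  # expect: early pairs ~0, (8,9) near 1.0

# For the Shampoo line, one option is the third-party torch-optimizer
# package (an assumption -- not necessarily the paper's implementation):
#   import torch_optimizer
#   shampoo_model = train_incrementally(
#       lambda p: torch_optimizer.Shampoo(p, lr=1e-3))
```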
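
For the right-panel heatmaps, one plausible reading of "class representations" is the mean hidden-layer activation per class; the pairwise cosine similarities of those means give a 10×10 matrix like the ones shown. That aggregation choice is my assumption. This continues from the snippet above (it reuses `test_ds` and a trained model):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def cross_class_cosine(model, num_classes=10, hidden_dim=256):
    # "class representation" = mean hidden activation per class (assumption;
    # the paper may aggregate features differently)
    features = model[:-1]  # nn.Sequential slice: drop the final classifier
    sums = torch.zeros(num_classes, hidden_dim)
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for x, y in DataLoader(test_ds, batch_size=512):
            sums.index_add_(0, y, features(x))
            counts += torch.bincount(y, minlength=num_classes).float()
    means = F.normalize(sums / counts.unsqueeze(1), dim=1)
    return means @ means.T  # (10, 10): entry [i, j] = cos(mean_i, mean_j)

sim = cross_class_cosine(adam_model)
# uniformly high off-diagonal entries are the "degenerate features"
# signature; plotting with e.g. plt.imshow(sim) gives a Figure-4-style heatmap
```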