Bluesky Thread

Muon optimizer learns more from rare data than Adam

October 03, 2025

i need to dig deeper. i think the industry is coalescing on:

- Adam is faster
- Muon is more stable
- now, Muon learns rare data better

i wonder if that’s why K2 has that vibe that it has

arxiv.org/abs/2509.26030

Muon Outperforms Adam in Tail-End Associative Memory Learning

The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through th...

btw both Kimi K2 & GLM >=4.5 were trained with Muon

maybe others, not sure. Adam was classically the preferred optimizer. Now, there’s no clear preference anymore. trade-offs all the way down
