Bluesky Thread

Muon optimizer learns more from rare data than Adam

i need to dig deeper. i think the industry is coalescing on:

- Adam is faster
- Muon is more stable
- now, Muon learns rare data better

i wonder if that’s why Kimi K2 has the vibe it has

arxiv.org/abs/2509.26030
Muon Outperforms Adam in Tail-End Associative Memory Learning
The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through th...
btw both Kimi K2 & GLM >=4.5 were trained with Muon

maybe others, not sure. Adam was classically the preferred optimizer. Now, there’s no clear preference anymore. trade-offs all the way down