Muon optimizer learns more from rare data than Adam
i need to dig deeper. i think the industry is coalescing on:
- Adam is faster
- Muon is more stable
- and now: Muon learns more from rare data
i wonder if that’s why Kimi K2 has the vibe it has
arxiv.org/abs/2509.26030