it’s true, the last two Qwens have been beautiful works of art, until you talk to them. gpt-oss same, even GPT-5 to some extent
specifically this is referring to model architecture. cool new optimizers (K2’s) don’t seem to be what causes it
my hunch is there’s something shared between overly math-RL’d models and overly sparse models, because too much math-only RL produces the same kind of degradation
i bet RL on a narrow set of tasks shuts down a lot of pathways/circuits, and sparse MoE is the same but with those pathways physically removed
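to make the “physically removed” framing concrete, here’s a minimal sketch of top-k MoE routing (not any particular model’s actual code; layer sizes and routing details are illustrative assumptions). the point: for each token, experts outside the top-k get exactly zero weight, so those circuits never execute on that token at all

```python
# Toy top-k sparse MoE layer: per token, only k of n_experts ever run.
# Hypothetical sizes; real models use far more experts and learned balancing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)                     # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # keep only k experts/token
        weights = F.softmax(weights, dim=-1)        # renormalize over survivors
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e            # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out  # non-selected experts contributed nothing: pathway removed

x = torch.randn(16, 64)
print(SparseMoE()(x).shape)  # torch.Size([16, 64])
```

the hard zero from top-k is the contrast with narrow RL: RL can only down-weight pathways in a dense net, while sparse routing skips them outright per token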