Bluesky Thread

it’s true, the last two Qwens have been beautiful works of art, until you talk to them. gpt-oss is the same, and even GPT-5 to some extent

specifically, this is referring to model architecture; cool new optimizers (K2) don’t seem to be the cause
Teortaxes (DeepSeek 推特...):
deeply disappointing pattern: the fancier the model, the higher its efficiency, the worse the personality. Qwen3-Next is maybe smarter than 30B-A3B but even more insufferably slopped.
GPT-OSS is a dead robot. It’s not just benchmaxxing; something more is going on here, I think.
my hunch is there’s something similar going on in overly math-RL’d models and overly sparse models, because too much math-only RL gives you a similar result

i bet RL on a narrow set of tasks shuts down a lot of pathways/circuits, and sparse MoE does the same but with those pathways physically removed
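
For concreteness, here is a minimal sketch of standard top-k MoE routing (my own illustration under generic assumptions, not code from any model named in the thread; the function name and shapes are hypothetical). The relevant point: for every token, the router keeps only k of n expert pathways, and the remaining n−k contribute exactly zero, which is the “physically removed” half of the analogy.

```python
# Minimal top-k MoE routing sketch (illustrative only, not any real model's code).
# For each token, only k of n_experts pathways fire; the rest are inert,
# i.e. "physically removed" from that token's forward pass.
import torch
import torch.nn.functional as F

def topk_moe_forward(x, gate_w, experts, k=2):
    """x: (tokens, d); gate_w: (d, n_experts); experts: list of nn.Modules."""
    logits = x @ gate_w                    # router scores, (tokens, n_experts)
    topv, topi = logits.topk(k, dim=-1)    # keep only k pathways per token
    weights = F.softmax(topv, dim=-1)      # renormalize over the survivors
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topi[:, slot] == e      # tokens whose slot-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

# Toy usage: 8 experts, k=2 -> 6 of 8 pathways are dead for any given token.
d, n_experts = 16, 8
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
y = topk_moe_forward(torch.randn(4, d), torch.randn(d, n_experts), experts)
```

On this reading, the RL half of the hunch is the soft version of the same mechanism: narrow-task reward concentrates activation on a few pathways rather than deleting the rest outright.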
[Image: DARE logo with tagline “to resist math and coding”]
[Image: DARE logo with tagline, “to keep kids safe from overly sparse MoEs”]