80 layers — for those not paying attention, @dorialexander.bsky.social has been posting for weeks about how small models with deep rather than wide layers exhibit eerie emergent behavior
this one is worth checking out
while being the most French model yet, they had to rationalize why it wasn’t trained on French
but fr imagine being able to do ablations on THE ENTIRE end-to-end training process. you’d learn so much
i’m surprised! i expected them to train in fp32, but no, they went with a legit bf16
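(for anyone curious what that looks like in practice, here’s a minimal sketch of a bf16 training step in PyTorch — purely illustrative, none of the names or settings below are from their actual training code:)

```python
# minimal sketch of bf16 mixed-precision training in PyTorch
# (illustrative only; not the actual model, optimizer, or training setup)
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

# matmuls run in bfloat16 inside autocast; dropping the context
# entirely would give you plain fp32 training instead
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()      # params (and their grads) stay fp32 under autocast
optimizer.step()
optimizer.zero_grad()

# "pure" bf16 (weights themselves stored in bfloat16) would instead be:
# model = model.to(torch.bfloat16)
```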
@dorialexander.bsky.social i wish you blessings in the form of billions of euros in funding
so sweet