Bluesky Thread

80 layers — for those not paying attention, @dorialexander.bsky.social has been posting for weeks about how small models with deep rather than wide layers exhibit eerie emergent behavior

this one is worth checking out
Alexander Doria @dorialexander.bsky.social
Synthetic playgrounds enabled a series of controlled experiments that led us to favor an extreme-depth design. We selected an 80-layer architecture for Baguettotron, with improvements across the board on memorization of logical reasoning: huggingface.co/PleIAs/Bague...
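
for context on the deep-vs-wide tradeoff: at a fixed parameter budget you can trade layers against width. a rough back-of-the-envelope in python (the widths and the 20-layer comparison point are made up for illustration, not the actual Baguettotron shape):

def block_params(d_model: int, mlp_ratio: int = 4) -> int:
    # per transformer block: roughly 12*d^2 params, ignoring norms and embeddings
    attn = 4 * d_model * d_model              # q, k, v, o projections
    mlp = 2 * mlp_ratio * d_model * d_model   # MLP up + down projections
    return attn + mlp

deep_narrow = 80 * block_params(512)     # 80 layers at d_model=512
shallow_wide = 20 * block_params(1024)   # 20 layers at d_model=1024
print(f"{deep_narrow:,} vs {shallow_wide:,}")  # identical budget, very different shape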
despite being the most French model yet, they had to rationalize why it wasn’t trained on French

but fr imagine being able to do ablations on THE ENTIRE end-to-end training process. you’d learn so much
Monad and Baguettotron were trained on 16 H100s from Jean Zay using the Nanotron framework from HuggingFace. This setting allowed for fast experimentation and iteration, Monad being trained in less than six hours. While Baguettotron reuses the standard Pleias tokenizer optimized for European languages, Monad uses a custom tokenizer trained on the English segment of SYNTH: this was a critical measure to contain the parameter space, bringing token embeddings down from 20M to less than 2M.
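
quick sanity check on that embedding number: a token-embedding matrix is just vocab_size times hidden_size parameters, so shrinking the vocabulary is the whole trick. the numbers below are illustrative guesses, not the actual Pleias/Monad configs:

def embedding_params(vocab_size: int, hidden_size: int) -> int:
    # token embedding matrix: vocab_size x hidden_size
    return vocab_size * hidden_size

print(f"{embedding_params(65_536, 320):,}")  # 20,971,520: a big multilingual vocab at a small width
print(f"{embedding_params(8_192, 224):,}")   # 1,835,008: a small English-only vocab, even narrower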
i’m surprised! i expected them to train in fp32, but no, they went with a legit bf16
@dorialexander.bsky.social i wish you blessings in the form of billions of euros in funding
so sweet
❯ uv run run.py
Loaded PleIAs/Baguettotron on mps. Type 'quit' to exit.
> bro! let's fucking go!
I'm really into you. It's a relationship that gets ridiculed at the same time we're supposed to appreciate it. It's funny, but it does a lot to you.

I'm glad you were asking the right questions. Is there anything else you're curious about?
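
(run.py itself isn't shown in the thread; here's a minimal sketch of a loop like that using transformers. the model id is from the thread; the chat-template call, sampling settings, and bf16-on-mps choice are assumptions:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "PleIAs/Baguettotron"
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# bf16 to match the reported training precision; fall back to float32 if mps complains
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to(device)
print(f"Loaded {MODEL_ID} on {device}. Type 'quit' to exit.")

while True:
    user = input("> ")
    if user.strip().lower() == "quit":
        break
    messages = [{"role": "user", "content": user}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
    print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))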