Looped LLMs
Unlike RNNs, HRM, etc., this starts with an already pretrained LLM, adds a loop, and trains a little longer
it’s a very frugal approach to creating deeper models
github.com/mcleish7/ret...
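roughly what the looping idea looks like in code (just a sketch with toy PyTorch blocks, not the repo's actual recipe; the real recurrence scheme, input injection, and which layers get shared will differ):

```python
import torch
import torch.nn as nn

class LoopedBlockStack(nn.Module):
    """Re-applies an existing stack of blocks n_loops times, re-using the
    same weights on every pass, so depth grows without new parameters."""
    def __init__(self, blocks: nn.ModuleList, n_loops: int = 4):
        super().__init__()
        self.blocks = blocks      # e.g. the decoder layers of a pretrained LLM
        self.n_loops = n_loops    # extra "depth" comes from repetition

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_loops):     # same parameters on every pass
            for block in self.blocks:
                hidden = block(hidden)
        return hidden

# toy stand-in blocks; a real use would wrap a pretrained model's layers
d_model = 64
blocks = nn.ModuleList([nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
                        for _ in range(4)])
looped = LoopedBlockStack(blocks, n_loops=3)   # 4 blocks * 3 loops ~ depth 12
x = torch.randn(2, 8, d_model)                 # (batch, seq, hidden)
print(looped(x).shape)                         # torch.Size([2, 8, 64])
```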
Performance bumps significantly with even just a little recurrence
one question i don’t see answered — can you do it dynamically? bc most tokens don’t need the extra compute
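not settled anywhere I can see, but one hypothetical way "dynamic" could look is an adaptive halting rule: keep looping until an extra pass barely changes the hidden state. Entirely an assumption on my part, nothing from the repo:

```python
import torch
import torch.nn as nn

def looped_forward_dynamic(blocks: nn.ModuleList, hidden: torch.Tensor,
                           max_loops: int = 8, tol: float = 1e-3) -> torch.Tensor:
    """Hypothetical adaptive looping: stop once an extra pass barely moves
    the hidden state, so 'easy' inputs spend fewer passes."""
    for _ in range(max_loops):
        prev = hidden
        for block in blocks:
            hidden = block(hidden)
        # relative change as a crude, made-up halting signal
        if (hidden - prev).norm() / (prev.norm() + 1e-8) < tol:
            break
    return hidden

d = 32
blocks = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.Tanh())
                        for _ in range(2)])
print(looped_forward_dynamic(blocks, torch.randn(1, 4, d)).shape)  # (1, 4, 32)
```

a true per-token version is presumably harder, since attention couples tokens: tokens that halt early would still need representations available for later passes to attend to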
they mention Baguettotron too. Depth really does seem to increase abilities significantly; there’s something special there
but more depth typically means more parameters to train. However they found a way around that, by looping! (and by starting with a pretrained LLM)
bsky.app/profile/timk...
80 layers: for those not paying attention, @dorialexander.bsky.social has been posting for weeks about how small models built deep rather than wide exhibit eerie emergent behavior
this one is worth checking out
some context: 80 layers is very deep for a small model
what if you took Qwen3-4B with 36 layers and looped it 4x? That's somewhat analogous to 144 layers
it won't do better on knowledge benchmarks, but it certainly gets us closer to that coveted cognitive core that goes external for knowledge
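back-of-the-envelope version of that analogy (layer and parameter counts from the post; the rest is just bookkeeping):

```python
# rough arithmetic for the analogy above, not a claim about actual quality
n_layers = 36           # Qwen3-4B decoder layers (per the post)
n_loops = 4             # times the whole stack would be re-applied
n_params = 4e9          # ~4B parameters, unchanged because weights are shared

effective_depth = n_layers * n_loops      # 144 layer applications per token
flop_multiplier = n_loops                 # forward cost grows ~linearly with loops
print(effective_depth, f"{n_params:.0e}", flop_multiplier)   # 144 4e+09 4
```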
a rough rule of thumb for judging whether a model is over- or under-trained is “training tokens per parameter”
so this reduces both compute and data needs on initial pretrain
the downside is it’s deep, so it takes more inference-time compute. But maybe there’s a missing depth scaling law for tiny models
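to make that rule of thumb concrete (illustrative numbers only; the ~20 tokens/param reference point is the oft-quoted Chinchilla-style heuristic):

```python
# illustrative only: made-up token counts, Chinchilla-style ~20 tokens/param
def tokens_per_param(train_tokens: float, n_params: float) -> float:
    return train_tokens / n_params

n_params = 4e9                                 # a 4B-parameter model
print(tokens_per_param(80e9, n_params))        # 20.0  -> roughly compute-optimal
print(tokens_per_param(1e12, n_params))        # 250.0 -> heavily over-trained
# looping adds effective depth without adding parameters, so the data
# needed to keep this ratio healthy doesn't grow with the extra depth
```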