Bluesky Thread

Looped LLMs

Unlike RNNs, HRM, etc., this starts with an already pretrained LLM, adds a loop, and trains a little longer

it’s a very frugal approach to creating deeper models

github.com/mcleish7/ret...
A schematic diagram comparing a layered model architecture (left) with a modified recurrent design (right).

On the left, a vertical stack of labeled blocks represents model layers from L0 to L21:
	•	The top layers L0–L1 are in light blue, with a dotted arrow leading right.
	•	Middle layers L2–L15 are shaded gray and marked “Removed.”
	•	Layers L16–L19 are dark green, connected by a green dotted arrow to the right.
	•	The bottom layers L20–L21 are red, connected by a red dotted arrow.

On the right, a corresponding vertical flow shows the new structure:
	•	At the top is a Prelude block (light blue).
	•	Below it, a small dark green box labeled s₀ ~ N(0, σ²) (its variance marked as variable) appears beside e and sᵢ.
	•	Next comes an Adapter (orange).
	•	A large Recurrent Block (dark green) includes a looping green arrow that feeds back into itself, showing iteration producing sᵢ₊₁.
	•	At the bottom are Coda (red) and p.

The diagram overall illustrates converting a deep stack of transformer layers into a compact recurrent architecture, where early layers form the prelude, middle ones are replaced by a recurrent block, and the final layers form the coda.
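
A minimal PyTorch-style sketch of the conversion described above; the Linear adapter over [e, sᵢ], the σ default, and the module names are illustrative assumptions, not the authors’ code:

    import torch
    import torch.nn as nn

    class LoopedLM(nn.Module):
        """Hypothetical looped model: prelude and coda reuse pretrained layers,
        while the removed middle stack is replaced by a small recurrent block."""
        def __init__(self, prelude, recurrent_block, coda, hidden_dim, sigma=0.02):
            super().__init__()
            self.prelude = prelude                  # e.g. pretrained layers L0–L1
            self.adapter = nn.Linear(2 * hidden_dim, hidden_dim)  # mixes e and s_i
            self.recurrent_block = recurrent_block  # e.g. pretrained layers L16–L19
            self.coda = coda                        # e.g. pretrained layers L20–L21
            self.sigma = sigma

        def forward(self, x, num_loops=4):
            e = self.prelude(x)                   # embed the input once
            s = torch.randn_like(e) * self.sigma  # s0 ~ N(0, sigma^2)
            for _ in range(num_loops):            # iterate the shared block
                s = self.recurrent_block(self.adapter(torch.cat([e, s], dim=-1)))
            return self.coda(s)                   # final layers produce the output p
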
Performance bumps significantly with even just a little recurrence

one question i don’t see answered — can you do it dynamically? bc most tokens don’t need the extra compute
A line graph comparing model accuracy versus test recurrence levels.

The x-axis is labeled “Test Recurrence” with values 1, 2, 4, 8, 16, and 32.
The y-axis is labeled “Accuracy”, ranging from 0.1 to 0.5.

Three lines are shown:
	•	Orange line (Train Recurrence 4): Starts around 0.18 accuracy at recurrence 1, rises quickly to about 0.32 at 2, then plateaus around 0.39 beyond recurrence 8.
	•	Blue line (Train Recurrence 16): Starts low near 0.11 at recurrence 1, climbs steeply to about 0.33 at 2, and peaks around 0.45 at recurrence 8–32.
	•	Green horizontal line (TinyLlama Non-Recurrent): Constant at roughly 0.27 accuracy across all recurrence levels.

The plot shows that models trained with higher recurrence perform better at higher test recurrences, while the non-recurrent baseline remains flat.
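
On the dynamic-compute question: one hedged way to sketch it (not something the thread or paper claims to do) is to keep looping only while the latent state is still changing, e.g.:

    import torch

    def adaptive_recurrence(e, s, recurrent_block, adapter, max_loops=32, tol=1e-3):
        """Hypothetical early-exit loop: iterate the shared block until the
        latent state stops changing much, then return it with the loop count."""
        for i in range(1, max_loops + 1):
            s_next = recurrent_block(adapter(torch.cat([e, s], dim=-1)))
            # batch-level convergence check; a per-token version would need to
            # mask positions that have already converged
            if (s_next - s).norm() / s.norm().clamp_min(1e-6) < tol:
                return s_next, i
            s = s_next
        return s, max_loops
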
they mention Baguettotron too. Depth really does seem to increase abilities significantly; there’s something special there

but more depth typically means more parameters to train. However, they found a way around that by looping! (and by starting with a pretrained LLM)

bsky.app/profile/timk...
Tim Kellogg @timkellogg.me
80 layers — for those not paying attention, @dorialexander.bsky.social has been posting for weeks about how small models with deep rather than wide layers exhibit eerie emergent behavior

this one is worth checking out

some context: 80 layers is very deep for a small model

what if you took Qwen3-4B with 36 layers and looped it 4x? That's somewhat analogous to 144 layers

it won't do better on knowledge benchmarks, but it certainly gets us closer to that coveted cognitive core that goes external for knowledge
Model                 Layers                             Source
K2 (Instruct)         61 (includes 1 dense layer)        Hugging Face
K2-Thinking           61 (same architecture)             Hugging Face
GLM-4.6               92 (num_hidden_layers in config)   Hugging Face
GPT-OSS-120B          36                                 OpenAI
GPT-OSS-20B           24                                 OpenAI
Qwen3-4B              36                                 Hugging Face
Qwen3-32B             64                                 Hugging Face
Qwen3-30B-A3B (MoE)   48
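
Back-of-the-envelope arithmetic for the “looped 4x” comparison above (a loose analogy, not a claim from the paper):

    # Looping a k-layer block n times costs roughly k*n layer passes at inference
    # while keeping only k layers' worth of parameters.
    def effective_depth(layers_per_pass: int, loops: int) -> int:
        return layers_per_pass * loops

    print(effective_depth(36, 4))  # Qwen3-4B's 36 layers looped 4x -> 144 passes
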
a rough rule of thumb for whether a model is over- or under-trained is “training tokens per parameter”

so this reduces both compute and data needs on initial pretrain

the downside is that it’s deep, so it takes more inference-time compute. But maybe there’s a missing depth scaling law for tiny models
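
For concreteness, a rough sketch of the tokens-per-parameter check; the ~20 tokens/param reference point is the commonly cited Chinchilla-style heuristic, and the model numbers below are made up:

    # Hypothetical tokens-per-parameter check: roughly 20 tokens per parameter is the
    # Chinchilla-style compute-optimal point; far more means the model is "over-trained".
    CHINCHILLA_OPTIMAL = 20.0

    def tokens_per_param(train_tokens: float, params: float) -> float:
        return train_tokens / params

    ratio = tokens_per_param(train_tokens=8e12, params=4e9)  # made-up 4B model on 8T tokens
    print(f"{ratio:.0f} tokens/param vs ~{CHINCHILLA_OPTIMAL:.0f} compute-optimal")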