Bluesky Thread

HRM analysis by @dorialexander.bsky.social


the actual shocking parts:

* it doesn’t overfit
* ARC-AGI is only hard for language models

i think we’ll be seeing more of HRM
Ok, a few notes on the hierarchical reasoning model (which would be more aptly named a hierarchical recurrent transformer).
First, I'm not surprised at all: small specialized models are very undertapped right now. They have always remained extremely performant for their size range on dedicated tasks like RL simulations, OCR, specialized image classification/segmentation. The key thing is that you can decide what representation is optimal for the task: vision language models have to deal with highly composite patches and tokens, while RNN OCR models work straight with letters and pixels.
Small models are in fact even more undertapped as they benefit directly from LLM innovations at two levels:
1. Architecture (HRM doesn't just borrow transformer blocks, it includes the whole modern set: attention, RoPE, SwiGLU... see the sketch after this list).
2. Data (less critical for HRM, as we'll see): we can suddenly unlock a massive amount of structured/simulated synthetic data.
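For concreteness, here is a minimal sketch of one of those borrowed components, a SwiGLU feed-forward block in PyTorch. It's illustrative only, not the HRM codebase; the attention and RoPE pieces are assumed to sit around it in the full block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block, one of the 'modern LLM' components listed above.
    Illustrative sketch only; names and dimensions are not taken from the HRM code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate(x)) * up(x), projected back to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```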
Now what distinguishes HRM is not reasoning.
I'm going to stress it: language models do reason all the time. "Reasoning traces" is kind of a misnomer; if you take a snapshot of an attention graph trying to solve an equation at time n, that is obviously non-verbalized reasoning.
What makes HRM actually promising: it seems extremely resilient to overfitting.
I was surprised to see that the 27 million parameters needed 50-100 H100-hours of training: that's actually enough compute for a pretraining run far past Chinchilla-optimal, something like 50-200B tokens. Now the real trick: training is done over 50,000 epochs, i.e. 50,000 repetitions of the same set. With most deep learning architectures this would be the perfect recipe for overfitting. Instead, they seem to have very carefully designed their architecture to avoid that, with a separation into two modules: a low-level module for fast computation (L-module) and a high-level module (H-module) for general planning/regularization. The L-module is constantly reset, which seemingly avoids premature convergence.
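For intuition, a toy sketch of that two-level recurrence. The GRU cells, the reset to a learned initial state, and the step counts are my own stand-ins for illustration, not the paper's actual modules:

```python
import torch
import torch.nn as nn

class HierarchicalCore(nn.Module):
    """Toy sketch of the two-level recurrence: a fast L-module that runs several
    inner steps per outer step, and a slow H-module that updates once per outer
    step. The L-state is re-initialised each outer step, which (as I read it) is
    what keeps the fast module from converging prematurely."""
    def __init__(self, d_model: int, inner_steps: int = 4):
        super().__init__()
        self.l_cell = nn.GRUCell(d_model, d_model)   # stand-in for the L-module
        self.h_cell = nn.GRUCell(d_model, d_model)   # stand-in for the H-module
        self.l_init = nn.Parameter(torch.zeros(d_model))
        self.inner_steps = inner_steps

    def forward(self, x: torch.Tensor, h: torch.Tensor, outer_steps: int = 8):
        for _ in range(outer_steps):
            # reset the fast state at every outer step
            l = self.l_init.expand_as(h)
            for _ in range(self.inner_steps):
                l = self.l_cell(x + h, l)    # fast, low-level computation
            h = self.h_cell(l, h)            # slow, high-level update
        return h
```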
There are other nice aspects that seem to be markedly inspired by the experience we get with language reasoning models but I'll need more time to dig.
A further catch: it confirms ARC-AGI is only hard in a language model context. That was the entire motivation of the evaluation set: something seemingly simple for humans that language models can't solve, mostly because autoregressive models continue to struggle with any spatial task (*cough* that IMO 2025 exercise 6 *cough*). As we see clearly now, since the exercise is accessible to humans, it must also be approachable with alternative methods.
Now where does it lead us? There is definitely a renewed interest in "pure" RL or alternative architectures for world models like JEPA (since it's also evidently clear that LLMs/LRMs struggle more than they should). Maybe, similarly to the way the brain is structured, we need to come back to some concept of modularity, and this will be critical to unlock the next phase of AI.
Tim Kellogg @timkellogg.me
HRM: Hierarchical Reasoning Model

ngl this sounds like bullshit but i don’t think it is

- 27M (million parameters)
- 1000 training examples
- beats o3-mini on ARC-AGI

arxiv.org/abs/2506.21734
16 hours later
another take, even more depth & nuance

x.com/n8programs/s...
N8 Programs @N8Programs
My take on HRM (after reading the paper): it's mathematically very beautiful, and the architecture is somewhat vindicated by the experiments they perform. The architecture itself is a work of art - they ingeniously incorporate early stopping (via Q-learning), avoid BPTT via a shallow approximation that's sparsely sampled every few steps of the model, etc.
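To make the no-BPTT trick concrete, here is a rough sketch of the idea as I read it: unroll the recurrence without tracking gradients and only backprop through the final step. The Q-learning halting head is left out and "core" is a placeholder, so treat this as an assumption-laden illustration, not the paper's implementation:

```python
import torch

def one_step_grad_update(core, x, h, n_steps: int = 16):
    """Approximate the recurrent gradient without BPTT: unroll most steps under
    no_grad, then take gradients through only the last step.
    'core' is any callable (x, h) -> h; a sketch, not the paper's code."""
    with torch.no_grad():
        for _ in range(n_steps - 1):
            h = core(x, h)
    h = core(x, h.detach())   # only this final step is on the autograd tape
    return h
```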
The results they achieve on Sudoku & Maze-Solving are extremely impressive given the sparse amount of training data, and indicate their architecture has amazing sample efficiency - its inductive biases (low-frequency & high-frequency representations to iteratively solve a problem) are conducive to great generalization.
The one red flag is the ARC-AGI methodology - they attach a "learnable special token that represents the puzzle it belongs to" to each training example - and the training examples are drawn from the train and evaluation sets. As I understand it, they hold out the singular example for each eval task that the model would normally be expected to perform few-shot inference on.
Instead, at evaluation time, the model uses the learnable special token to recall the appropriate task and apply it to the input grid. This superficially is very distinct from ARC - memorizing a bag of functions and recalling one with a pre-determined ID seems antithetical to few-shot generalization.
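Roughly, the task-token setup could look like this: an embedding table indexed by puzzle id, prepended to the grid tokens. A hypothetical sketch; names and shapes are mine, not the authors':

```python
import torch
import torch.nn as nn

class PuzzleConditionedInput(nn.Module):
    """Sketch of the 'learnable special token per puzzle' idea: each task id maps
    to a learned embedding that is prepended to the grid tokens, so the network
    can recall which transformation to apply."""
    def __init__(self, n_tasks: int, vocab_size: int, d_model: int):
        super().__init__()
        self.task_embed = nn.Embedding(n_tasks, d_model)
        self.tok_embed = nn.Embedding(vocab_size, d_model)

    def forward(self, task_id: torch.Tensor, grid_tokens: torch.Tensor):
        # task_id: (batch,), grid_tokens: (batch, seq)
        task = self.task_embed(task_id).unsqueeze(1)   # (batch, 1, d_model)
        toks = self.tok_embed(grid_tokens)             # (batch, seq, d_model)
        return torch.cat([task, toks], dim=1)          # task token prepended
```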
However, they only use 960 examples - essentially the amount in the original ARC + ConceptARC. This means many of their task IDs are learned from 3-4 examples during their very limited pre-training phase. There's no reason why, instead of setting the evaluation task ids in pretraining, they couldn't have a special novel task id they finetune at test time. Given, say, three in-context examples, they could augment them into a 20-30 example train set and test-time tune a new task id on those examples, then use that id for evaluation. Because their method has been demonstrated to have such extreme generalization from limited samples, this would likely yield a similar result, and would be fully in the spirit of ARC: generalizing at test-time from a limited set of examples (and would follow established ARC trends of test-time training).
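Sketched in code, that proposed fix might look like the following; the model interface, the augmented example set, and the hyperparameters are all hypothetical placeholders:

```python
import torch

def test_time_tune_task_id(model, new_task_embed, examples, steps: int = 200, lr: float = 1e-2):
    """Sketch of the proposed fix: keep the trained model frozen and learn only a
    fresh task embedding from a small, augmented set of demonstration pairs."""
    for p in model.parameters():
        p.requires_grad_(False)
    new_task_embed.requires_grad_(True)
    opt = torch.optim.Adam([new_task_embed], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(steps):
        for inp, target in examples:              # augmented demonstration pairs
            logits = model(new_task_embed, inp)   # model conditioned on the new id
            loss = loss_fn(logits.view(-1, logits.size(-1)), target.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return new_task_embed
```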
So all in all - definitely not BS! Very serious exploration of a two-level recurrent transformer with extreme sample efficiency. Methodological issue on ARC that could be corrected fairly easily. Very exciting paper overall!
One note however:
It's important not to confuse the HRM (which is, as of now, a small architecture used to train networks that generalize across narrow problems) with LLMs in comparisons - one is a very specialized experimental tool, and the other is a behemoth that supports every linguistic task under the sun. HRMs, like AlphaGo or AlexNet or any pre-GPT neural network, are black boxes that know nothing of language and solve their tasks entirely in latent space (which was the goal of the paper authors!). This means they should never be compared one-to-one with large language models or treated as a potential replacement. Of course, since HRM is seq-to-seq and could be updated to support causal masking, you could use it to solve the objective of next-token prediction! But that would be an entirely different experiment, and it would take us back to the space of linguistic reasoning, which the authors wanted to avoid.
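(For reference, "causal masking" here just means forbidding attention to future positions, e.g.:)

```python
import torch

# Causal mask for next-token prediction: position i may only attend to positions <= i.
seq_len = 8
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
# Added to the attention scores before softmax, this zeroes out attention to future tokens.
```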