s1: Simple inference-time scaling
This is a simple small-scale replication of inference-time scaling
It was cheap: 16xH100 for 26 minutes (so what, ~$6?)
It replicates inference-time scaling using SFT only (no RL)
Extremely data-frugal: 1,000 samples
arxiv.org/abs/2501.19393
The quirkiest part of this paper is budget forcing:
you can force the LLM to stay in its thinking stage for even longer by inserting words like "Wait" whenever it tries to terminate via the end-of-thinking delimiter
They did many variants with different thresholds and confirmed the inference-time scaling laws
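Roughly, in code, it's something like the sketch below. This is a minimal greedy-decoding sketch assuming a HuggingFace-style causal LM and a "</think>" end-of-thinking delimiter; the function name, budget numbers, and delimiter token are my assumptions, not the paper's actual implementation.

```python
import torch

def generate_with_budget_forcing(model, tokenizer, prompt,
                                 min_thinking_tokens=1024, max_new_tokens=4096):
    """Greedy decoding that suppresses the end-of-thinking delimiter and
    appends 'Wait' until a minimum thinking budget has been spent."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    end_think_id = tokenizer.convert_tokens_to_ids("</think>")   # assumed delimiter
    wait_ids = tokenizer("Wait", add_special_tokens=False).input_ids

    thinking_tokens = 0
    for _ in range(max_new_tokens):
        with torch.no_grad():
            next_id = int(model(input_ids).logits[:, -1, :].argmax(dim=-1))

        if next_id == end_think_id and thinking_tokens < min_thinking_tokens:
            # The model tried to stop thinking early: drop the delimiter and
            # splice in "Wait" so it keeps reasoning.
            input_ids = torch.cat([input_ids, torch.tensor([wait_ids])], dim=-1)
            thinking_tokens += len(wait_ids)
            continue

        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
        thinking_tokens += 1
        if next_id == end_think_id:
            break   # thinking phase done; answer decoding would follow here

    return tokenizer.decode(input_ids[0])
```

The whole trick is the three-line branch on the delimiter; everything else is an ordinary decoding loop.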
Dataset mixture
They showed that dataset design is crucial, and you really don't even need that much data
Important parts (rough filtering sketch after the list):
1. Quality (e.g. formatting not broken)
2. Difficulty (longer trace or lower performance = harder)
3. Diversity (more subjects)
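A toy sketch of that quality → difficulty → diversity filtering idea. All field names and thresholds here are invented for illustration, not the paper's actual schema or pipeline.

```python
import random
from collections import defaultdict

def select_1k(samples, target=1000, seed=0):
    # 1. Quality: drop samples with broken formatting.
    pool = [s for s in samples if s["formatting_ok"]]

    # 2. Difficulty: keep harder problems, proxied by longer reasoning traces
    #    and/or lower solve rates from weaker models.
    pool = [s for s in pool if s["trace_length"] > 2000 or s["solve_rate"] < 0.5]

    # 3. Diversity: spread the final picks across subjects so no single
    #    domain dominates the 1K mix.
    by_subject = defaultdict(list)
    for s in pool:
        by_subject[s["subject"]].append(s)

    rng = random.Random(seed)
    selected = []
    while len(selected) < target and by_subject:
        subject = rng.choice(list(by_subject))
        bucket = by_subject[subject]
        selected.append(bucket.pop(rng.randrange(len(bucket))))
        if not bucket:
            del by_subject[subject]
    return selected
```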
The R1 paper, OTOH, mixed RL & SFT but people seem to get most excited about R1-Zero, the pure-RL training run
s1 indicates that you can get a long way on SFT alone, so maybe R1 was right to also interleave them
FYI I used o3-mini to validate my understanding of the paper. Feel free to read the whole exchange
My favorite part of o3 is that it's comfortable stating opinions rather than simply sticking to facts
chatgpt.com/share/67a0f7...
The most surprising part, maybe, is that adding data didn’t improve performance **at all** (o3-mini nailed this, it’s a big deal)
Their full dataset was 56K samples; they narrowed it down to 1K high-signal samples. Any data beyond that (even the full dataset) barely nudged the performance
The cool part about these small-scale experiments is the comprehensive ablation studies. Like, check this out. My only complaint is they didn't try "oh crap"