my dumbass version of this thread:
MiniMax M1 introduced a new RL algorithm, CISPO, that does all the post-training on their huge 456B model for ~$500k instead of ... idk, millions?
it all hinges on realizing when the model found something worth learning and quickly doubling down
if you recall, in the s1 paper they discovered that the "Wait" token was unreasonably good for forcing a new line of thought
the models naturally learn Wait and other tokens, and when to use them to change things up and try something new (or Therefore, to double down)
timkellogg.me/blog/2025/02...
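quick toy sketch of how that trick looks in practice (this is not the s1 repo's code — the model name and prompt are just placeholders I picked; s1 fine-tuned a Qwen2.5 model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder model for illustration only
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: what is 17 * 24? Think step by step.\nAnswer:"
ids = tok(prompt, return_tensors="pt").input_ids

# first pass: let the model reason until it decides it's done
out = model.generate(ids, max_new_tokens=256)

# budget forcing: append "Wait" to the finished trace and generate again,
# which nudges the model into re-checking its own chain of thought
forced = tok.decode(out[0], skip_special_tokens=True) + " Wait"
ids2 = tok(forced, return_tensors="pt").input_ids
out2 = model.generate(ids2, max_new_tokens=256)

print(tok.decode(out2[0], skip_special_tokens=True))
```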
the thing is, overfitting is a huge problem in ML, and the chief way to avoid it is to slow down the learning rate
so if the model discovers "Wait" or "Therefore" in one batch, it avoids learning too much bc that would cause overfitting
but those are disproportionately valuable lessons...
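toy pytorch illustration of that guard in action (not MiniMax's code, numbers made up): in a PPO/GRPO-style clipped objective, a token like "Wait" that was rare under the old policy but useful now blows past the clip range, and the clip zeroes out its gradient

```python
import torch

eps = 0.2
advantage = torch.tensor(1.0)        # the batch says this token helped

logp_new = torch.tensor(-1.0, requires_grad=True)  # new policy: "Wait" now likely
logp_old = torch.tensor(-4.0)                      # old policy: "Wait" was rare
ratio = torch.exp(logp_new - logp_old)             # ~20x, way past 1 + eps

clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
loss = -torch.min(ratio * advantage, clipped * advantage)
loss.backward()

print(ratio.item())     # ~20.1
print(logp_new.grad)    # tensor(0.) -- the valuable lesson gets thrown away
```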
CISPO has a slightly different take:
GRPO is already doing these in batches. if there are several samples in a batch that all discover "Wait", then scale the amount of learning proportionally
if there's signal, double down. if it's noise, temper expectations
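and a rough sketch of the CISPO-side fix as I understand it (same toy numbers as above, epsilon is illustrative, not the paper's value): clip the importance-sampling weight but detach it, so clipping bounds how hard you push without erasing the gradient for tokens like "Wait"

```python
import torch

eps_high = 0.2
advantage = torch.tensor(1.0)

logp_new = torch.tensor(-1.0, requires_grad=True)
logp_old = torch.tensor(-4.0)
ratio = torch.exp(logp_new - logp_old)

weight = torch.clamp(ratio, max=1 + eps_high).detach()  # bounded coefficient, no grad through it
loss = -weight * advantage * logp_new                   # REINFORCE-style term keeps its gradient
loss.backward()

print(logp_new.grad)    # tensor(-1.2000) -- the "Wait" discovery still updates the policy
```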
the net effect is that it can discover these "super tokens" in far fewer training cycles, leading to better models for lower cost