my dumbass version of this thread:
MiniMax M1 introduced a new RL algorithm, CISPO, that does all the post-training on their huge 456B model for ~$500k instead of ... idk, millions?
it all hinges on realizing when the model found something worth learning and quickly doubling down
if you recall, in the s1 paper they discovered that the "Wait" token was unreasonably good for forcing a new line of thought
the models naturally learn Wait and other tokens, and when to use them to change things up and try something new (or Therefore, to double down)
timkellogg.me/blog/2025/02...
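quick toy sketch of how that trick looks in practice (this is not the s1 repo's code — the model name and prompt are just placeholders I picked; s1 fine-tuned a Qwen2.5 model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder model for illustration only
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: what is 17 * 24? Think step by step.\nAnswer:"
ids = tok(prompt, return_tensors="pt").input_ids

# first pass: let the model reason until it decides it's done
out = model.generate(ids, max_new_tokens=256)

# budget forcing: append "Wait" to the finished trace and generate again,
# which nudges the model into re-checking its own chain of thought
forced = tok.decode(out[0], skip_special_tokens=True) + " Wait"
ids2 = tok(forced, return_tensors="pt").input_ids
out2 = model.generate(ids2, max_new_tokens=256)

print(tok.decode(out2[0], skip_special_tokens=True))
```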
the thing is, overfitting is a huge problem in ML, and the chief way to avoid it is to slow down the learning rate
so if the model discovers "Wait" or "Therefore" in one batch, it avoids learning too much bc that would cause overfitting
but those are disproportionately valuable lessons...
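toy pytorch illustration of that guard in action (not MiniMax's code, numbers made up): in a PPO/GRPO-style clipped objective, a token like "Wait" that was rare under the old policy but useful now blows past the clip range, and the clip zeroes out its gradient

```python
import torch

eps = 0.2
advantage = torch.tensor(1.0)        # the batch says this token helped

logp_new = torch.tensor(-1.0, requires_grad=True)  # new policy: "Wait" now likely
logp_old = torch.tensor(-4.0)                      # old policy: "Wait" was rare
ratio = torch.exp(logp_new - logp_old)             # ~20x, way past 1 + eps

clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
loss = -torch.min(ratio * advantage, clipped * advantage)
loss.backward()

print(ratio.item())     # ~20.1
print(logp_new.grad)    # tensor(0.) -- the valuable lesson gets thrown away
```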
CISPO has a slightly different take:
GRPO is already doing these in batches. if there are several samples in a batch that all discover "Wait", then scale the amount of learning proportionally
if there's signal, double down. if it's noise, temper expectations
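and a rough sketch of the CISPO-side fix as I understand it (same toy numbers as above, epsilon is illustrative, not the paper's value): clip the importance-sampling weight but detach it, so clipping bounds how hard you push without erasing the gradient for tokens like "Wait"

```python
import torch

eps_high = 0.2
advantage = torch.tensor(1.0)

logp_new = torch.tensor(-1.0, requires_grad=True)
logp_old = torch.tensor(-4.0)
ratio = torch.exp(logp_new - logp_old)

weight = torch.clamp(ratio, max=1 + eps_high).detach()  # bounded coefficient, no grad through it
loss = -weight * advantage * logp_new                   # REINFORCE-style term keeps its gradient
loss.backward()

print(logp_new.grad)    # tensor(-1.2000) -- the "Wait" discovery still updates the policy
```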
the net effect is that it can discover these "super tokens" in far fewer training cycles, leading to better models for lower cost