a researcher on X explains why RL alone didn’t work before
it mostly comes down to today's base models being smarter and better at exploration
GSM8K simply isn’t a hard enough test to grow interesting emergent behavior
x.com/its_dibya/st...
this implies that today’s base models aren’t smart enough for tomorrow’s emergent behavior
we may need that distillation loop between reasoning models -> new pretraining data in order to elicit higher levels of behavior from RL