Bluesky Thread

a researcher on X explains why RL alone didn’t work before

January 26, 2025

it mostly comes down to today’s base models being smarter and better at exploration

GSM8K simply isn’t a hard enough test to grow interesting emergent behavior

x.com/its_dibya/st...

this implies that today’s base models aren’t smart enough for tomorrow’s emergent behavior

we may need that distillation loop between reasoning models -> new pretraining data in order to elicit higher levels of behavior from RL
