RLP: Reinforcement Learning in Pre-Training
An NVIDIA paper explores using dense, verifier-free RL during pre-training.
This feels significant. Everything else I've been seeing moves in the other direction; RLP instead does more in pre-training.
Dense rewards change things.
research.nvidia.com/labs/adlr/RLP/
It works, too. Look at this:
isolated impact of RLP
It also seems to hold, and maybe even improve, with increasing model size.
How does it work?
It's still next-token prediction, but the model is allowed to break out of response mode into thinking mode, just for a single token.
Dense rewards: again, still next-token prediction, but the model can earn extra reward for CoT tokens that yield better predictive power over the next token.
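A toy sketch of how that dense reward could look, as I read it: the reward for a thought is how much more likely the true next token becomes when the model conditions on the thought, compared with a no-thought baseline. Function names and numbers below are illustrative, not from the paper.

```python
import math

def information_gain_reward(p_next_with_cot: float,
                            p_next_without_cot: float) -> float:
    """Dense per-token reward for a chain-of-thought (sketch).

    Positive when conditioning on the thought makes the true next
    token more likely than the no-thought baseline, negative when
    the thought hurts prediction. Probabilities are hypothetical.
    """
    return math.log(p_next_with_cot) - math.log(p_next_without_cot)

# Example: a helpful thought raises the true token's probability
# from 0.05 to 0.20, so the reward is positive; an unhelpful one
# that lowers it to 0.04 earns a negative reward.
r_good = information_gain_reward(0.20, 0.05)
r_bad = information_gain_reward(0.04, 0.05)
print(f"helpful thought reward: {r_good:.3f}")
print(f"unhelpful thought reward: {r_bad:.3f}")
```

Because the signal is a log-probability difference at every position, every next token in the pre-training stream provides feedback, with no external verifier needed.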