Bluesky Thread

RLP: Reinforcement Learning in Pre-Training

an NVIDIA paper explores using dense verifier-free RL in pretraining

this feels significant. Everything else I’ve been seeing is moving in the other direction; this one says do more in pre-training

dense rewards change things

research.nvidia.com/labs/adlr/RLP/
The image is a diagram and explanation titled “How does RLP work?”

It is split into three main sections:

⸻

Left: Overview of RLP (diagram inside a dashed box)
	•	An input text fragment x_{<t} goes into two paths:
	1.	a No-Think baseline (blue box), which provides a baseline distribution p_{EMA}(\cdot | x_{<t});
	2.	a Thought Policy (yellow box), which samples chain-of-thought traces (z_1, \ldots, z_G) and then predicts the next token.
	•	The outputs are compared: the prediction conditioned on a chain-of-thought, p_\theta(\cdot | x_{<t}, z_i), versus the no-think baseline.
	•	This comparison yields the Information Gain Reward, a dense, non-binary reward per trace, (r_1, \ldots, r_G).

At the bottom, a label says: “CoT trace + next token.”

⸻

Middle: Next Token Prediction

A blue box with a plain model output:

“Photosynthesis is the process plants, algae, and some bacteria use to make their own food using sunlight”
Here, “sunlight” is in bold red.

A simple diagram of a plant, sun, and arrows illustrates photosynthesis.

⸻

Right: RLP

A green box showing the model with explicit chain-of-thought reasoning:

“Photosynthesis is the process plants, algae, and some bacteria use to make their own food using”
(reasoning, shown in red) The sentence describes how plants, algae, and bacteria make food. Common knowledge says this process relies on energy from the sun. So the next token is most likely “sunlight.”
(predicted next token) sunlight

This highlights that with RLP, the model generates internal reasoning (in red) before predicting the next token.

⸻

Bottom Caption

Figure 2: Visualization of the RLP framework.
A chain-of-thought is sampled before next-token prediction. Rewards are computed by contrasting the predictor conditioned on the CoT with a No-think EMA baseline, yielding a verifier-free, dense signal.
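Reading the caption together with the left panel, the per-trace reward looks like the log-probability gain on the observed next token x_t from conditioning on a sampled thought z_i, relative to the no-think EMA baseline (my paraphrase, in the figure’s notation):

r_i = \log p_\theta(x_t | x_{<t}, z_i) - \log p_{EMA}(x_t | x_{<t}), \quad i = 1, \ldots, G

A positive r_i means the thought genuinely helped predict the token that actually came next, which is what makes the signal dense and verifier-free.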
it works too. look at this

isolated impact of RLP

it also seems to hold and maybe even improve with increased model size
The bar chart compares average accuracies between two setups: Base (light blue) and Base+RLP (dark green).
	•	Math: Base 61, Base+RLP 65
	•	Science: Base 35, Base+RLP 57
	•	Science Pass@1[4]: Base 33, Base+RLP 61
	•	Overall: Base 47, Base+RLP 63

Observation:
Across all categories (Math, Science, Science Pass@1[4], Overall), applying RLP significantly improves accuracy. The largest gain appears in Science Pass@1[4], jumping from 33 to 61. The smallest but still positive gain is in Math, from 61 to 65.
how does it work?

it’s still next-token prediction, but the model is allowed to break out of response mode into thinking mode just to predict a single next token

dense rewards: again, it’s still next-token prediction, but the model can earn extra reward for CoT tokens that improve its prediction of the actual next token over the no-think baseline
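Here is a minimal Python sketch of that reward, the way I read the figure: score the true next token with and without each sampled thought, take the log-prob gain over the EMA no-think baseline as the dense reward, and center within the group of G thoughts to get advantages. The function names, toy probabilities, and the group-mean baseline are my own assumptions for illustration, not the paper’s code.

import math

def information_gain_rewards(logp_with_cot, logp_no_think):
    # dense per-thought reward: log p_theta(x_t | x_<t, z_i) - log p_EMA(x_t | x_<t)
    return [lp - logp_no_think for lp in logp_with_cot]

def group_relative_advantages(rewards):
    # center rewards within the group of G sampled thoughts (my assumption for the baseline)
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# toy numbers: G = 4 thoughts sampled at one position where the true next token is "sunlight"
logp_no_think = math.log(0.20)                                   # EMA baseline prob of "sunlight"
logp_with_cot = [math.log(p) for p in (0.55, 0.40, 0.18, 0.60)]  # prob of "sunlight" after each thought

rewards = information_gain_rewards(logp_with_cot, logp_no_think)
advantages = group_relative_advantages(rewards)
for i, (r, a) in enumerate(zip(rewards, advantages)):
    print(f"thought {i}: reward={r:+.3f}  advantage={a:+.3f}")

Thoughts that raise the probability of the token that really came next get positive advantage, so a policy-gradient update would push up those CoT tokens, while the usual next-token prediction on the document text stays in place.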