🚨New Thinking Machines post🚨
i’m sorry, but you can’t skip TM blog posts, those are the rules
this one is a phenomenal description of the strengths & weaknesses of RL vs SFT (supervised fine tuning)
thinkingmachines.ai/blog/on-poli...
great analogy: chess
RL is like learning chess without any coaching
SFT is like only watching a grandmaster. You’re supposed to learn from game states that a novice will never see!
if only there was a compromise..
(btw when we say “RL has sparse rewards”, that’s what we mean)
it’s called on-policy distillation
e.g. in chess, the instructor would rate all your moves on a scale from “blunder” to “brilliant”
so now, you can lose a match and still learn A LOT
dense rewards
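roughly, in toy code (the moves and numbers are made up, just to show sparse vs dense feedback):

```python
# toy contrast: sparse vs dense credit for one chess game
# (illustrative only; in practice a teacher/engine would provide the ratings)
my_moves = ["e4", "Nf3", "Bc4", "Nc3", "d4"]       # a hypothetical game i lost

# plain RL: one sparse reward at the very end
sparse_feedback = [0, 0, 0, 0, -1]                  # only "you lost"

# on-policy distillation: every move rated, "blunder" -> "brilliant"
dense_feedback = [+0.3, +0.1, -0.2, -1.5, +0.4]     # lots to learn from a loss
```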
in LLM training, the student (the model being trained) generates tokens, e.g. while attempting an RL task, and the teacher model rates every token in the sequence
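a minimal sketch of what one training step could look like (assuming HF-style causal LMs and a per-token reverse-KL loss; the names and hyperparameters here are mine, not from the post):

```python
# minimal sketch of an on-policy distillation step
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, max_new_tokens=64):
    # 1) student acts on-policy: it samples its own continuation
    with torch.no_grad():
        rollout = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                   do_sample=True)

    # 2) teacher "rates every move": compare distributions at each position
    student_logits = student(rollout).logits[:, :-1]        # predicts tokens 1..T-1
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits[:, :-1]

    # only score the tokens the student generated, not the prompt
    start = prompt_ids.shape[1] - 1
    s_logp = F.log_softmax(student_logits[:, start:], dim=-1)
    t_logp = F.log_softmax(teacher_logits[:, start:], dim=-1)

    # 3) dense signal: per-token reverse KL(student || teacher),
    #    so every token gets feedback even in a "lost game"
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
    return loss
```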
how does this work?
if you’ve seen speculative decoding then you’re there already
in both cases, the “draft” model is the cheap small one. in SD the big model just checks the draft’s tokens at inference time and falls back to its own when they’re rejected, but in on-policy distillation the larger teacher’s per-token feedback is used to update the student’s weights
research.google/blog/looking...
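for contrast, the SD side in toy form (a simplified accept/reject step along the lines of the speculative sampling papers; note that nothing here ever trains the draft):

```python
# simplified speculative-decoding accept step (illustrative sketch)
import torch

def accept_or_resample(p_big, q_draft, proposed_token):
    # p_big, q_draft: probability vectors over the vocab at this position
    # the draft proposed a token; the big model either keeps it or overrides it.
    # no weights are updated -- that's the key difference from on-policy distillation
    accept_prob = torch.clamp(p_big[proposed_token] / q_draft[proposed_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return proposed_token
    # on rejection, resample from the (normalized) leftover big-model mass
    residual = torch.clamp(p_big - q_draft, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```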