🚨New Thinking Machines post🚨
i’m sorry, but you can’t skip TM blog posts, those are the rules
this one is a phenomenal description of the strengths & weaknesses of RL vs SFT (supervised fine tuning)
thinkingmachines.ai/blog/on-poli...
great analogy: chess
RL is like learning chess without any coaching
SFT is like only watching a grandmaster. You’re supposed to learn from game states that a novice will never see!
if only there was a compromise..
(btw when we say “RL has sparse rewards”, that’s what we mean)
it’s called on-policy distillation
e.g. in chess, the instructor would rate all your moves on a scale from “blunder” to “brilliant”
so now, you can lose a match and still learn A LOT
dense rewards
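roughly, in toy code (the moves and numbers are made up, just to show sparse vs dense feedback):

```python
# toy contrast: sparse vs dense credit for one chess game
# (illustrative only; in practice a teacher/engine would provide the ratings)
my_moves = ["e4", "Nf3", "Bc4", "Nc3", "d4"]       # a hypothetical game i lost

# plain RL: one sparse reward at the very end
sparse_feedback = [0, 0, 0, 0, -1]                  # only "you lost"

# on-policy distillation: every move rated, "blunder" -> "brilliant"
dense_feedback = [+0.3, +0.1, -0.2, -1.5, +0.4]     # lots to learn from a loss
```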
in LLM training, the student (the model being trained) generates tokens, e.g. while attempting an RL task, and the teacher model rates every token in the sequence
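a minimal sketch of what one training step could look like (assuming HF-style causal LMs and a per-token reverse-KL loss; the names and hyperparameters here are mine, not from the post):

```python
# minimal sketch of an on-policy distillation step
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, max_new_tokens=64):
    # 1) student acts on-policy: it samples its own continuation
    with torch.no_grad():
        rollout = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                   do_sample=True)

    # 2) teacher "rates every move": compare distributions at each position
    student_logits = student(rollout).logits[:, :-1]        # predicts tokens 1..T-1
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits[:, :-1]

    # only score the tokens the student generated, not the prompt
    start = prompt_ids.shape[1] - 1
    s_logp = F.log_softmax(student_logits[:, start:], dim=-1)
    t_logp = F.log_softmax(teacher_logits[:, start:], dim=-1)

    # 3) dense signal: per-token reverse KL(student || teacher),
    #    so every token gets feedback even in a "lost game"
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
    return loss
```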
how does this work?
if you’ve seen speculative decoding then you’re there already
in both cases, the “draft” model is the cheap small one. in SD the big model just checks the draft’s tokens at inference time and falls back to its own when they’re rejected, but in on-policy distillation the larger teacher’s per-token feedback is used to update the student’s weights
research.google/blog/looking...
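for contrast, the SD side in toy form (a simplified accept/reject step along the lines of the speculative sampling papers; note that nothing here ever trains the draft):

```python
# simplified speculative-decoding accept step (illustrative sketch)
import torch

def accept_or_resample(p_big, q_draft, proposed_token):
    # p_big, q_draft: probability vectors over the vocab at this position
    # the draft proposed a token; the big model either keeps it or overrides it.
    # no weights are updated -- that's the key difference from on-policy distillation
    accept_prob = torch.clamp(p_big[proposed_token] / q_draft[proposed_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return proposed_token
    # on rejection, resample from the (normalized) leftover big-model mass
    residual = torch.clamp(p_big - q_draft, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```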