Bluesky Thread

these graphs are nuts

bf16 has basically been the only way large-model training has been done for nearly a decade

all this year, tons of resources have been dumped into RL, and this is saying most of that was wasted because we chose the “cool” float format
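a quick aside on why the format choice matters at all: fp16 spends its 16 bits as 1 sign + 5 exponent + 10 mantissa, while bf16 spends them as 1 sign + 8 exponent + 7 mantissa, so bf16 rounds roughly 8x more coarsely in exchange for fp32-like range. a toy PyTorch sketch (my illustration, not from the paper) makes the gap visible:

import torch

# bf16 trades mantissa bits for exponent range:
#   float16:  1 sign + 5 exponent + 10 mantissa bits (eps = 2**-10)
#   bfloat16: 1 sign + 8 exponent +  7 mantissa bits (eps = 2**-7)
print(torch.finfo(torch.float16).eps)    # 0.0009765625
print(torch.finfo(torch.bfloat16).eps)   # 0.0078125, ~8x coarser rounding

# round-trip a value through each format to see the rounding error
x = torch.tensor(1.2345678, dtype=torch.float32)
for dtype in (torch.float16, torch.bfloat16):
    err = (x - x.to(dtype).float()).abs().item()
    print(dtype, "round-trip error:", err)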
Tim Kellogg @timkellogg.me
astonishing: using fp16 instead of bf16 results in more stable training runs as well as a smaller performance gap between training & inference

this is critical for RL, which is mostly inference and very sensitive to reproducible results
Image description: A 3×4 grid of line charts comparing BF16 (blue) and FP16 (green) across various training setups.
Each subplot shows Training Steps (x-axis) vs. a performance metric (y-axis, 0.0–1.0).

Top row:
(a) Sanity GRPO – FP16 consistently outperforms BF16; smooth upward trend to ~0.95.
(b) Sanity GRPO-Token-TIS – similar behavior; FP16 steadier and higher.
(c) Sanity GRPO-Seq-MIS – FP16 reaches ~0.95; BF16 lags below 0.85.
(d) Sanity GSPO – FP16 again higher and smoother; BF16 converges slower.

Middle row:
(e) Sanity PG-Seq-IS and (f) Sanity PG-Seq-MIS – both show FP16 > BF16; FP16 reaches near 1.0.
(g) OctoThinker GRPO – FP16 stable near 1.0; BF16 spikes early, then collapses.
(h) LoRA GRPO-Token-TIS – FP16 stable around 0.8; BF16 fluctuates sharply and drops after ~800 steps.

Bottom row:
(i) MoE GRPO-Seq-MIS, (j) MoE GRPO-Token-TIS, and (k) MoE PG-Seq-TIS – the two precisions stay close; FP16 slightly faster early on.
(l) Dense-14B DAPO – both curves rise smoothly; FP16 maintains a small advantage.

In nearly all cases, FP16 (green) trains more stably and converges to higher performance than BF16 (blue).
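rough intuition for the train/inference gap those curves track: the trainer and the inference engine evaluate the same policy in slightly different numerics, so the per-token importance ratio pi_train / pi_infer drifts away from 1 even when nothing changed on-policy. a toy PyTorch sketch (my illustration, not the paper's code) of how much each format drifts relative to an fp32 reference; in reality the mismatch comes from different kernels running in the same dtype, but coarser rounding amplifies it:

import torch

torch.manual_seed(0)
logits = torch.randn(4, 32000)  # 4 token positions over a toy 32k vocab

def probs(logits, dtype):
    # softmax computed after rounding the logits to the given dtype,
    # standing in for "the same policy evaluated in lower precision"
    return torch.softmax(logits.to(dtype).float(), dim=-1)

reference = probs(logits, torch.float32)
for dtype in (torch.float16, torch.bfloat16):
    ratio = probs(logits, dtype) / reference   # importance ratio vs fp32
    drift = (ratio - 1.0).abs().max().item()
    print(dtype, "max |ratio - 1|:", drift)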
there are researchers on X like

r1: “can this even be replicated??”

r2: “yeah, trivially, look here..”