Bluesky Thread

🚨Great Paper Alert🚨 GSPO (Group Sequence Policy Optimization)

tbh it's a bit tough on the math, but it's EXCELLENT at explaining the situation

it's an RL algorithm that fixes stability problems with GRPO (R1's algo) to enable training huge models easily

arxiv.org/abs/2507.180...
Four line charts comparing GSPO (red circles) and GRPO with Routing Replay (blue squares) across training compute. Each subplot has a y-axis for performance and x-axis for training compute:

1. **Top chart: Training Reward**

   * GSPO steadily outperforms GRPO, starting at ~0.48 and reaching ~0.73, while GRPO plateaus around 0.68.
   * GSPO shows higher consistency and faster improvement.

2. **Bottom left: AIME’24**

   * GSPO starts slightly lower (~68), surpasses GRPO quickly, and peaks above 82.
   * GRPO tops out around 81.

3. **Bottom center: LiveCodeBench**

   * Both methods begin at ~55.
   * GSPO leads consistently, ending near 68 while GRPO lags slightly below.

4. **Bottom right: CodeForces**

   * GSPO starts at ~1780 and climbs steadily to ~2050.
   * GRPO starts lower (~1750) and fluctuates around 2000, ending lower than GSPO.

Overall, GSPO outperforms GRPO across all benchmarks, with better scaling and higher peak performance. Arrows on the axes indicate increasing compute and performance.
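
for the math-inclined, here's my rough sketch of the two objectives as I read the paper (notation mine: G is the group size, Â_i the group-normalized advantage, ε the clip range; go to the paper for the exact form):

```latex
% GRPO: PPO-style clipping on *token-level* importance ratios
w_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_\mathrm{old}}(y_{i,t} \mid x, y_{i,<t})}, \qquad
J_\mathrm{GRPO}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\!\big(w_{i,t}\,\hat{A}_i,\ \mathrm{clip}(w_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\big)\right]

% GSPO: one length-normalized *sequence-level* ratio per response
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\mathrm{old}}(y_i \mid x)}\right)^{1/|y_i|}, \qquad
J_\mathrm{GSPO}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\big(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\big)\right]
```

the paper's argument, as I understand it: token-level ratios accumulate noise over long sequences, and clipping one ratio per sequence sidesteps that.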
their innovation, simplified, is a lot like what Moonshot did with MuonClip (Kimi K2 pretraining & posttraining optimizer)

effectively: slow down learning, look at less data at a time, and training gets hella more stable

i imagine american labs must have discovered similar things
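
to make that concrete, here's a toy numpy sketch (mine, not the paper's code; names invented) of what sequence-level clipping looks like:

```python
import numpy as np

def gspo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Toy sketch of GSPO's sequence-level clipped objective.

    logp_new / logp_old: per-token log-probs under the current and old
    policy, one 1-D array per sampled response in the group.
    rewards: one scalar reward per response.
    """
    # group-normalized advantage: (r - mean) / std across the group
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)

    losses = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        # sequence-level ratio, length-normalized in log space:
        # s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        s = np.exp((lp_new.sum() - lp_old.sum()) / len(lp_new))
        # PPO-style clipping, but on ONE ratio for the whole sequence
        unclipped = s * a
        clipped = np.clip(s, 1.0 - clip_eps, 1.0 + clip_eps) * a
        losses.append(-min(unclipped, clipped))  # negate: we minimize
    return float(np.mean(losses))
```

the contrast with GRPO: there you'd compute np.exp(lp_new - lp_old) per token and clip each one, so a single weird token ratio can swing the whole update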
one thing they do very well with this GSPO paper is explaining (and proving!) the problems with GRPO as a means to show why GSPO works so well

it's such a phenomenal writing style. More papers should read like this. Juxtapose your thing against some other popular thing