Bluesky Thread

openai researcher posts on X (not a blog or paper) about a model they have that can win the International Math Olympiad

you can’t verify anything he says, but he’s totally telling the truth
Alexander Wei
@alexwei_
8/N Btw, we are releasing GPT-5 soon, and we're excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don't plan to release anything with this level of math capability for several months.
tbf he did post the work for a few cherry-picked problems

github.com/aw31/openai-...

i think the interesting part here (if true) is they used general RL, not math-specific RL, combined with test-time compute scaling
when they say they’re using test-time scaling, they’re **really** using it
Noam Brown @polynoamial
Also this model thinks for a *long* time. o1 thought for seconds. Deep Research for minutes.
This one thinks for hours. Importantly, it's also more efficient with its thinking. And there's a lot of room to push the test-time compute and efficiency further.
afaict they’re not using Lean or any other proof assistant at runtime, they say “without tools”

seems like a big hit to Gary Marcus’ belief in neurosymbolic reasoning
RL does better when you give partial credit, that was R1’s finding

this theory suggests they trained on extremely hard problems and created extremely tailored ways of giving partial credit, to incentivize good behavior, in a way that’s far too specific to scale

the innovation is scaling it
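a minimal sketch of the partial-credit idea, for anyone unfamiliar: score a proof attempt step by step instead of all-or-nothing. the function and step names here are invented for illustration, not R1's or OpenAI's actual reward

```python
# Hypothetical sketch of partial-credit reward shaping for RL on proofs.
# An all-or-nothing reward gives 0 unless the whole proof matches; partial
# credit rewards each correct step, giving the policy a gradient to follow.

def partial_credit_reward(model_steps: list[str], reference_steps: list[str]) -> float:
    """Fraction of reference proof steps the model's attempt covers."""
    if not reference_steps:
        return 0.0
    credit = sum(1 for step in model_steps if step in reference_steps)
    return credit / len(reference_steps)

attempt = ["define invariant", "base case"]
reference = ["define invariant", "base case", "inductive step", "conclude"]
# All-or-nothing would score this 0.0; partial credit scores it 0.5.
score = partial_credit_reward(attempt, reference)
```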
Teknium (e/^)
@Teknium1

My best guess:
Rubrics + LLM Judge - Atomize each point in the ground truth proof and check against the model output
My guess on how they made this scalable (before, it was not: humans had to meticulously craft the rubrics) is that they trained or did something to generate very good rubrics for each specific problem or its answer.

QT
Alexander Wei
@alexwei_ • 5h
5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
i’m not sure if that’s true — Noam said they didn’t train only on Math problems

i suppose this really is a generic method though. if you can create highly detailed and customized rubrics for math, maybe you can do it for any domain
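the rubric-plus-judge guess above can be sketched in a few lines: atomize the ground-truth proof into rubric points and check each one against the model's output. `llm_judge` here is a stand-in substring check, not a real LLM call; all names are invented for illustration

```python
# Toy sketch of rubric-based grading with an LLM judge.
# A real judge would be an LLM prompted to decide whether the proof
# semantically covers each rubric point; we fake it with substring matching.

def llm_judge(rubric_point: str, proof_text: str) -> bool:
    # Stand-in for an LLM-as-a-judge call.
    return rubric_point.lower() in proof_text.lower()

def grade_with_rubric(proof_text: str, rubric: list[str]) -> float:
    """Fraction of atomized rubric points the proof satisfies."""
    hits = [point for point in rubric if llm_judge(point, proof_text)]
    return len(hits) / len(rubric)

rubric = ["n is odd", "induction on n", "divisible by 3"]
proof = "Assume n is odd. By induction on n, the claim follows."
score = grade_with_rubric(proof, rubric)  # 2 of 3 points matched
```

the hard part the thread is pointing at isn't this loop, it's generating a good `rubric` list automatically for each new problem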
3 hours later
similar take: the verifier is the innovation, but this one says there was basically no RL (instead of all RL)

verifiers are like that. they can be used for test time reasoning, or for RL training
wh @nrehiew_ • 3h
Takeaways + (guesses):
1) This is likely a multi agent system. So it isn't a single reasoner thinking for a million tokens in one go
2) (This likely doesn't use much training compute if at all)
3) They have a general purpose verifier beyond just rule based final answer checking. (seed thinking verifier style on proofs/cot directly)
4) They think this verifier (even though likely has the LLM as a judge form) is extremely hard to hack
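the dual use of a verifier mentioned above can be made concrete: the same scoring function can pick the best of n samples at test time, or supply the reward during RL training. `verify` here is a deliberately dumb stand-in (longer answer scores higher), not a real proof checker

```python
# Sketch of one verifier serving both roles.

def verify(solution: str) -> float:
    # Stand-in scorer; a real verifier would judge the reasoning itself
    # (e.g. via rubric points or an LLM judge).
    return float(len(solution))

def best_of_n(samples: list[str]) -> str:
    # Test-time use: spend compute on n samples, keep the top-scoring one.
    return max(samples, key=verify)

def rl_reward(solution: str) -> float:
    # Training use: the same score becomes the reward signal for RL.
    return verify(solution)

winner = best_of_n(["short sketch", "a much more detailed attempt"])
```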
9 hours later
interesting example of compression = intelligence

they must have fine-tuned the IMO model to be extremely succinct
Dave
@dmvaldman • 50m
A striking thing about OpenAI's IMO gold math model is how terse it is: it really tries to express itself in single tokens, often breaking the rules of grammar and spelling to do so. They say compression is intelligence. We may be seeing a totally novel way to do compression here!
Some examples:
not divisible by3
(saves a token by not including a space "by 3")
Let w= circumcircle
(saves a token, w=circumcircle is 5 tokens where including a space on just one side makes it 4 tokens)
Need show also all terms multiple of 3. (saves a token by not pluralizing "multiple")
And it marks progress using single token words like: perfect, good, full, exactly)