DeepSeek-Math-V2: self-verification
Fascinating paper that explores how to do RL with the focus on process over outcome
It’s sort of similar to a GAN, but with a training loop each for the generator & verifier, plus an outer loop around both
github.com/deepseek-ai/...
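To make the loop structure concrete, here's a toy sketch of "two inner loops plus an outer loop". This is not the paper's algorithm: `ToyModel`, `skill`, and the update rules are all hypothetical stand-ins, and the only thing it tries to show is the shape, where the generator can only get as good as the verifier that grades it.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    # Hypothetical scalar standing in for "how good the weights are" (0..1).
    skill: float = 0.0


def train_verifier(verifier: ToyModel, steps: int, lr: float = 0.1) -> None:
    # Inner loop 1: the verifier gets better at grading proofs.
    # Assumption: it drifts toward a perfect grader (skill = 1.0).
    for _ in range(steps):
        verifier.skill += lr * (1.0 - verifier.skill)


def train_generator(generator: ToyModel, verifier: ToyModel,
                    steps: int, lr: float = 0.1) -> None:
    # Inner loop 2: the generator is rewarded by the verifier, so in this toy
    # it can only move toward the verifier's current skill, never past it.
    for _ in range(steps):
        generator.skill += lr * (verifier.skill - generator.skill)


def outer_loop(rounds: int, inner_steps: int) -> list[tuple[float, float]]:
    # Outer loop: alternate the two inner loops so each side keeps improving
    # against a stronger counterpart.
    generator, verifier = ToyModel(), ToyModel()
    history = []
    for _ in range(rounds):
        train_verifier(verifier, inner_steps)
        train_generator(generator, verifier, inner_steps)
        history.append((generator.skill, verifier.skill))
    return history


if __name__ == "__main__":
    for gen_skill, ver_skill in outer_loop(rounds=3, inner_steps=5):
        print(f"generator={gen_skill:.3f} verifier={ver_skill:.3f}")
```

The alternation is the point: a better verifier raises the ceiling for the generator, which is roughly the GAN-like dynamic described above.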
the end result is a model that knows the **process** for finding correct results
they argue that this is how you solve unsolved problems — you refine the process itself
OpenAI’s IMO model is probably doing something along these lines: a general self-verification process
a key innovation here is the inclusion of a meta-verifier
it’s an anchor, so its weights aren’t updated. And its job isn’t really a hard one: it doesn’t spot mistakes, it only spots bullshit like reward hacking
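A rough sketch of how a frozen meta-verifier could gate the verifier's reward. Everything here is an assumption for illustration: `meta_verifier_ok` is a hypothetical keyword heuristic standing in for a frozen model, and the reward formula is made up rather than taken from the paper.

```python
def meta_verifier_ok(critique: str) -> bool:
    # Hypothetical frozen check (never trained). It does not judge whether the
    # proof is right; it only asks whether the critique points at a concrete
    # step instead of hand-waving its way to a score.
    return "step" in critique.lower()


def verifier_reward(predicted_score: float, reference_score: float,
                    critique: str) -> float:
    # Assumed reward: if the meta-verifier flags the critique as
    # unsubstantiated, the verifier gets nothing, however close its score is.
    if not meta_verifier_ok(critique):
        return 0.0
    # Otherwise reward closeness to the reference score (both in 0..1).
    return 1.0 - abs(predicted_score - reference_score)


if __name__ == "__main__":
    print(verifier_reward(1.0, 1.0, "Step 3 divides by zero"))
    print(verifier_reward(1.0, 1.0, "looks wrong to me"))
```

Since the gate is frozen, the verifier can't co-adapt with it, which is what makes it work as an anchor.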