Bluesky Thread

DeepSeek is shipping a theorem prover (automate math proofs)

no paper yet, but word is they used MCTS, which would be surprising bc one of my big takeaways from the R1 paper was that MCTS didn’t work and RL alone was enough

huggingface.co/deepseek-ai/...
MCTS = Monte Carlo Tree Search = test-time compute that searches a tree of possible answers and uses a second "reward model" to verify or rate results
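
a toy sketch of that loop, for the curious — `propose_next_steps` and `score_with_reward_model` are made-up stand-ins for the policy and reward models, not anything from this release:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # partial proof / answer prefix
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def ucb(self, c=1.4):
        # unvisited nodes get explored first
        if self.visits == 0:
            return float("inf")
        exploit = self.total_reward / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def propose_next_steps(state):
    # stand-in for sampling candidate next steps from the policy model
    return [state + step for step in ("a", "b", "c")]

def score_with_reward_model(state):
    # stand-in for a learned reward model rating a (partial) answer
    return random.random()

def mcts(root_state, iterations=100):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. selection: walk down by UCB until we hit a leaf
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # 2. expansion: add candidate continuations to a visited leaf
        if node.visits > 0:
            node.children = [Node(s, node) for s in propose_next_steps(node.state)]
            node = node.children[0]
        # 3. evaluation: the reward model plays the rollout's role
        reward = score_with_reward_model(node.state)
        # 4. backpropagation: update stats on the path back to the root
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state

print(mcts("proof: "))
```

the second model scoring every node is the expensive part; that's the "test-time compute" trade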

RL = reinforcement learning = post-training that teaches the model to do chain-of-thought reasoning on its own
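
and the contrast, again as a toy: REINFORCE against a verifier signal, no tree and no second model at inference. all stand-ins, not the R1 recipe:

```python
import math
import random

# tabular "policy": preference scores over 3 candidate proof tactics
logits = [0.0, 0.0, 0.0]

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def verifier_reward(action):
    # stand-in for a proof checker: only tactic 2 actually works
    return 1.0 if action == 2 else 0.0

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    action = random.choices(range(3), weights=probs)[0]
    reward = verifier_reward(action)
    # REINFORCE update: gradient of log pi(action) w.r.t. each logit
    for a in range(3):
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += lr * reward * grad

print(softmax(logits))  # probability mass concentrates on tactic 2
```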
same general config as V3: huggingface.co/deepseek-ai/...

training script: huggingface.co/deepseek-ai/...

i don’t see MCTS here, but i might have missed it
the model weights are there, in the open, but so huge i don’t stand a chance at running them
previous paper that likely explains how the verifier part of the MCTS flow works

bsky.app/profile/timk...
Tim Kellogg @timkellogg.me
🚨New DeepSeek Model Incoming🚨

but first they release the paper describing generative reward modeling (GRM) via Self-Principled Critique Tuning (SPCT)

looking forward to DeepSeek-GRM!

arxiv.org/abs/2504.02495
[image: Figure 1, "Inference-time scaling performance with different RMs on all tested RM benchmarks": performance (66.5 to 72.5) vs. k sampled rewards, log scale from 1 to 32. DeepSeek-GRM-27B (MetaRM@k) climbs steeply to ~72.5 and DeepSeek-GRM-27B (Voting@k) peaks just above 70.5, both passing GPT-4o (Greedy) at just under 71; other scalar/voting baselines, including LLM-as-a-Judge w/ TokenProb, Skywork-Reward-Gemma-2-27B, and DeepSeek-BTRM-27B, plateau between ~66.5 and ~68.5. Results go up to 8 samples, extrapolated to 32 for the DeepSeek models; non-italic models use Gemma-2-27B as their base.]
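
rough sketch of what voting@k means in that plot: sample k reward judgments and aggregate. `sample_reward` is a fake stand-in for a generative RM, not DeepSeek-GRM:

```python
import random
import statistics

def sample_reward(response):
    # stand-in: a generative RM writes a critique, then emits a 1-10 score;
    # faked here as noise around a hidden "true" quality
    true_quality = {"answer_a": 7.0, "answer_b": 5.0}[response]
    return min(10.0, max(1.0, random.gauss(true_quality, 2.0)))

def voting_at_k(response, k=8):
    # averaging k sampled judgments cuts the variance, which is why
    # the curves in figure 1 keep climbing as k grows
    return statistics.mean(sample_reward(response) for _ in range(k))

best = max(["answer_a", "answer_b"], key=lambda r: voting_at_k(r, k=32))
print(best)  # almost always "answer_a" at k=32
```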