DeepSeek 3.2
2 new models:
* 3.2: an open-weights GPT-5-High competitor that’s fully agentic
* 3.2-Speciale: a maxxed-out version of 3.2 that achieves IMO Gold. Currently API-only with no tools
so that’s 2 DeepSeek models that achieve IMO Gold
huggingface.co/deepseek-ai/...
Their main contributions, as they declare in the tech report:
1. DSA: linear-cost attention
2. a scalable RL framework
3. large scale agentic task synthesis pipeline
DSA was introduced in the 3.2-Exp tech report, which is the direct predecessor to 3.2
github.com/deepseek-ai/...
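The DSA details live in the 3.2-Exp report, but the basic idea of select-then-attend sparse attention is easy to illustrate. A minimal numpy sketch (function names and the fixed `k` are my own, and this scores all keys per query — DeepSeek uses a cheap indexer to pick candidates, which is what actually buys the near-linear cost):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, K, V, k=4):
    """For one query, attend only to the k highest-scoring keys.

    This sketch only shows the select-then-attend structure:
    softmax runs over k entries instead of all n.
    """
    scores = q @ K.T / np.sqrt(K.shape[1])  # (n,) similarity scores
    idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k keys
    w = softmax(scores[idx])                # softmax over k, not n
    return w @ V[idx]                       # weighted sum of k values -> (d,)

rng = np.random.default_rng(0)
n, d = 16, 8
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = topk_sparse_attention(q, K, V, k=4)
print(out.shape)  # (8,)
```

With `k = n` this reduces exactly to dense attention, which is a handy sanity check.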
3.2-Speciale is NOT open weights, or at least not yet. They don’t give a good reason, other than noting that they’re “supporting the community and research”
In the tech report, they note that 3.2-Speciale has a reduced length penalty in RL
So I’m not clear why they’re withholding it
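The report doesn’t spell out the penalty’s exact form — the shape and coefficients below are my assumptions — but a common construction subtracts a term that grows with response length, and “reduced” just means a smaller coefficient:

```python
def shaped_reward(task_reward: float, n_tokens: int,
                  max_tokens: int = 32768, penalty_coef: float = 0.1) -> float:
    """Hypothetical length-penalized RL reward: subtract a term
    proportional to how much of the token budget the response used.
    A 'reduced length penalty' (as described for 3.2-Speciale) would
    mean a smaller penalty_coef, letting the policy spend more tokens
    on hard problems -- at the cost of verbosity.
    """
    return task_reward - penalty_coef * (n_tokens / max_tokens)

# Same correct answer: the shorter response scores higher,
# and the gap shrinks as penalty_coef shrinks.
print(shaped_reward(1.0, 4000))   # ≈ 0.988
print(shaped_reward(1.0, 30000))  # ≈ 0.908
```

That trade-off lines up with the verbosity they later report for 3.2-Speciale.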
they discuss RL environments — there still isn’t much published publicly on RL envs, so this is much appreciated
note: this is different from the original K2, where they trained on thousands of synthetic MCP tools
RL envs are built for repeatability
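“Built for repeatability” roughly means every rollout can be replayed bit-for-bit. A minimal sketch of that property (hypothetical interface, not DeepSeek’s): all randomness flows from the seed passed to `reset()`, unlike live MCP tools, whose responses can drift between calls:

```python
import random

class SeededEnv:
    """Toy repeatable RL environment: the seed fully determines the
    episode, so any trajectory can be replayed exactly for debugging
    or reward verification."""

    def reset(self, seed: int) -> int:
        self.rng = random.Random(seed)          # all randomness comes from here
        self.state = self.rng.randint(0, 100)
        return self.state

    def step(self, action: int):
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randint(0, 100)   # deterministic given the seed
        return self.state, reward

env = SeededEnv()
a = env.reset(seed=42)
b = env.reset(seed=42)
print(a == b)  # True: identical seed, identical initial state
```

Two runs with the same seed and the same actions produce identical trajectories, which is exactly what a live tool backend can’t guarantee.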
oh! I’ve never seen a tech report do this before
comparison of scores & token utilization
they note that 3.2-Speciale is too verbose for real deployment and that this is a significant area for future research
in the conclusion they note limitations
PRE-TRAINING IS NOT DEAD
their strategy for a V4 seems to center on scaling up pre-training compute, and doing it more efficiently and densely