Qwen Tongyi Deep Research
a 32B model that beats other SOTA deep-research agents on many benchmarks
tongyi-agent.github.io/blog/introdu...
- fully synthetic
- continued pre-training (mid-training) on agent traces
i’ve been saying this! data! data! data!
this is for dataset construction 🤯
from github: they describe how they synthesize successful single-step traces, then combine them into successful multi-step traces
🤯
generally, long traces = higher quality
they can synthesize their way into the highest quality
github.com/Alibaba-NLP/...
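a minimal sketch of the stitching idea as i read it: verified single-step traces (query → tool call → result) get chained so each step's result feeds forward into the next step's context, producing one long multi-step trace. all names and the data shape here are my own illustrative assumptions, not the repo's actual code.

```python
from dataclasses import dataclass

@dataclass
class Step:
    # a verified single-step trace: one query, one tool call, one result
    query: str
    tool_call: str
    result: str

def stitch(steps: list[Step]) -> list[dict]:
    """Chain verified single-step traces into one multi-step trace,
    carrying each step's result forward as context for the next query."""
    multi_step = []
    context = ""
    for step in steps:
        multi_step.append({
            "query": (context + " " + step.query).strip(),
            "tool_call": step.tool_call,
            "result": step.result,
        })
        context = step.result  # next step conditions on this answer
    return multi_step

steps = [
    Step("Who wrote the paper?", "search('paper author')", "Alice"),
    Step("What lab is the author in?", "search('Alice lab')", "Lab X"),
]
trace = stitch(steps)
```

since every individual step was already verified, the stitched multi-step trace stays correct by construction, which is how the trace length (and quality) can be scaled up synthetically.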
btw i fucked up the labeling earlier. this is a 30B-A3B, so yeah, extremely compact and cheap