Bluesky Thread

Qwen3-4B Instruct & Thinking


uuuh, guys this isn’t a boring model

This crushes all the agentic benchmarks, even beating out the already-impressive Qwen3-30B-A3B

It’s hanging with some solid mid-sized models at only 4B
Bar chart comparing performance of four Qwen3-4B model variants—Thinking-2507 (red), Thinking (gray), Instruct-2507 (blue), and Non-Thinking (beige)—across five benchmarks:
	1.	GPQA: Thinking-2507 65.8, Thinking 62.0, Instruct-2507 55.9, Non-Thinking 41.7
	2.	AIME25: Thinking-2507 81.3, Thinking 65.6, Instruct-2507 47.4, Non-Thinking 19.1
	3.	LiveCodeBench v6 (25.02–25.05): Thinking-2507 55.2, Thinking 48.4, Instruct-2507 35.1, Non-Thinking 26.4
	4.	Arena-Hard v2: Thinking-2507 34.9, Thinking 13.7, Instruct-2507 43.4, Non-Thinking 9.5
	5.	BFCL-v3: Thinking-2507 71.2, Thinking 65.9, Instruct-2507 61.9, Non-Thinking 57.6

The Thinking-2507 variant (red bars) leads on four of the five benchmarks; the exception is Arena-Hard v2, where Instruct-2507 scores highest. A watermark-style logo is faintly visible in the background behind the bars.
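
To make the size of the jump easier to see, here is a rough sketch that tabulates the scores from the chart description and computes the gain of each 2507 variant over its older counterpart. The pairing of Instruct-2507 with Non-Thinking as its predecessor is an assumption read off the chart legend, and the names `SCORES`, `gains`, and `leader` are made up for illustration.

```python
# Scores transcribed from the chart description in this thread.
SCORES = {
    "GPQA": {"Thinking-2507": 65.8, "Thinking": 62.0,
             "Instruct-2507": 55.9, "Non-Thinking": 41.7},
    "AIME25": {"Thinking-2507": 81.3, "Thinking": 65.6,
               "Instruct-2507": 47.4, "Non-Thinking": 19.1},
    "LiveCodeBench v6": {"Thinking-2507": 55.2, "Thinking": 48.4,
                         "Instruct-2507": 35.1, "Non-Thinking": 26.4},
    "Arena-Hard v2": {"Thinking-2507": 34.9, "Thinking": 13.7,
                      "Instruct-2507": 43.4, "Non-Thinking": 9.5},
    "BFCL-v3": {"Thinking-2507": 71.2, "Thinking": 65.9,
                "Instruct-2507": 61.9, "Non-Thinking": 57.6},
}


def gains(new, old):
    """Point difference (new minus old) per benchmark, rounded to 0.1."""
    return {bench: round(s[new] - s[old], 1) for bench, s in SCORES.items()}


def leader(bench):
    """Name of the variant with the highest score on one benchmark."""
    return max(SCORES[bench], key=SCORES[bench].get)


# Assumed pairings: each 2507 release against the older variant of the same mode.
thinking_gains = gains("Thinking-2507", "Thinking")
instruct_gains = gains("Instruct-2507", "Non-Thinking")

if __name__ == "__main__":
    for bench in SCORES:
        print(f"{bench}: thinking +{thinking_gains[bench]}, "
              f"instruct +{instruct_gains[bench]}, leader: {leader(bench)}")
```

On these numbers the biggest moves are on Arena-Hard v2 and AIME25 for both variants, while BFCL-v3 (the agentic, function-calling benchmark) shows the smallest gains in absolute points.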
huggingface.co/Qwen/Qwen3-4...
Qwen/Qwen3-4B-Thinking-2507 · Hugging Face
55 likes 4 reposts
