Bluesky Thread

yesssss! a small update to Qwen3-30B-A3B

this has been one of my favorite local models, and now we get an even better version!

better instruction following, tool use & coding. Nice small MoE!

huggingface.co/Qwen/Qwen3-3...
Bar chart comparing five models across five benchmarks: GPQA, AIME25, LiveCodeBench v6, Arena-Hard v2, and BFCL-v3 (highest score per benchmark is bolded in the image):

Benchmark	Qwen3-30B-A3B-Instruct-2507	Qwen3-30B-A3B Non-thinking	Qwen3-235B-A22B Non-thinking	Gemini-2.5-Flash Non-thinking	GPT-4o-0327
GPQA	70.4	54.8	62.9	78.3	66.9
AIME25	61.3	21.6	24.7	61.6	66.7
LiveCodeBench v6	43.2	29.0	32.9	40.1	35.8
Arena-Hard v2	69.0	24.8	52.0	58.3	61.9
BFCL-v3	65.1	58.6	68.0	64.1	66.5

Note: The new Instruct-2507 model consistently outperforms the older Qwen3-30B-A3B non-thinking checkpoint. GPT-4o and Gemini-2.5-Flash also show strong overall results. Arena-Hard v2 uses GPT-4.1 as evaluator.
sadly, i have to kill off an existing model in order to load this one. which should go?
NAME                                                 ID              SIZE      MODIFIED
nomic-embed-text:latest                              0a109f422b47    274 MB    6 days ago
gemma3n:latest                                       15cb39fd9394    7.5 GB    4 weeks ago
hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:BF16    ac117f48692e    16 GB     2 months ago
qwen3:30b                                            2ee832bc15b5    18 GB     3 months ago
gemma3:27b                                           30ddded7fba6    17 GB     4 months ago
qwen2.5:latest                                       845dbda0ea48    4.7 GB    4 months ago
deepseek-r1:32b                                      38056bbcbb2d    19 GB     5 months ago
deepseek-r1:70b                                      0c1615a8ca32    42 GB     5 months ago
deepseek-r1:latest                                   0a8c26691023    4.7 GB    5 months ago
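
A minimal sketch of how the swap might look with the ollama CLI, using ollama rm, ollama pull, and ollama list. The choice of deepseek-r1:70b is just the largest entry in the listing above, and the tag for the new 2507 checkpoint is an assumption; check ollama.com or the Hugging Face page linked above before pulling.

# deepseek-r1:70b is the biggest local model above (42 GB); removing it frees the most room
ollama rm deepseek-r1:70b

# pull the updated checkpoint -- this tag is an assumption, verify it exists first
ollama pull qwen3:30b-a3b-instruct-2507

# confirm what is installed after the swap
ollama list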
2 hours later
left: old checkpoint
right: new checkpoint
Benchmark table comparing model performance across 10 tasks. The models evaluated are:
	•	Qwen3-30B-A3B (MoE)
	•	QwQ-32B
	•	Qwen3-4B (Dense)
	•	Qwen2.5-72B-Instruct
	•	Gemma3-27B-IT
	•	DeepSeek-V3
	•	GPT-4o (2024-11-20)

Metrics by Task:

Benchmark	Qwen3-30B-A3B	QwQ-32B	Qwen3-4B	Qwen2.5-72B	Gemma3-27B	DeepSeek-V3	GPT-4o
ArenaHard	91.0	89.5	76.6	81.2	86.8	85.5	85.3
AIME’24	80.4	79.5	73.8	18.9	32.6	39.2	11.1
AIME’25	70.9	69.5	65.6	15.0	24.0	28.8	7.6
LiveCodeBench	62.6	62.7	54.2	30.7	26.9	33.1	32.7
CodeForces (Elo)	1974	1982	1671	859	1063	1134	864
GPQA	65.8	65.6	55.9	49.0	42.4	59.1	46.0
LiveBench	74.3	72.0	63.6	51.4	49.2	60.5	52.2
BFCL v3	69.1	66.4	65.9	63.4	59.1	57.6	72.5
MultiIF (8 Langs)	72.2	68.3	66.3	65.3	69.8	55.6	65.6

Highlights:
	•	Qwen3-30B-A3B (MoE) leads in most benchmarks including AIME’24, AIME’25, LiveBench, and ArenaHard.
	•	QwQ-32B slightly outperforms in CodeForces Elo and LiveCodeBench.
	•	GPT-4o has the top BFCL score and decent scores on MultiIF and ArenaHard.
	•	Gemma3-27B-IT and Qwen2.5-72B-Instruct underperform in AIME tasks.
	•	A note at the bottom clarifies that AIME scores are averaged over multiple runs and that thinking mode was disabled for some models for efficiency.
Right image: the same bar chart as in the first post, comparing Qwen3-30B-A3B-Instruct-2507 against the non-thinking baselines on GPQA, AIME25, LiveCodeBench v6, Arena-Hard v2, and BFCL-v3 (identical figure and scores; see above).
me: "if you're flying over the desert in a canoe and your wheels fall off, how many pancakes does it take to cover a doghouse?"

qwen: "It takes exactly as many pancakes as the number of wheels you *wish* you had on your canoe."

i've never gotten that answer from an LLM before 🤯