Bluesky Thread

Qwen3-Next-80B-A3B Base, Instruct & Thinking


- performs similarly to Qwen3-235B-A22B
- 10% the training cost of Qwen3-32B
- 10x throughput of -32B
- outperforms Gemini-2.5-flash on some benchmarks
- native MTP for speculative decoding

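The "native MTP" bullet refers to a multi-token-prediction head that drafts several tokens ahead, which a full forward pass then verifies — the standard speculative-decoding loop. A minimal sketch of that loop with toy stand-in models (`draft_model` and `target_model` are hypothetical placeholders, not Qwen APIs):

```python
# Toy sketch of the draft-then-verify loop behind speculative decoding.
# draft_model / target_model are hypothetical stand-ins for the cheap MTP
# head and the full model; tokens are just small ints here.

def draft_model(ctx):
    # cheap drafter: predicts a simple incrementing pattern
    return (ctx[-1] + 1) % 10

def target_model(ctx):
    # "expensive" target model; disagrees with the drafter on multiples of 4
    nxt = (ctx[-1] + 1) % 10
    return nxt if nxt % 4 else (nxt + 1) % 10

def speculative_decode(ctx, n_tokens, k=4):
    """Generate n_tokens: draft k tokens cheaply, verify with the target,
    keep the longest agreeing prefix plus one corrected token."""
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        # 1) draft k tokens with the cheap model
        draft = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        # 2) verify: accept until the first disagreement with the target
        accepted = []
        for i, tok in enumerate(draft):
            expect = target_model(out + draft[:i])
            if tok == expect:
                accepted.append(tok)
            else:
                accepted.append(expect)  # target's token replaces the miss
                break
        out.extend(accepted)
    return out[len(ctx):len(ctx) + n_tokens]

print(speculative_decode([0], 8))  # → [1, 2, 3, 5, 6, 7, 9, 1]
```

Each loop iteration emits several tokens for roughly one target-model pass, which is where the decode-throughput wins come from when the drafter's acceptance rate is high.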
qwen.ai/blog?id=4074...
The image compares Qwen3 model variants in terms of MMLU accuracy, training cost, and throughput performance.

Left Panel (Scatter Plot):
	•	Y-axis: MMLU accuracy (80–85).
	•	X-axis: training cost, normalized to Qwen3-32B = 100%.
	•	Models plotted:
		•	Qwen3-30B-A3B → 81.38 accuracy, 12.3% cost.
		•	Qwen3-32B → 83.61 accuracy, 100% cost.
		•	Qwen3-Next-80B-A3B → 84.72 accuracy, 9.3% cost.
	•	Arrows highlight:
		•	Better performance (green arrow): Qwen3-30B-A3B → Qwen3-Next-80B-A3B.
		•	10.7× acceleration (purple arrow): Qwen3-32B → Qwen3-Next-80B-A3B.

Right Panel (Bar Charts):
	•	Prefill throughput (32K):
		•	Qwen3-32B → ×1.0 (baseline).
		•	Qwen3-30B-A3B → ×5.2.
		•	Qwen3-Next-80B-A3B → ×10.6.
	•	Decode throughput (32K):
		•	Qwen3-32B → ×1.0 (baseline).
		•	Qwen3-30B-A3B → ×3.5.
		•	Qwen3-Next-80B-A3B → ×10.0.

This visualization highlights that Qwen3-Next-80B-A3B achieves the best MMLU accuracy (84.72), with dramatically lower training cost (only 9.3%) and far higher throughput efficiency (10× or more) compared to Qwen3-32B.
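For reference, the "10.7× acceleration" arrow is just the training-cost ratio between the two models:

```python
# Training cost is normalized to Qwen3-32B = 100%; Qwen3-Next-80B-A3B
# sits at 9.3%, so the cost ratio is:
speedup = 100 / 9.3
print(f"{speedup:.2f}x")  # ~10.75x, reported in the chart as 10.7x
```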
This bar chart compares Qwen3-Next-80B-A3B-Thinking, Gemini-2.5-Flash-Thinking, Qwen3-32B-Thinking, and Qwen3-30B-A3B-Thinking-2507 across five benchmarks.

Benchmark scores (Qwen3-Next-80B-A3B-Thinking / Gemini-2.5-Flash-Thinking / Qwen3-32B-Thinking / Qwen3-30B-A3B-Thinking-2507):
	1.	SuperGPQA: 60.8 / 57.8 / 54.1 / 56.8
	2.	AIME25: 87.8 / 72.0 / 72.9 / 85.0
	3.	LiveCodeBench v6 (25.02–25.05): 68.7 / 61.2 / 60.6 / 66.0
	4.	Arena-Hard v2: 62.3 / 56.7 / 48.4 / 56.0
	5.	LiveBench (20241125): 76.6 / 74.3 / 74.9 / 76.8

Key Takeaways:
	•	Qwen3-Next-80B-A3B-Thinking leads on SuperGPQA, AIME25, LiveCodeBench v6, and Arena-Hard v2.
	•	On LiveBench (20241125), Qwen3-30B-A3B-Thinking-2507 edges it out slightly, 76.8 vs. 76.6.
	•	Gemini-2.5-Flash-Thinking is consistently competitive but trails the Qwen3-Next model.
this really is a big deal. in agent workloads, the context jumps up fast. this architecture was designed from the ground up for that scenario
This line graph compares Prefill Throughput vs Sequence Length (Normalized) for three Qwen3 models: Qwen3-32B (gray dashed line with circles), Qwen3-30B-A3B (green dashed line with circles), and Qwen3-Next-80B-A3B (purple solid line with stars).

Axes:
	•	X-axis (horizontal): Sequence length from 4K to 128K.
	•	Y-axis (vertical): Normalized prefill throughput (relative to Qwen3-32B, which is fixed at ~1.0).

Trends:
	1.	Qwen3-32B (gray line): Flat line at ~1.0 across all sequence lengths (baseline).
	2.	Qwen3-30B-A3B (green line): Starts at ~6.5 (4K), then gradually declines to ~4.5 by 128K.
	3.	Qwen3-Next-80B-A3B (purple line): Starts at ~6.8 (4K), steadily rises with sequence length, reaching ~15.5 at 128K.

Key Takeaways:
	•	Qwen3-Next-80B-A3B demonstrates scaling efficiency, improving throughput as sequence length increases.
	•	Qwen3-30B-A3B loses throughput efficiency with longer sequences.
	•	Qwen3-32B serves as a constant baseline (1.0).

👉 This strongly reinforces the earlier results: Qwen3-Next-80B-A3B is not only faster but also scales better at longer sequence lengths, making it superior for long-context tasks.
1 hour later
hmm ngl i don’t like this Qwen either. it’s got a similar vibe that i just do not like

bsky.app/profile/timk...
Tim Kellogg @timkellogg.me
can an AI be an asshole? this one might be an asshole