Bluesky Thread

here it is:

* benchmarks: tough competition with Sonnet-4
* 256K context, expandable to 1M with YaRN (sketch below)

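For reference, a minimal sketch of how YaRN-style context extension is usually switched on for a Hugging Face checkpoint by overriding `rope_scaling` in the config; the scaling factor and window sizes below are illustrative assumptions, not Qwen's published settings:

```python
# Sketch: stretching a ~256K-native context toward ~1M with YaRN by overriding
# rope_scaling at load time. Field names follow the usual transformers
# convention; the factor and original window size are assumptions.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-Coder-480B-A35B-Instruct"  # name from the post below

cfg = AutoConfig.from_pretrained(model_id)
cfg.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                                # 256K x 4 ≈ 1M tokens (assumed)
    "original_max_position_embeddings": 262144,   # assumed native 256K window
}
model = AutoModelForCausalLM.from_pretrained(model_id, config=cfg)
```
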
there’s also a CLI forked from gemini-cli

qwenlm.github.io/blog/qwen3-c...
A terminal-style performance comparison table highlights the benchmark results for multiple models across three categories: Agentic Coding, Agentic Browser Use, and Agentic Tool Use. The title “QWEN3-CODER” is shown in large pixel-style font. The focal model, Qwen3-Coder 480B-A35B-Instruct, is shaded in orange for emphasis.

Agentic Coding Benchmarks (Top Section)
* Qwen3-Coder scores highest or close to highest across nearly all agentic coding tasks:
  * Terminal-Bench: 37.5
  * SWE-bench Verified: 69.6
  * w/ OpenHands, 500 turns: 67.0
  * w/ Private Scaffolding: 61.8
  * Spider2: 31.1
* Other models, like Claude Sonnet-4 (70.4 Terminal-Bench, 68.0 SWE-bench Verified) and GPT-4.1 (63.8 SWE-bench Live), show strong performance on individual tasks.

Agentic Browser Use Benchmarks (Middle Section)
* Qwen3-Coder leads with:
  * WebArena: 49.9
  * Mind2Web: 55.8
* Again outperforming other models, including GPT-4.1 and Claude Sonnet-4.

Agentic Tool Use Benchmarks (Bottom Section)
* Qwen3-Coder tops in:
  * BFCL-v3: 68.7
  * TAU-Bench Retail: 77.5
  * TAU-Bench Airline: 60.0
* Competing strongly with Claude Sonnet-4 (BFCL-v3: 73.3, Retail: 80.5).

All scores are numerical performance metrics (exact units not specified), and the model list includes both open and proprietary models. Graphical elements like a gradient top border and dark background give a retro terminal/arcade vibe. The terminal bar shows battery and network activity.
Tim Kellogg @timkellogg.me
Alibaba is likely to drop “Qwen3-Coder-480B-A35B-Instruct” in a few hours

in the i18n files for their mobile app they describe it as:

"a powerful coding-specialized language model excelling in code generation, tool use, and agentic tasks”

should come “tonight” (relative to China)
on the inside:

* shallower than Qwen3 (62 vs 94 layers)
* more experts (160 vs 128, in direction of K2)
* more attention heads (96 vs 64, opposite of K2)

curious what the thinking is there… (config-check sketch below)
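
Once the weights land, those counts are easy to sanity-check from the released config. A minimal sketch, assuming a standard Hugging Face config.json with the usual Qwen MoE field names (which may differ in the actual release):

```python
# Sketch: read the released config and compare against the counts quoted above.
# Assumes standard Qwen MoE config field names; they may differ in the release.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")
print("layers:", cfg.num_hidden_layers)                # 62 per this post
print("attention heads:", cfg.num_attention_heads)     # 96 per this post
print("experts:", getattr(cfg, "num_experts", "n/a"))  # 160 per this post
```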