here it is:
* benchmarks: tough competition with Sonnet-4
* 256K context, expandable to 1M with YaRN (config sketch after the link below)
there’s also a CLI forked from gemini-cli
qwenlm.github.io/blog/qwen3-c...
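a minimal sketch of how that YaRN extension is typically wired up in Hugging Face transformers — the model id, the 4x factor, and the exact `rope_scaling` fields are assumptions here, not from the post; check the blog/model card for the real values:

```python
# Sketch: stretching the 256K native window toward ~1M via YaRN rope scaling.
# Assumptions (not from the post): the model id, the 4x factor, and that the
# model uses the standard `rope_scaling` config mechanism in transformers.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-Coder-480B-A35B-Instruct"  # assumed id

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # 256K * 4 ≈ 1M tokens
    "original_max_position_embeddings": 262144,  # native 256K window
}
config.max_position_embeddings = 1_048_576

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```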
on the inside:
* shallower than Qwen3 (62 vs 94 layers)
* more experts (160 vs 128, in the direction of K2)
* more attention heads (96 vs 64, the opposite of K2)
curious what the thinking is here.. (sketch for checking these dims below)
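if you want to verify those dims yourself, a minimal sketch — the field names assume the standard Hugging Face config keys for Qwen-style MoE models, so they may differ:

```python
# Sketch: reading the architecture dims above from the published config.
# Field names (num_hidden_layers, num_experts, num_attention_heads) are the
# usual HF keys for Qwen-style MoE models — an assumption, not from the post.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")  # assumed id
print(cfg.num_hidden_layers)    # depth: the post says 62 (vs 94 in Qwen3)
print(cfg.num_experts)          # MoE experts: 160 (vs 128)
print(cfg.num_attention_heads)  # attention heads: 96 (vs 64)
```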