Bluesky Thread

Sonnet 4.5

Better than Opus 4.1 on almost every benchmark

Still the classic Sonnet pricing: $3/$15 per million input/output tokens
This bar chart shows software engineering performance on SWE-bench Verified (n=500), comparing several models’ accuracy (%).

Results:
	•	Sonnet 4.5: 77.2% (base), 82.0%* with parallel test-time compute
	•	Opus 4.1: 74.5% (base), 79.4%* with parallel compute
	•	Sonnet 4: 72.7% (base), 80.2%* with parallel compute
	•	GPT-5 Codex: 74.5%
	•	GPT-5: 72.8%
	•	Gemini 2.5 Pro: 67.2%

(* indicates results with parallel test-time compute scaling; see the best-of-n sketch after the key takeaways.)

Key takeaways:
	•	Sonnet 4.5 achieves the highest overall score (82.0% with scaling).
	•	Without scaling, it still leads at 77.2%.
	•	Opus 4.1 and Sonnet 4 gain significant boosts from scaling, moving them close to Sonnet 4.5.
	•	GPT-5 Codex and GPT-5 are competitive (~73–75%), but below Sonnet/Opus.
	•	Gemini 2.5 Pro lags furthest behind at 67.2%.
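
For context on the asterisked numbers: “parallel test-time compute” generally means sampling several independent attempts and picking the best one. Here is a minimal best-of-n sketch in Python, with hypothetical generate_candidate / score_candidate stand-ins (Anthropic hasn’t published its exact selector, so treat this as the pattern, not the method):

```python
import concurrent.futures
import random

def generate_candidate(problem: str, seed: int) -> str:
    # Stand-in for one independent model attempt (hypothetical).
    rng = random.Random(seed)
    return f"candidate-{seed}-{rng.randint(0, 999)}"

def score_candidate(problem: str, candidate: str) -> float:
    # Stand-in for a selector, e.g. running the repo's test suite
    # and counting passing tests (hypothetical).
    return random.random()

def best_of_n(problem: str, n: int = 8) -> str:
    # Sample n attempts in parallel, then keep the highest-scoring one.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(
            lambda seed: generate_candidate(problem, seed), range(n)))
    return max(candidates, key=lambda c: score_candidate(problem, c))

print(best_of_n("fix the failing test in utils.py"))
```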
This table compares Claude Sonnet 4.5, Claude Opus 4.1, Claude Sonnet 4, GPT-5, and Gemini 2.5 Pro across a wide range of benchmarks.

⸻

Agentic coding (SWE-bench Verified)
	•	Claude Sonnet 4.5: 77.2% (82.0% with parallel compute)
	•	Claude Opus 4.1: 74.5% (79.4% with parallel compute)
	•	Claude Sonnet 4: 72.7% (80.2% with parallel compute)
	•	GPT-5: 72.8% (GPT-5 Codex: 74.5%)
	•	Gemini 2.5 Pro: 67.2%

⸻

Agentic terminal coding (Terminal-Bench)
	•	Claude Sonnet 4.5: 50.0%
	•	Claude Opus 4.1: 46.5%
	•	Claude Sonnet 4: 36.4%
	•	GPT-5: 43.8%
	•	Gemini 2.5 Pro: 25.3%

⸻

Agentic tool use (τ²-bench)

Retail: Sonnet 4.5 (86.2%), Opus 4.1 (86.8%), Sonnet 4 (83.8%), GPT-5 (81.1%)
Airline: Sonnet 4.5 (70.0%), Opus 4.1 (63.0%), Sonnet 4 (63.0%), GPT-5 (62.6%)
Telecom: Sonnet 4.5 (98.0%), Opus 4.1 (71.5%), Sonnet 4 (49.6%), GPT-5 (96.7%)

⸻

Computer use (OSWorld)
	•	Sonnet 4.5: 61.4%
	•	Opus 4.1: 44.4%
	•	Sonnet 4: 42.2%
	•	GPT-5: —
	•	Gemini 2.5 Pro: —

⸻

High school math (AIME 2025)

With Python: Sonnet 4.5 (100%), GPT-5 (99.6%), Gemini 2.5 Pro (88.0%)
No tools: Sonnet 4.5 (87.0%), Opus 4.1 (78.0%), Sonnet 4 (70.5%), GPT-5 (94.6%)

⸻

Graduate-level reasoning (GPQA Diamond)
	•	Sonnet 4.5: 83.4%
	•	Opus 4.1: 81.0%
	•	Sonnet 4: 76.1%
	•	GPT-5: 85.7%
	•	Gemini 2.5 Pro: 86.4%

⸻

Multilingual Q&A (MMMLU)
	•	Sonnet 4.5: 89.1%
	•	Opus 4.1: 89.5%
	•	Sonnet 4: 86.5%
	•	GPT-5: 89.4%
	•	Gemini 2.5 Pro: —

⸻

Visual reasoning (MMMU validation)
	•	Sonnet 4.5: 77.8%
	•	Opus 4.1: 77.1%
	•	Sonnet 4: 74.4%
	•	GPT-5: 84.2%
	•	Gemini 2.5 Pro: 82.0%

⸻

Financial analysis (Finance Agent)
	•	Sonnet 4.5: 55.3%
	•	Opus 4.1: 50.9%
	•	Sonnet 4: 44.5%
	•	GPT-5: 46.9%
	•	Gemini 2.5 Pro: 29.4%

⸻

Key insights
	•	Claude Sonnet 4.5 dominates in coding (SWE-bench, Terminal-Bench, τ²-bench), computer use, and finance.
	•	GPT-5 is very strong in math (no tools), visual reasoning, and GPQA Diamond.
	•	Gemini 2.5 Pro underperforms overall, but is competitive in graduate-level reasoning (GPQA Diamond) and visual reasoning (MMMU).
karashiiro @karashiiro.moe
They were close, but not close enough www.anthropic.com/news/claude-...
A bigger deal — Claude Agent SDK

Ever wanted Claude Code, but for your domain? Well, now you have it, fully

www.anthropic.com/engineering/...
Building agents with the Claude Agent SDK
How to get started with the Claude Agent SDK and best practices for using it.
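
A minimal sketch of calling it from Python, assuming the claude-agent-sdk package’s documented query() entry point (requires an Anthropic API key; the exact message shapes may differ by version):

```python
import asyncio
from claude_agent_sdk import query  # pip install claude-agent-sdk

async def main():
    # query() streams the agent's messages (text, tool use, results)
    # as it works, much like watching a Claude Code session.
    async for message in query(prompt="Summarize the open TODOs in this repo"):
        print(message)

asyncio.run(main())
```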
Memory Editing API

It automatically garbage collects parts of the context that haven’t been used in a while. (The pic looks like compaction, but the blog says removal; idk.)

yooooo! this is what i want from a mechinterp-obsessed lab. Nice!

www.anthropic.com/news/context...
This diagram illustrates context editing for tool-using language models.

⸻

Before context editing (top bar)
	•	The sequence shows Tool Use 1 → Tool Result 1 → … → Tool Use N → Tool Result N → Tool Use N+1 → Tool Result N+1 …
	•	All tool calls and their results are preserved in the context window, consuming available space.
	•	This limits the room left for new context or reasoning.

⸻

After context editing (bottom bar)
	•	Earlier Tool Use / Tool Result pairs are compressed, summarized, or pruned.
	•	The context retains recent tool interactions (N, N+1, …) in detail.
	•	Freed-up space is marked as “Available Context” (green).

⸻

Key insight

Context editing makes the model more efficient by reducing redundant past information and preserving only the most relevant or summarized traces of tool use. This creates more space for new instructions, reasoning, or tool calls.

It’s essentially a memory optimization strategy for long-running agentic workflows.
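
A rough illustration of that pruning pattern in plain Python, using a hypothetical message list and keep_last_n policy (this sketches the idea in the diagram, not Anthropic’s actual API):

```python
from typing import TypedDict

class Message(TypedDict):
    kind: str      # "tool_use", "tool_result", or "text"
    content: str

def prune_tool_history(messages: list[Message],
                       keep_last_n: int = 2) -> list[Message]:
    # Indices of all tool-related messages, oldest first.
    tool_idx = [i for i, m in enumerate(messages)
                if m["kind"] in ("tool_use", "tool_result")]
    # Keep only the most recent n tool_use/tool_result pairs (2n messages);
    # everything older is garbage-collected, freeing context space.
    keep = set(tool_idx[-2 * keep_last_n:])
    return [m for i, m in enumerate(messages)
            if m["kind"] not in ("tool_use", "tool_result") or i in keep]
```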
Imagine with Claude

generative UI? like for real? as the user clicks buttons the UI manifests itself 🤯

youtu.be/dGiqrsv530Y
An experimental new way to design software
YouTube video by Anthropic
ooo, good to note. GPT-5-Codex still beats it in some areas

but there's no point in using Opus anymore
- It's 5x cheaper than Opus: same pricing as the old Sonnet 4, so there's basically no reason to keep using Opus in the API. Sonnet all day.
- It's still not as good as GPT-5 Codex for tricky production PRs: it runs much faster, but in my testing GPT-5 Codex beat Sonnet 4.5 at code review, catching hard-to-find production issues that Sonnet missed.