Bluesky Thread

Opus 4.1 Released

www.anthropic.com/news/claude-...
Bar chart comparing the accuracy of two AI models on the SWE-bench Verified benchmark, labeled “Software engineering.”
	•	The y-axis is labeled “ACCURACY” and ranges from 50 to 80.
	•	The left bar (light beige) represents Opus 4 (May 2025) with an accuracy of 72.5%.
	•	The right bar (darker orange) represents Opus 4.1 (Aug 2025) with an accuracy of 74.5%.
	•	The Opus 4.1 bar is slightly taller, indicating an improvement of 2 percentage points over Opus 4.
4 hours later
Opus 4.1 seems like an incremental improvement, with small gains across most benchmarks

i’ve heard chatter that it’s a bit of a regression. i suppose that’s inevitable given only marginal improvements

still, i’m excited to push its agency a bit more
Performance comparison table of AI models across seven benchmarks. The columns represent different models: Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, OpenAI o3, and Gemini 2.5 Pro. Claude Opus 4.1 is highlighted with a red border.

Task	Claude Opus 4.1	Claude Opus 4	Claude Sonnet 4	OpenAI o3	Gemini 2.5 Pro
Agentic coding (SWE-bench Verified)	74.5%	72.5%	72.7%	69.1%	67.2%
Agentic terminal coding (Terminal-Bench)	43.3%	39.2%	35.5%	30.2%	25.3%
Graduate-level reasoning (GPQA Diamond)	80.9%	79.6%	75.4%	83.3%	86.4%
Agentic tool use (TAU-bench, Retail / Airline)	82.4% / 56.0%	81.4% / 59.6%	80.5% / 60.0%	70.4% / 52.0%	—
Multilingual Q&A (MMLU, 5x non-English avg)	89.5%	88.8%	86.5%	88.8%	—
Visual reasoning (MMMU, validation)	77.1%	76.5%	74.4%	82.9%	82.0%
High school math competition (AIME 2025)	78.0%	75.5%	70.5%	88.9%	88.0%

Notes:
	•	Claude Opus 4.1 leads in agentic coding, agentic terminal coding, and multilingual Q&A.
	•	Gemini 2.5 Pro leads in graduate-level reasoning (GPQA Diamond).
	•	OpenAI o3 has the top scores in visual reasoning (MMMU) and high school math (AIME 2025).
	•	Claude models lead in agentic tool use on both the Retail and Airline splits (Gemini 2.5 Pro not reported).
22 likes 2 reposts
