Bluesky Thread

Opus 4.1 Released

www.anthropic.com/news/claude-...
Bar chart comparing the accuracy of two AI models on the SWE-bench Verified benchmark, labeled “Software engineering.”
	•	The y-axis is labeled “ACCURACY” and ranges from 50 to 80.
	•	The left bar (light beige) represents Opus 4 (May 2025) with an accuracy of 72.5%.
	•	The right bar (darker orange) represents Opus 4.1 (Aug 2025) with an accuracy of 74.5%.
	•	The Opus 4.1 bar is slightly taller, indicating an improvement of 2 percentage points over Opus 4.
4 hours later
Opus 4.1 seems like an incremental improvement, with small gains across most benchmarks

i’ve heard chatter that it’s a bit of a regression. i suppose that’s inevitable given only marginal improvements

still, i’m excited to push its agency a bit more
Performance comparison table of AI models across seven benchmarks. The columns represent different models: Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, OpenAI o3, and Gemini 2.5 Pro. Claude Opus 4.1 is highlighted with a red border.

Task	Claude Opus 4.1	Claude Opus 4	Claude Sonnet 4	OpenAI o3	Gemini 2.5 Pro
Agentic coding (SWE-bench Verified)	74.5%	72.5%	72.7%	69.1%	67.2%
Agentic terminal coding (Terminal-Bench)	43.3%	39.2%	35.5%	30.2%	25.3%
Graduate-level reasoning (GPQA Diamond)	80.9%	79.6%	75.4%	83.3%	86.4%
Agentic tool use (TAU-bench, Retail / Airline)	82.4% / 56.0%	81.4% / 59.6%	80.5% / 60.0%	70.4% / 52.0%	—
Multilingual Q&A (MMLU, 5x non-English avg)	89.5%	88.8%	86.5%	88.8%	—
Visual reasoning (MMMU, validation)	77.1%	76.5%	74.4%	82.9%	82.0%
High school math competition (AIME 2025)	78.0%	75.5%	70.5%	88.9%	88.0%

Notes:
	•	Claude Opus 4.1 leads in agentic coding, agentic terminal coding, and multilingual Q&A.
	•	Gemini 2.5 Pro leads in graduate-level reasoning (GPQA Diamond).
	•	OpenAI o3 has the top scores in visual reasoning (MMMU) and high school math (AIME 2025).
	•	Claude models lead in agentic tool use on both the Retail and Airline splits (Gemini 2.5 Pro not reported).
22 likes 2 reposts
