Bluesky Thread

Opus 4.5

Now 1/3rd the cost, and SOTA in programming

As with Gemini 3 Pro, people are noting that it can see a lot deeper into tough problems. That big model smell.

www.anthropic.com/news/claude-...
A clean bar chart titled “Software engineering — SWE-bench Verified (n=500)” showing accuracy scores (in percent) for six models. Each bar is labeled and color-coded, with values displayed above the bars. From left to right:
	•	Opus 4.5 — tall orange bar at 80.9% (highest in the chart)
	•	Sonnet 4.5 — yellow bar at 77.2%
	•	Opus 4.1 — blue bar at 74.5%
	•	Gemini 3 Pro — light gray bar at 76.2%
	•	GPT-5.1-Codex-Max — light gray bar at 77.9%
	•	GPT-5.1 — light gray bar at 76.3%

The y-axis shows Accuracy (%), ranging from 70 to 82, and the x-axis lists the model names along the bottom.
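Back of the envelope on the "1/3rd the cost" line, assuming the list prices reported at launch (Opus 4.5 at $5/$25 per million input/output tokens vs Opus 4.1 at $15/$75); a rough sketch, not an official calculator:

```python
# Back-of-envelope check of the "1/3rd the cost" claim.
# Assumed list prices (USD per million tokens): Opus 4.1 at $15 in / $75 out,
# Opus 4.5 at $5 in / $25 out. Verify against the current pricing page before relying on this.
OPUS_4_1 = {"input": 15.00, "output": 75.00}
OPUS_4_5 = {"input": 5.00, "output": 25.00}

def request_cost(prices: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request at the given per-million-token prices."""
    return (prices["input"] * input_tokens + prices["output"] * output_tokens) / 1_000_000

# Hypothetical agentic-coding request: 50k prompt tokens, 8k completion tokens.
old = request_cost(OPUS_4_1, 50_000, 8_000)
new = request_cost(OPUS_4_5, 50_000, 8_000)
print(f"Opus 4.1: ${old:.2f}   Opus 4.5: ${new:.2f}   ratio: {new / old:.2f}")
# -> ratio 0.33: one third of the old price regardless of the input/output mix,
#    since both rates dropped by the same factor.
```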
Benchmarks

- they compared against Gemini 3 👍
- they showed a decent number of benchmarks
- it *actually* does well against Gemini (quick deltas sketched after the table)
A table comparing multiple AI models across a wide range of benchmarks. The Opus 4.5 column is highlighted in a light red tint. Each row lists a task category on the left, followed by performance percentages for each model: Opus 4.5, Sonnet 4.5, Opus 4.1, Gemini 3 Pro, and GPT-5.1 (with some GPT-5.1 Codex-Max scores shown in smaller text).

Rows and values:

⸻

Agentic coding — SWE-bench Verified
	•	Opus 4.5: 80.9%
	•	Sonnet 4.5: 77.2%
	•	Opus 4.1: 74.5%
	•	Gemini 3 Pro: 76.2%
	•	GPT-5.1: 76.3% (77.9% Codex-Max)

Agentic terminal coding — Terminal-bench 2.0
	•	Opus 4.5: 59.3%
	•	Sonnet 4.5: 50.0%
	•	Opus 4.1: 46.5%
	•	Gemini 3 Pro: 54.2%
	•	GPT-5.1: 47.6% (58.1% Codex-Max)

Agentic tool use — τ²-bench

Retail / Telecom results:

Retail:
	•	Opus 4.5: 88.9%
	•	Sonnet 4.5: 86.2%
	•	Opus 4.1: 86.8%
	•	Gemini 3 Pro: 85.3%
	•	GPT-5.1: —

Telecom:
	•	Opus 4.5: 98.2%
	•	Sonnet 4.5: 98.0%
	•	Opus 4.1: 71.5%
	•	Gemini 3 Pro: 98.0%
	•	GPT-5.1: —

Scaled tool use — MCP Atlas
	•	Opus 4.5: 62.3%
	•	Sonnet 4.5: 43.8%
	•	Opus 4.1: 40.9%
	•	Gemini 3 Pro: —
	•	GPT-5.1: —

Computer use — OSWorld
	•	Opus 4.5: 66.3%
	•	Sonnet 4.5: 61.4%
	•	Opus 4.1: 44.4%
	•	Gemini 3 Pro: —
	•	GPT-5.1: —

Novel problem solving — ARC-AGI-2 (Verified)
	•	Opus 4.5: 37.6%
	•	Sonnet 4.5: 13.6%
	•	Opus 4.1: —
	•	Gemini 3 Pro: 31.1%
	•	GPT-5.1: 17.6%

Graduate-level reasoning — GPQA Diamond
	•	Opus 4.5: 87.0%
	•	Sonnet 4.5: 83.4%
	•	Opus 4.1: 81.0%
	•	Gemini 3 Pro: 91.9% (highest in row)
	•	GPT-5.1: 88.1%

Visual reasoning — MMMU (validation)
	•	Opus 4.5: 80.7%
	•	Sonnet 4.5: 77.8%
	•	Opus 4.1: 77.1%
	•	Gemini 3 Pro: —
	•	GPT-5.1: 85.4% (highest in row)

Multilingual Q&A — MMMLU
	•	Opus 4.5: 90.8%
	•	Sonnet 4.5: 89.1%
	•	Opus 4.1: 89.5%
	•	Gemini 3 Pro: 91.8%
	•	GPT-5.1: 91.0%

⸻
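On the "does well against Gemini" point above: a quick sketch that diffs the rows where both models have a reported score, with the numbers copied straight from the table:

```python
# Opus 4.5 vs Gemini 3 Pro, restricted to rows where both have a reported score.
scores = {
    # benchmark: (Opus 4.5, Gemini 3 Pro), in percent, copied from the table above
    "SWE-bench Verified":   (80.9, 76.2),
    "Terminal-bench 2.0":   (59.3, 54.2),
    "τ²-bench (retail)":    (88.9, 85.3),
    "τ²-bench (telecom)":   (98.2, 98.0),
    "ARC-AGI-2 (Verified)": (37.6, 31.1),
    "GPQA Diamond":         (87.0, 91.9),
    "MMMLU":                (90.8, 91.8),
}

for name, (opus, gemini) in scores.items():
    print(f"{name:22s} {opus:5.1f} vs {gemini:5.1f}  ({opus - gemini:+.1f} pts)")
# Opus 4.5 leads on the agentic/coding rows; Gemini 3 Pro keeps the edge on GPQA Diamond and MMMLU.
```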
system card

assets.anthropic.com/m/64823ba748...

oh, high alignment and low rates of concerning behavior? sounds like bliss
Our safety evaluations found that, overall, Claude Opus 4.5 showed low rates of concerning behavior. We consider it to be our best-aligned frontier model yet, and likely the best-aligned frontier model in the AI industry to date. Nevertheless, there are many…
Anthropic has been exploring new ways of burning tokens

put differently: you can get Opus high now!
A line chart titled “Software engineering with effort controls — SWE-bench Verified (n=500)” showing how accuracy changes with output token count for two models: Opus 4.5 (orange) and Sonnet 4.5 (yellow).

Opus 4.5 (orange line with three labeled points):
	•	Low effort: ~4,000 tokens, 75% accuracy
	•	Medium effort: ~6,000 tokens, 78% accuracy
	•	High effort: ~12,000 tokens, 81% accuracy

A smooth upward-sloping line connects these three points, showing that accuracy increases as output tokens increase.

Sonnet 4.5 (yellow single point):
	•	A single dot around 22,000 tokens and ~77% accuracy, with no line connecting it.

Axes & Notes:
	•	Y-axis: Accuracy (%) from 70 to 85
	•	X-axis: Output tokens from 0 to 25,000
	•	Caption notes that measurements were done with extended thinking off, and that turning it on increases output tokens by +5.4% on average.

Legend:
	•	⬤ Opus 4.5 (orange)
	•	⬤ Sonnet 4.5 (yellow)

The chart illustrates how higher “effort” (i.e., more output tokens) boosts software-engineering accuracy, especially for Opus 4.5.
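Reading rough numbers off that chart (the token counts are eyeballed from the description above, not published figures), the marginal return on extra effort looks like this:

```python
# Approximate (output tokens, accuracy %) points read off the effort-controls chart.
opus_effort = {
    "low":    (4_000, 75.0),
    "medium": (6_000, 78.0),
    "high":   (12_000, 81.0),
}
sonnet_4_5 = (22_000, 77.0)  # single reference point from the chart

levels = list(opus_effort.items())
for (name0, (t0, a0)), (name1, (t1, a1)) in zip(levels, levels[1:]):
    gain_per_1k = (a1 - a0) / ((t1 - t0) / 1_000)
    print(f"{name0} -> {name1}: +{a1 - a0:.1f} pts for +{t1 - t0:,} tokens "
          f"(~{gain_per_1k:.1f} pts per extra 1k tokens)")

# Medium-effort Opus (~6k tokens, ~78%) already clears Sonnet 4.5's ~77% at ~22k tokens;
# the high-effort setting buys the last ~3 points at a much steeper token cost.
print(f"Sonnet 4.5 reference: {sonnet_4_5[0]:,} tokens at {sonnet_4_5[1]:.0f}%")
```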
Opus 4.5 *just barely* missed scaling concern level 4

let's dive in, what does level 4 look like? (ChatGPT summary on the right)

😳
1.2.4 Conclusions
Our determination is that Claude Opus 4.5 does not cross either the AI R&D-4 or CBRN-4 capability threshold. However, confidently ruling out these thresholds is becoming increasingly difficult. This is in part because the model is approaching or surpassing high levels of capability in our "rule-out" evaluations (early proxies of each threshold). In addition, parts of the AI R&D-4 and CBRN-4 thresholds have fundamental epistemic uncertainty or require more sophisticated forms of measurement. We are launching Claude…
1. AI R&D-4
This means the model can fully automate the work of an entry-level, remote-only AI researcher at Anthropic. Hitting this level would imply the model could do substantial autonomous AI research/dev work end-to-end, which raises big "runaway capability" and misalignment risks.

2. CBRN-4
This means the model could substantially uplift a moderately resourced state CBRN program (chemical/biological/radiological/nuclear weapons), e.g., by enabling novel weapon design, big acceleration of processes, or lowering technical barriers. Anthropic operationalizes this as uplifting a team of entry-level PhD-skill biologists toward world-class state-backed capability.