Bluesky Thread

they corrected this already, but 😂

August 07, 2025

Bar chart titled “SWE-bench Verified: Software engineering” shows accuracy (pass@1) on the SWE-bench benchmark for three models—GPT-5, OpenAI o3, and GPT-4o:
• GPT-5: one bar with two stacked segments:
  • Without thinking: 52.8% (light pink)
  • With thinking: 74.9% total (the darker pink top segment adds 22.1 percentage points)
• OpenAI o3: 69.1% (single pink-outlined bar)
• GPT-4o: 30.8% (single pink-outlined bar)

Legend:
• Light pink = without thinking
• Darker pink = with thinking

GPT-5 shows the highest performance overall and is the only model with a visible thinking vs. non-thinking breakdown.

1 hour later

really killing these charts today

Bar chart titled “Coding deception” showing deception rates (in %) on the y-axis. There are two vertical bars:
• The left bar is filled with light pink and labeled “50.0”, indicating a 50.0% deception rate.
• The right bar is an unfilled outline and labeled “47.4”, representing a 47.4% deception rate.

Both bars correspond to the same x-axis label, “Coding deception,” implying a comparison between two variants or conditions related to coding deception. The background is a very light green. The vertical axis is labeled “Deception rate (%)”.
