Bluesky Thread

they corrected this already, but 😂

Bar chart titled “SWE-bench Verified: Software engineering” shows accuracy (pass@1) on the SWE-bench benchmark for three models—GPT-5, OpenAI o3, and GPT-4o:
	•	GPT-5 has two stacked segments:
		•	Without thinking: 52.8% (light pink)
		•	With thinking: 74.9% total (the darker pink top segment adds 22.1 percentage points)
	•	OpenAI o3: 69.1% (single pink-outlined bar)
	•	GPT-4o: 30.8% (single pink-outlined bar)

Legend:
	•	Light pink = without thinking
	•	Darker pink = with thinking

GPT-5 shows the highest performance overall and is the only model with a visible thinking vs. non-thinking breakdown.
1 hour later
really killing these charts today
Bar chart titled “Coding deception” showing deception rates (in %) on the y-axis. There are two vertical bars:
	•	The left bar is filled with light pink and labeled “50.0”, indicating a 50.0% deception rate.
	•	The right bar is an unfilled outline and labeled “47.4”, representing a 47.4% deception rate.

Both bars correspond to the same x-axis label, “Coding deception,” implying a comparison between two variants or conditions related to coding deception. The background is a very light green. The vertical axis is labeled “Deception rate (%)”.
21 likes 0 reposts
