Bluesky Thread

to think that o3-mini was my choice model for a long time, and now gpt-oss:20B is basically equivalent and runs on my laptop 🤯
Grouped bar chart with five subplots comparing model accuracy across different tasks and benchmarks. Each subplot shares the same x-axis: models o3, o3-mini, o4-mini, gpt-oss-120b, and gpt-oss-20b. Color-coded bars represent different models.

Top row:

1. AIME 2024 (competition math, with tools)
   - o3: 95.2%
   - o3-mini: 87.3%
   - o4-mini: 98.7%
   - gpt-oss-120b: 96.6%
   - gpt-oss-20b: 96.0%
2. AIME 2025 (competition math, with tools)
   - o3: 98.4%
   - o3-mini: 86.5%
   - o4-mini: 99.5%
   - gpt-oss-120b: 97.9%
   - gpt-oss-20b: 98.7%
3. GPQA Diamond (PhD-level science questions, without tools)
   - o3: 83.3%
   - o3-mini: 77.0%
   - o4-mini: 81.4%
   - gpt-oss-120b: 80.1%
   - gpt-oss-20b: 71.5%

Bottom row:

4. HLE (expert-level questions)
   - o3 (tools): 24.9%
   - o3-mini (no tools): 13.4%
   - o4-mini: 17.7%
   - gpt-oss-120b (tools): 19.0%
   - gpt-oss-120b (no tools): 14.9%
   - gpt-oss-20b (tools): 17.3%
   - gpt-oss-20b (no tools): 10.9%
5. MMLU (college-level exams)
   - o3: 93.4%
   - o3-mini: 87.0%
   - o4-mini: 93.0%
   - gpt-oss-120b: 90.0%
   - gpt-oss-20b: 85.3%

Overall, o4-mini performs strongest on both AIME benchmarks, while o3 leads on GPQA Diamond, MMLU, and HLE (with tools). The open-weight gpt-oss models trail the strongest model on most tasks, but only by a small margin, and both clearly outperform o3-mini on AIME.
44 likes 2 reposts
