Bluesky Thread

to think that o3-mini was my choice model for a long time, and now gpt-oss:20B is basically equivalent and runs on my laptop 🤯
Grouped bar chart with five subplots comparing model accuracy across different tasks and benchmarks. Each subplot shares the same x-axis: models o3, o3-mini, o4-mini, gpt-oss-120b, and gpt-oss-20b. Color-coded bars represent different models.

Top row:

1. AIME 2024 (competition math, with tools)
   - o3: 95.2%
   - o3-mini: 87.3%
   - o4-mini: 98.7%
   - gpt-oss-120b: 96.6%
   - gpt-oss-20b: 96.0%
2. AIME 2025 (competition math, with tools)
   - o3: 98.4%
   - o3-mini: 86.5%
   - o4-mini: 99.5%
   - gpt-oss-120b: 97.9%
   - gpt-oss-20b: 98.7%
3. GPQA Diamond (PhD-level science questions, without tools)
   - o3: 83.3%
   - o3-mini: 77.0%
   - o4-mini: 81.4%
   - gpt-oss-120b: 80.1%
   - gpt-oss-20b: 71.5%

Bottom row:

4. HLE (expert-level questions)
   - o3 (tools): 24.9%
   - o3-mini (no tools): 13.4%
   - o4-mini: 17.7%
   - gpt-oss-120b (tools): 19.0%
   - gpt-oss-120b (no tools): 14.9%
   - gpt-oss-20b (tools): 17.3%
   - gpt-oss-20b (no tools): 10.9%
5. MMLU (college-level exams)
   - o3: 93.4%
   - o3-mini: 87.0%
   - o4-mini: 93.0%
   - gpt-oss-120b: 90.0%
   - gpt-oss-20b: 85.3%

Overall, o4-mini performs strongest on both AIME benchmarks, while o3 leads on GPQA Diamond, MMLU, and HLE (with tools). The open-weight gpt-oss models trail the strongest model on most tasks, but only by a small margin, and both clearly outperform o3-mini on AIME.
44 likes 2 reposts
