Bluesky Thread

the top 2 ARC entries are by individuals

here, Eric Pang breaks down how he added memory to avoid recomputing learned lessons

ctpang.substack.com/p/arc-agi-2-...
The chart is titled “ARC-AGI LEADERBOARD” and plots Score (%) on the y-axis (0–100%) against Cost per Task ($) on the x-axis (log scale, $1e-3 → $1K).

At the very top near 90–100% score:
	•	Human Panel (grey dot) sits around 95–100% accuracy at very high cost (far right side, >$100).

Just below the Human Panel, in the 70–80% score range:
	•	J. Berman (2025) (orange triangle) and E. Pang (2025) (orange circle) sit in this band, further right in cost (approx $10–$100).
	•	Grok 4 (Thinking) (magenta triangle) reaches around 70% accuracy, sitting higher than most models, at mid-to-high cost ($1–$10).

In the 60–70% band near the top left/mid:
	•	o3-Pro (Medium) (blue circle) scores around 65%, with costs around $1–$10.
	•	o4-mini (High) (blue circle) also scores 65%, at slightly lower cost ($1).

A bit lower but still clustered near the upper half:
	•	Gemini 2.5 Pro (Thinking 16K) (green dot) around 55–60% score, with moderate cost (~$0.10–$1).
	•	Claude Opus 4 (Thinking 16K) (red dot) 55–60% score at higher cost ($1–$10).

Summary:
	•	Highest performers: Human Panel (~95–100%), Grok 4 (Thinking) (~70%), o3-Pro and o4-mini (~65%).
	•	Notable human entries: J. Berman and E. Pang (2025) scoring 70–80%.
	•	Strong models at moderate costs: Gemini 2.5 Pro (Thinking 16K) and Claude Opus 4 (Thinking 16K).
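
Going back to the memory idea in the post above: a minimal sketch, assuming lessons are stored as plain text keyed by task ID and prepended to later prompts. All names here are hypothetical, not taken from Pang's code.

```python
import json
from pathlib import Path

# Hypothetical on-disk store of lessons learned per task, so later runs can
# reload them instead of re-deriving the same insights from scratch.
MEMORY_PATH = Path("lessons.json")

def load_lessons() -> dict:
    """Return {task_id: [lesson, ...]} from disk, or an empty store."""
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return {}

def save_lesson(task_id: str, lesson: str) -> None:
    """Append a newly learned lesson for a task and persist the store."""
    lessons = load_lessons()
    lessons.setdefault(task_id, []).append(lesson)
    MEMORY_PATH.write_text(json.dumps(lessons, indent=2))

def build_prompt(task_id: str, task_text: str) -> str:
    """Prepend any previously learned lessons to the solver prompt."""
    prior = load_lessons().get(task_id, [])
    bullet_list = "\n".join(f"- {lesson}" for lesson in prior)
    return f"Lessons from earlier attempts:\n{bullet_list}\n\nTask:\n{task_text}"
```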
Eric tried several LLMs and went with the one that worked best, Grok 4 Thinking
This table compares models on ARC-AGI-1 and ARC-AGI-2 benchmarks, showing both accuracy scores and cost per task.
	•	Efficient Evolutionary Program Synthesis: ARC-AGI-1 77.1% ($2.56/task), ARC-AGI-2 26.0% ($3.97/task). Highest performance, but expensive.
	•	Grok-4 (Thinking): ARC-AGI-1 66.7% ($1.01/task), ARC-AGI-2 16.0% ($2.17/task). Strong results at mid cost.
	•	GPT-5 (High): ARC-AGI-1 65.7% ($0.51/task), ARC-AGI-2 9.9% ($0.73/task). Good accuracy at the lowest cost among the high performers.
	•	Claude Opus 4 (Thinking 16K): ARC-AGI-1 35.7% ($1.25/task), ARC-AGI-2 8.6% ($1.93/task). Lower scores at higher cost.
	•	Claude Sonnet 4: ARC-AGI-1 40.0% ($0.37/task), ARC-AGI-2 5.9% ($0.49/task). Cheapest, but weakest scores.

Summary:
	•	Best raw performance: Efficient Evolutionary Program Synthesis (77.1%).
	•	Best cost-performance balance: GPT-5 (High).
	•	Grok-4 (Thinking) sits between them in the tradeoff.
	•	Claude Opus 4 and Sonnet 4 underperform in both score and efficiency.
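
The selection step Eric describes (try several LLMs, keep the one that scores best) can be reproduced with a small harness along these lines; `solve(model, task)` and the task format are assumptions for illustration, not part of his write-up.

```python
# Hypothetical harness for the model-selection step: run each candidate model
# over the same task set, then keep the one with the best accuracy.
def compare_models(models, tasks, solve):
    """solve(model, task) -> (answer, cost) stands in for the real API call."""
    results = {}
    for model in models:
        correct, total_cost = 0, 0.0
        for task in tasks:
            answer, cost = solve(model, task)
            correct += int(answer == task["expected"])
            total_cost += cost
        results[model] = {
            "accuracy": correct / len(tasks),
            "cost_per_task": total_cost / len(tasks),
        }
    best = max(results, key=lambda m: results[m]["accuracy"])
    return best, results
```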
Jeremy Berman has the top submission (but more expensive), also Grok 4. His write-up here

it works by spawning subagents to solve sub-problems in parallel

both of these are open source

jeremyberman.substack.com/p/how-i-got-...
How I got the highest score on ARC-AGI again, swapping Python for English
Using Multi-Agent Collaboration with Evolutionary Test-Time Compute
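
A minimal sketch of that fan-out pattern, assuming one async call per sub-problem; the function names are illustrative, not Berman's.

```python
import asyncio

async def ask_subagent(sub_problem: str) -> str:
    """Placeholder: hand one sub-problem to a model and return its answer."""
    await asyncio.sleep(0)  # stands in for an API call
    return f"answer to: {sub_problem}"

async def solve_in_parallel(sub_problems):
    """Spawn one subagent per sub-problem and wait for all of them."""
    return await asyncio.gather(*(ask_subagent(p) for p in sub_problems))

# Example: split a task into parts, solve them concurrently, then combine.
answers = asyncio.run(solve_in_parallel(["part 1", "part 2", "part 3"]))
print(answers)
```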
3 hours later
correction: Berman's submission was a synthesis of Anthropic, Grok, OpenAI, DeepSeek & Gemini models
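
One simple way to synthesize candidates from several providers is majority voting over their answers; this is an assumption for illustration, not necessarily how Berman's pipeline combines them, and each provider is reduced to a plain callable.

```python
from collections import Counter

def synthesize(task: str, providers) -> str:
    """Ask each provider-backed callable (Anthropic, Grok, OpenAI, DeepSeek,
    Gemini would each be one) for a candidate answer; return the most common."""
    candidates = [ask(task) for ask in providers]
    winner, _ = Counter(candidates).most_common(1)[0]
    return winner
```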