Why do LLMs fail at long horizon tasks?
Because of errors in execution, not planning
i.e. an error made early on conditions the LLM into a bad state later. Newer agentic LLMs handle early errors better
arxiv.org/abs/2509.09677
hypothesis: we’re confusing planning & execution
experiment: separate them by providing the plan
result:
- GPT-5 runs for 1000+ steps
- Sonnet 4: 432
- Grok 4: 384
- Gemini 2.5 Pro: 120
- R1: 120
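the decoupling idea can be sketched as a toy simulation. everything here is hypothetical (the error rates, the harness, the exact setup in the paper differs): the point is just that removing per-step planning errors lets runs survive longer before the first mistake.

```python
import random

def run_agent(task_steps, plan=None, err_plan=0.02, err_exec=0.01, seed=0):
    """Count steps completed before the agent's first error.

    If `plan` is None the agent must plan each step itself, adding a
    hypothetical per-step planning error rate on top of the execution
    error rate. Providing a plan leaves only execution errors.
    """
    rng = random.Random(seed)
    p_fail = err_exec if plan is not None else err_exec + err_plan
    for step in range(task_steps):
        if rng.random() < p_fail:
            return step  # first mistake ends the run
    return task_steps

# average run length over 200 trials, with and without a provided plan
coupled = sum(run_agent(1000, plan=None, seed=s) for s in range(200)) / 200
decoupled = sum(run_agent(1000, plan=["step"] * 1000, seed=s) for s in range(200)) / 200
```

with the same seed, the plan-provided run always lasts at least as long, since its failure events are a subset of the coupled run's.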
Why does it feel like we’re hitting a wall with LLMs?
Because we’re looking at the wrong metrics
If you look only at single step accuracy, we indeed have hit a wall (even regressed in some areas)
But if you look at long horizon tasks (the stuff that matters imo), we’re just getting started
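the intuition can be put as a one-line toy model, assuming (unrealistically) independent per-step errors: whole-task success is per-step accuracy raised to the task length, so tiny single-step gains compound into large long-horizon gains.

```python
# toy model: whole-task success = (per-step accuracy) ** (number of steps)
def task_success(step_acc: float, n_steps: int) -> float:
    return step_acc ** n_steps

# near-identical single-step accuracy, wildly different 1000-step success
for p in (0.99, 0.995, 0.999):
    print(f"step acc {p}: 1000-step success {task_success(p, 1000):.4f}")
```

at 1000 steps, moving single-step accuracy from 0.99 to 0.999 takes task success from roughly zero to about 37% — which is why long-horizon benchmarks keep improving even when single-step metrics look flat.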
cheers to open source AI releasing tons of model sizes so we can observe stuff like this
Reasoning models do better
this makes sense to me. Reasoning models have been known to correct or look past mistakes
GPT-5 is naturally going to knock this out of the park because they really doubled down on test-time compute
my post from July:
timkellogg.me/blog/2025/07...
curiously, i’ve wondered in the past whether text diffusion models similarly look past mistakes, since they don’t generate autoregressively
i’d love to see how a diffusion model does on this benchmark
timkellogg.me/blog/2025/02...