Bluesky Thread

Why do LLMs fail at long horizon tasks?

Because of errors in execution, not planning

i.e. an error made early on conditions the LLM into a bad state later. Newer agentic LLMs deal with early errors better

arxiv.org/abs/2509.09677
The image is a diagram explaining how models self-condition on their errors and progressively take worse steps.

⸻

Title:

Models Self-Condition On Their Errors, Taking Worse Steps

⸻

Top Graph:
	•	Y-axis: Step Accuracy
	•	X-axis: Task Length
	•	Green line (Expected): A flat line showing that accuracy should remain steady over task length.
	•	Red line (Observed): Starts near the green line but declines steadily downward as task length increases, showing reduced accuracy.

⸻

Bottom Left (Correct Execution):
	•	Execution History: A sequence of green checkmarks.
	•	Instruction: “Add 56 and -92” inside a blue speech bubble.
	•	Robot icon: Wearing sunglasses and smiling (indicating success).
	•	Output: Green box with calculation 56 + -92 = -36 and a green checkmark.

⸻

Bottom Right (Error Propagation):
	•	Execution History: A sequence of mostly red X marks, with one green check.
	•	Instruction: “Add 56 and -92” inside a blue speech bubble.
	•	Robot icon: Dizzy-faced with spirals in its eyes (indicating confusion or error).
	•	Output: Pink box with incorrect calculation 56 + -92 = -24 and a red X.

⸻

The diagram highlights how execution history impacts accuracy: correct histories support good outcomes, while errors accumulate and lead to worse steps over time.
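A rough way to see what the diagram is claiming is to simulate it. In the sketch below, BASE_ACC and CONDITIONED_ACC are made-up numbers for illustration, not values from the paper; the only point is that once an error enters the execution history, every later step gets harder, so average step accuracy slopes downward with task length even though no individual step is inherently harder.

```python
import random

# Toy simulation of self-conditioning (illustrative only; the numbers
# below are assumptions, not taken from the paper).
BASE_ACC = 0.99         # per-step accuracy with a clean history
CONDITIONED_ACC = 0.80  # per-step accuracy once an error sits in the context

def run_task(n_steps: int) -> list[bool]:
    """Simulate one task; each step succeeds with a probability that
    drops after the first mistake appears in the execution history."""
    history = []
    made_error = False
    for _ in range(n_steps):
        acc = CONDITIONED_ACC if made_error else BASE_ACC
        ok = random.random() < acc
        history.append(ok)
        if not ok:
            made_error = True
    return history

def step_accuracy_by_position(n_tasks: int = 10_000, n_steps: int = 100):
    """Average accuracy at each step index across many simulated tasks.
    Under self-conditioning this curve slopes downward (the red
    'Observed' line), unlike the flat green 'Expected' line."""
    totals = [0] * n_steps
    for _ in range(n_tasks):
        for i, ok in enumerate(run_task(n_steps)):
            totals[i] += ok
    return [t / n_tasks for t in totals]

if __name__ == "__main__":
    curve = step_accuracy_by_position()
    for i in (0, 9, 49, 99):
        print(f"step {i + 1:>3}: accuracy ≈ {curve[i]:.3f}")
```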
hypothesis: we’re confusing planning & execution

experiment: separate them by providing the plan

result:
- GPT-5 runs for 1000+ steps
- Sonnet 4: 432
- Grok 4: 384
- Gemini 2.5 Pro: 120
- R1: 120
Why does it feel like we’re hitting a wall with LLMs?

Because we’re looking at the wrong metrics

If you look only at single-step accuracy, we have indeed hit a wall (even regressed in some areas)

But if you look at long horizon tasks (the stuff that matters imo), we’re just getting started
The image is a two-panel chart illustrating how diminishing gains on single-step accuracy can translate into exponential gains over long task horizons.

⸻

Title (top, bold text):

Diminishing Gains On A Single Step Can Lead To Exponential Gains Over Long Horizon

⸻

Left Panel:
	•	X-axis: Model Release Date
	•	Y-axis: Step Accuracy
	•	Curve: Red curve starting near 0.92 accuracy, rising steeply, then flattening near 1.00.
	•	Points: Four colored circular markers (pink, yellow, green, blue) placed along the curve to show progress across models.

⸻

Right Panel:
	•	X-axis: Model Release Date
	•	Y-axis: Task Length (0 to 20k)
	•	Curve: Blue curve starting near zero, remaining low, then shooting upward exponentially.
	•	Points: Same four colored circular markers (pink, yellow, green, blue) mapped correspondingly from the left chart, showing how small step accuracy improvements allow much longer task lengths.

⸻

Bottom caption:

“assuming step accuracy is constant across all steps of the task”

⸻

The diagram demonstrates that even small improvements in per-step accuracy (left chart) yield disproportionately large benefits in handling longer tasks (right chart).
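The exponential shape on the right falls straight out of the caption's assumption: if every step independently succeeds with probability p, a task of n steps succeeds with probability p^n, so the horizon at a fixed success rate grows like ln(success) / ln(p). A quick sketch of that arithmetic (the 50% success threshold is my choice for illustration, not something from the paper):

```python
import math

def horizon(step_accuracy: float, success_rate: float = 0.5) -> float:
    """Longest task (in steps) completed with at least `success_rate`
    probability, assuming a constant per-step accuracy -- the same
    assumption as the chart's caption."""
    return math.log(success_rate) / math.log(step_accuracy)

# Small, "diminishing" gains in step accuracy...
for p in (0.90, 0.99, 0.999, 0.9999):
    # ...buy roughly 10x longer horizons each time near p = 1.
    print(f"step accuracy {p}: ~{horizon(p):,.0f} steps at 50% success")
```

Near p = 1, each extra "9" of step accuracy roughly multiplies the horizon by ten, which is the hockey-stick shape of the right-hand panel.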
cheers to open source AI releasing tons of model sizes so we can observe stuff like this
The image is a chart showing how scaling model size enables execution of longer tasks.

⸻

Title (top, bold text):

Scaling Model Size Enables Execution of Longer Tasks

⸻

Chart:
	•	X-axis: Model Size (Billion Parameters)
	•	Y-axis: Task Length (0 to 12)
	•	Two model families are plotted:
	•	Qwen3 (blue line with circular markers)
	•	Gemma3 (red line with circular markers)
	•	Both curves show a positive trend: as model size increases, the task length increases.
	•	Qwen3 consistently reaches slightly higher task lengths than Gemma3 at the same size.
	•	Example: At ~32B parameters, Qwen3 achieves ~12 task length, while Gemma3 achieves ~9.

⸻

Legend:
	•	Blue dots/line: Qwen3
	•	Red dots/line: Gemma3

⸻

The plot emphasizes that larger models can sustain longer reasoning or execution chains, with Qwen3 showing stronger scaling than Gemma3.
Reasoning models do better

this makes sense to me. Reasoning models have been known to correct or look past mistakes

GPT-5 is naturally going to knock this out of the park because they really doubled down on test-time compute

my post from July:

timkellogg.me/blog/2025/07...
The Impact of Thinking. We find recent thinking models are not affected by prior mistakes, fixing self-conditioning. Further, sequential test time compute greatly improves the length of task a model can complete in a single turn. Where without CoT, frontier LLMs like DeepSeek-V3 fail at performing even two steps of execution, its thinking version R1 can execute 200, highlighting the importance of reasoning before acting (Yao et al., 2023). We benchmark frontier thinking models, and find GPT-5 thinking (codename "Horizon") can execute over 1000 steps, far ahead of the next best competitor, Claude-4-Sonnet at 432.
curiously, i’ve wondered in the past whether text diffusion models similarly look past mistakes, since they don’t generate autoregressively

i’d love to see how a diffusion model does on this benchmark

timkellogg.me/blog/2025/02...
timkellogg.me
LLaDA: LLMs That Don't Gaslight You
A new language model uses diffusion instead of next-token prediction. That means it can back out of a hallucination before it commits. This is a big win for areas like law & contracts, where ...