Bluesky Thread

this fits my mental model — LLMs *do* learn procedures. But it’s the same mechanics as what’s learning facts. So of course it would also hallucinate procedures

but also: what does procedure hallucination look like? i don’t think i have a grasp on that

machinelearning.apple.com/research/ill...
By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities.
This paper is being advertised as evidence against AGI, but I’m looking at these charts and…that’s how people operate too. Those are humans’ failure modes. Computers’ failure modes normally look entirely different. This is AGI straight into your veins
This image presents experimental results on how Claude 3.7 and Claude 3.7 (+thinking) solve Tower of Hanoi puzzles at increasing complexity levels.

Top row of plots:
	•	Left plot (Accuracy vs. Complexity): Claude 3.7 (+thinking) maintains 100% accuracy through 6 disks and then declines sharply. Claude 3.7 without thinking fails beyond 4 disks.
	•	Middle plot (Token usage vs. Complexity): The thinking model uses more tokens, peaking at 20,000 for 10 disks, whereas non-thinking Claude uses fewer tokens overall but also collapses early.
	•	Right plot (Position within thoughts): Shows when final answers occur in the trace. Correct solutions appear early for easy tasks, and later for harder ones. Incorrect ones often show premature termination.

Figure caption highlights:
	•	Bottom left & middle: Non-thinking models are more accurate and efficient at low complexity. But as task difficulty increases, thinking models outperform—though at the cost of more tokens—until a failure threshold is hit.
	•	Bottom right: In successful attempts, Claude 3.7 Thinking gives correct answers either early (simple cases) or late (harder cases). In failed attempts, it latches onto incorrect answers too early and wastes remaining tokens. This exposes inefficiencies in its reasoning approach.
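
For a sense of what “exact computation” means here: the explicit Tower of Hanoi procedure is a few lines of recursion, and the optimal solution is 2^n - 1 moves, so each extra disk roughly doubles the work. A minimal sketch in Python, just as a reference point (not code from the paper):

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Explicit Tower of Hanoi: move n disks from src to dst via aux."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk to its destination
    hanoi(n - 1, aux, src, dst, moves)   # bring the n-1 disks back on top of it
    return moves

# The optimal solution length is 2**n - 1, so the move count compounds fast:
for n in range(1, 11):
    assert len(hanoi(n)) == 2**n - 1
```

The point of spelling it out: the puzzle has a known, compact algorithm, so “failing to use explicit algorithms” means the models don’t reliably execute this procedure at scale, not that the procedure is hard to state.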
this is a fine conclusion, but it’s phrased poorly

it leads the reader to believe that LRMs are inherently limited, whereas it’s actually just saying that LRMs aren’t computers; they’re something else

again, that’s what we’ve been saying
For simpler, low-compositional problems, standard LLMs demonstrate greater efficiency and accuracy. As problem complexity moderately increases, thinking models gain an advantage. However, when problems reach high complexity with longer compositional depth, both model types experience complete performance collapse (Fig. 1, bottom left). Notably, near this collapse point, LRMs begin reducing their reasoning effort (measured by inference-time tokens) as problem complexity increases, despite operating well below generation length limits (Fig. 1, bottom middle). This suggests a fundamental inference-time scaling limitation in LRMs' reasoning capabilities relative to problem complexity.
Finally, our analysis of intermediate reasoning traces or thoughts reveals complexity-dependent patterns: In simpler problems, reasoning models often identify correct solutions early but inefficiently continue exploring incorrect alternatives, an "overthinking" phenomenon. At moderate complexity, correct solutions emerge only after extensive exploration of incorrect paths. Beyond a certain complexity threshold, models completely fail to find correct solutions (Fig. 1, bottom right). This indicates LRMs possess limited self-correction capabilities that, while valuable, reveal fundamental inefficiencies and clear scaling limitations.
all of this is very intuitive

some models are inherently more capable than others. Simply “thinking longer” doesn’t magically solve harder problems. The LLM still must be capable of tackling said problem

everyone i’ve ever talked to has this intuition, technical or not
the labs have also been saying this

OpenAI and Anthropic have been talking about how making this new crop of models is a combo of pre & post training scaling

if they still need pre-training, that means they still need to improve the fundamental nature of the model. Thinking isn’t a panacea
there was a time where i did think that reasoning would be a panacea. simply thinking longer during inference would tackle all problems

but that phase didn’t last long. even by the time my s1 post landed it didn’t feel right

timkellogg.me/blog/2025/02...
S1: The $6 R1 Competitor?
for example, if you misconfigure ollama so that it forgets to stop on the stop token, the response doesn’t get better, it gets FAR worse
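
a minimal sketch of that misconfiguration, assuming a local ollama server and a model tagged “llama3” (the model tag, the prompt, and the idea that an empty options.stop list clears the configured stop sequences are all assumptions for illustration, not guarantees about any particular version):

```python
# Illustrative only: calling ollama's /api/generate with the stop sequences
# overridden. Assumption: an empty "stop" list suppresses the model's
# configured stop tokens, so generation only ends at the num_predict limit.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",               # assumed local model tag
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"stop": [], "num_predict": 2048},
    },
    timeout=600,
)
print(resp.json()["response"])  # the extra tokens don't make the answer better
```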

so then maybe you can RL it into just thinking longer

s1 showed that was VERY easy (force it to say “wait,”) but we didn’t see anything like takeoff
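
for reference, the s1-style trick is roughly this simple. A sketch with a hypothetical `generate` helper standing in for a real inference call (the helper, the `</think>` marker, and the fixed round count are stand-ins, not any specific library’s API):

```python
END_THINK = "</think>"  # assumed end-of-thinking marker

def generate(prompt: str, stop: list[str]) -> str:
    """Stand-in for a real inference call: returns the model's continuation
    of `prompt`, truncated just before the first `stop` string."""
    return " ...some more reasoning... "

def think_with_budget(prompt: str, extra_rounds: int = 2) -> str:
    """s1-style budget forcing: each time the model tries to close its
    thinking block, append "wait," and let it keep going."""
    trace = prompt
    for _ in range(extra_rounds):
        trace += generate(trace, stop=[END_THINK])
        trace += "wait,"  # suppress the stop, force more thinking tokens
    trace += generate(trace, stop=[END_THINK]) + END_THINK
    return trace
```

that buys more thinking tokens; it doesn’t change what the underlying model is capable of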