Bluesky Thread

Inverse scaling of reasoning models

a research collab demonstrated that there are certain types of tasks where all top reasoning models do WORSE the longer they think

things like getting distracted by irrelevant info, spurious correlations, etc.

www.arxiv.org/abs/2507.14417
Three panels at the top describe task types with example prompts:

1. Simple Counting Tasks with Distractors (Misleading Math & Python): prompts mention an apple and an orange, padded with irrelevant or confusing information (e.g., a probabilistic riddle, Python code) before asking the straightforward question: "Calculate how many fruits you have."

2. Regression Tasks with Spurious Features (Grades Regression): given XML-style records about a student, the model must predict grades from features like sleep hours, social hours, and stress level. The challenge lies in separating relevant attributes from spurious ones.

3. Deduction Tasks with Constraint Tracking (Zebra Puzzles): complex logical-reasoning puzzles with multiple interrelated clues, e.g., "What position is the person who likes salmon at?" Constraints involve foods, names, and relations like "to the left of."

The bottom row contains three line plots comparing model performance across tasks:

- Misleading Math (left): accuracy drops sharply for some models as reasoning tokens increase. Claude Sonnet 4 maintains high performance; o3 and DeepSeek R1 hold relatively stable accuracy; Qwen3 32B and QwQ 32B drop more.

- Grades Regression (middle): plots negative RMSE, i.e., higher is better (see the sketch after this description). Claude models remain strong across token counts; o3 also performs well; Qwen3 and QwQ struggle, with DeepSeek R1 performing modestly.

- Zebra Puzzles (right): accuracy vs. average reasoning tokens. o3 and Claude Sonnet 4 maintain the highest performance; the other models (DeepSeek R1, Qwen3 32B, QwQ 32B) degrade or plateau. Error bars reflect variability.

Each plot uses colored lines with markers to distinguish the models.
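
Quick aside on that middle plot's metric: negative RMSE is just root-mean-square error with the sign flipped, so that higher is better and all three panels read in the same direction. A minimal sketch (my own illustration, not code from the paper):

```python
import math

def neg_rmse(preds: list[float], targets: list[float]) -> float:
    """Negative root-mean-square error: 0 is perfect, more negative is worse."""
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    return -math.sqrt(mse)

# e.g., predicted vs. true grades for three students
print(neg_rmse([85.0, 72.0, 90.0], [88.0, 70.0, 91.0]))  # ≈ -2.16
```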
it’s nuts to think how easy it is to do this to yourself and not realize it

how many times did i use o3-mini-high bc i really just wanted that extra juice, that mini high, regardless of whether i really NEEDED it
From the paper, on how reasoning length was controlled (prompts like "think" and "think harder" combined with specified reasoning budgets):

"For Claude and open-weight models, we specify an integer denoting the maximum number of tokens the model should use to reason (e.g., "0", "1,024", "2,048", "4,096"), while for o-series models, we use their built-in budget levels (i.e., "low", "medium", "high")."
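
To make the two budget mechanisms concrete, here is a minimal sketch against the public Anthropic and OpenAI Python SDKs. This is my own illustration, not the paper's evaluation harness; the model names and budget values are assumptions chosen to mirror the quote above.

```python
# Sketch: two ways to cap how long a model "thinks".
# Illustrative only -- not the paper's actual code.

from anthropic import Anthropic
from openai import OpenAI

PROMPT = "You have an apple and an orange. Calculate how many fruits you have."

# Claude-style: an explicit integer budget on reasoning tokens.
claude = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
claude_resp = claude.messages.create(
    model="claude-sonnet-4-20250514",  # example model name
    max_tokens=8192,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4096},  # cf. 1,024 / 2,048 / 4,096
    messages=[{"role": "user", "content": PROMPT}],
)

# o-series style: discrete built-in effort levels instead of a token count.
oai = OpenAI()  # reads OPENAI_API_KEY from the environment
oai_resp = oai.chat.completions.create(
    model="o3",
    reasoning_effort="high",  # "low" | "medium" | "high"
    messages=[{"role": "user", "content": PROMPT}],
)
```

The asymmetry is the point: Claude and open-weight models accept a numeric token budget, while o-series models only expose coarse effort levels, so "more thinking" means slightly different things across providers.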
i think this will be one of the big innovations of GPT-5: taking that part out of your hands so you don’t hurt yourself