Bluesky Thread

Surprising: Math requires a lot of memorization


Goodfire is at it again!

They developed a method similar to PCA that measures how much of an LLM’s weights are dedicated to memorization (rough sketch of the general idea below the chart)

www.goodfire.ai/research/und...
Image description: a bar chart titled “Relative benchmark performance after K-FAC edit.”

The y-axis shows K-FAC Edit Accuracy / Baseline (ranging from 0.0 to 1.0).
The x-axis lists various benchmarks from left to right, grouped by category and color-coded:
	•	Dark blue (Memory): Heldout, Quotes — strong drop, near zero to 0.2.
	•	Light blue (Math): GSM8K, MMLU-Pro Math, SimpleMath — moderate performance (~0.65–0.75).
	•	Pale blue (Closed-book QA): PopQA, TriviaQA, Relations — higher (~0.8–0.9).
	•	Light orange (Open-book QA): TriviaQA-Open, BoolQ, OBQA — near 1.0.
	•	Red-orange (Logic): Boar, Etruscan, Winogrande, Logical Deduction, Tracking Objs, Bool Expr. — around 1.0 or slightly above.

At the bottom, a gradient arrow labeled “Memorization (specialized patterns)” → “Reasoning (shared mechanisms)” illustrates the trend: memory-heavy tasks degrade sharply after K-FAC editing, while reasoning-based tasks retain or improve performance.
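For intuition only, here is a toy, PCA/SVD-flavoured sketch of the general shape of this kind of experiment: decompose a weight matrix into directions, zero some of them out, and re-measure benchmark accuracy. This is not Goodfire’s actual K-FAC procedure (they rank directions by loss curvature, not singular value), and the layer path and `eval_benchmark` helper are hypothetical placeholders.

```python
# Toy sketch only -- NOT Goodfire's K-FAC edit. It shows the shape of the
# experiment: decompose a weight matrix, remove some directions, re-evaluate.
import torch

def ablate_trailing_directions(W: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep only the top-`keep` singular directions of a weight matrix W.

    SVD is used here as a PCA-like analogy; the paper ranks directions by
    loss curvature (K-FAC), not by singular value.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    S_edited = S.clone()
    S_edited[keep:] = 0.0                      # drop the trailing directions
    return U @ torch.diag(S_edited) @ Vh

# Hypothetical usage: edit one layer in place, then compare benchmark scores.
# W = model.layers[10].mlp.down_proj.weight.data
# model.layers[10].mlp.down_proj.weight.data = ablate_trailing_directions(W, keep=512)
# print(eval_benchmark(model, "GSM8K"), eval_benchmark(model, "TriviaQA"))
```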
this really highlights how LLMs do math

math is a long chain of operations, so one small error (e.g. a misremembered shortcut) cascades into wrong results downstream
From the linked Goodfire write-up:

In between those extremes lie tasks like math and question-answering. Perhaps surprisingly, some mathematical tasks seem to rely on memorization-heavy structure more than most of the other tasks we tested. When the model solves an arithmetic problem like "30 + 60," its learnt rule appears to recruit parts of the model that are also used for memorized sequences, so removing those components often disrupts these precise operations.
In the example below from GSM8K, the reasoning chain remains intact, but the model makes an arithmetic mistake in the final calculation. This and similar examples seem to indicate that the reduced performance on math benchmarks comes largely from arithmetic errors. Since solving word problems requires both reasoning (to understand and formalize the question) and calculation, the edited model's poor arithmetic abilities mean it does poorly on the overall math benchmarks - even though its reasoning capabilities are preserved.
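To make the cascading-error point concrete, a tiny made-up example: in a multi-step calculation, one misremembered intermediate fact corrupts every step after it, even when the reasoning chain itself is right.

```python
# Made-up illustration of a cascading arithmetic error: the reasoning
# (multiply, subtract, divide) is identical in both runs, but one
# "misremembered" fact (7 * 8 recalled as 54) poisons everything after it.
step1_correct = 7 * 8                       # 56
step1_wrong = 54                            # single recall slip

answer_correct = (step1_correct - 6) / 2    # (56 - 6) / 2 = 25.0
answer_wrong = (step1_wrong - 6) / 2        # (54 - 6) / 2 = 24.0

print(answer_correct, answer_wrong)         # 25.0 24.0
```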
a big reason for this research is figuring out what a “cognitive core” might look like, a 1B model that relies on external knowledge banks

it’s interesting that math suffers, but i don’t think that would be the case for a 1B trained from scratch, it wouldn’t rely on those shortcuts
i’m curious if you could also patch a lot of this

go back and post-train it to distrust memories. maybe RL it with an external memory bank
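A very rough sketch of what that external-memory-bank idea could look like in its simplest form: retrieve facts from an outside store and put them in the prompt, so the model doesn’t have to trust its own parametric memory. The `generate` callable, the store contents, and the keyword lookup are all placeholder assumptions; a real system would presumably use a learned retriever plus the RL post-training suggested above.

```python
# Placeholder sketch of "answer from an external memory bank" -- not a real
# retriever, just the overall shape: look facts up, then condition on them.
from typing import Callable

memory_bank = {
    "capital of australia": "Canberra",
    "boiling point of water at sea level": "100 °C",
}

def answer(question: str, generate: Callable[[str], str]) -> str:
    # Naive substring lookup stands in for a real retriever (e.g. embeddings).
    retrieved = [f"{k}: {v}" for k, v in memory_bank.items() if k in question.lower()]
    prompt = "Use only these facts:\n" + "\n".join(retrieved) + f"\nQ: {question}\nA:"
    return generate(prompt)

# answer("What is the capital of Australia?", generate=my_small_model)  # hypothetical
```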