Physics of Language Models: Part 3.1
If you show a fact to an LLM just once in pre-training, it’ll memorize the surface form but not the fact itself.
But if you (synthetically) rephrase the text several times, it’ll actually learn the fact.
arxiv.org/abs/2309.14316
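A minimal sketch of what “rephrase several times” means in practice. In the real pipeline the paraphrases come from an LLM; the fixed templates and facts below are invented, runnable stand-ins so the augmentation loop itself is concrete:

```python
"""Sketch: one fact, many surface forms for the pre-training corpus."""

FACTS = [
    {"s": "Marie Curie", "o": "two Nobel Prizes"},
    {"s": "Linus Pauling", "o": "two Nobel Prizes"},
]

# Each template is a different surface form of the same underlying fact.
# In practice an LLM would generate these rewrites, not fixed templates.
TEMPLATES = [
    "{s} won {o}.",
    "It was {s} who won {o}.",
    "{o} were awarded to {s}.",
    "Few people know that {s} received {o}.",
]

def augment(facts, templates):
    """Render every fact through every template -> pre-training lines."""
    return [t.format(**f) for f in facts for t in templates]

if __name__ == "__main__":
    for line in augment(FACTS, TEMPLATES):
        print(line)
```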
Baguettotron & Monad (h/t @dorialexander.bsky.social) were proof of this concept.
Their pretraining data was nothing other than Wikipedia’s most vital articles, synthetically rephrased (with LLMs) many times.
huggingface.co/PleIAs/Bague...
Not every token is equal
Pre-training scaling laws predict improvement in “loss”, effectively the model’s ability to compress its training data.
But lower loss alone doesn’t guarantee real-world performance. Better loss on a more helpful dataset, though, is 100% going to lead to better real-world performance.
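The loss-as-compression link is literal: a model’s mean cross-entropy (in nats per token) converts directly into a bits-per-byte compression rate via arithmetic coding. A quick sketch, with made-up numbers rather than measurements from any real model:

```python
"""Sketch: converting cross-entropy loss into a compression rate."""
import math

def bits_per_byte(loss_nats_per_token: float, tokens: int, text_bytes: int) -> float:
    """A mean loss L (nats/token) means each token costs about L/ln(2)
    bits under an arithmetic coder driven by the model."""
    total_bits = loss_nats_per_token / math.log(2) * tokens
    return total_bits / text_bytes

# Hypothetical run: 1,000 tokens covering 4,000 bytes of text at loss 2.0.
print(f"{bits_per_byte(2.0, 1_000, 4_000):.3f} bits/byte")  # ~0.721
```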
so is pre-training scaling dead?
Hell no. It just wasn’t the lowest-hanging fruit: scaling up to 10T or 100T is really f***ing expensive, while rephrasing data is cheap.
And if you can get the same performance from a 100B model as from a 1T one, the former is going to be a lot easier to work with.
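For a rough sense of “expensive”, here’s back-of-envelope arithmetic using the standard FLOPs ≈ 6 × params × tokens approximation for dense transformer pre-training. All figures are illustrative, not quotes from any real training run:

```python
"""Sketch: why scaling params and tokens gets expensive fast."""

def train_flops(params: float, tokens: float) -> float:
    # ~6 FLOPs per parameter per training token (forward + backward).
    return 6 * params * tokens

configs = {
    "100B params, 10T tokens": (100e9, 10e12),
    "1T params, 10T tokens":   (1e12, 10e12),
    "1T params, 100T tokens":  (1e12, 100e12),
}
for name, (p, d) in configs.items():
    print(f"{name}: {train_flops(p, d):.0e} FLOPs")
# 100B/10T -> 6e+24; 1T/10T -> 6e+25; 1T/100T -> 6e+26
```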
e.g. with Monad, they can fully retrain it in about 8 hours.
That’s the sort of turnaround that lets you experiment a ton with which kinds of synthetic rephrasing help the most.
Tiny models are excellent test beds.
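What that experiment loop might look like, as a sketch. `pretrain` and `evaluate` are placeholders standing in for a real training run and a factual-recall eval, and the strategy names are invented examples:

```python
"""Sketch: sweeping rephrasing strategies when a full retrain is cheap."""

STRATEGIES = ["plain", "qa_rewrites", "style_diverse", "multi_lingual"]

def pretrain(strategy: str) -> str:
    # Placeholder: in reality, build the rephrased corpus for this
    # strategy and retrain the tiny model from scratch (~8h for Monad).
    return f"model[{strategy}]"

def evaluate(model: str) -> float:
    # Placeholder: run a factual-recall benchmark; here, a dummy score.
    return hash(model) % 100 / 100

results = {s: evaluate(pretrain(s)) for s in STRATEGIES}
best = max(results, key=results.get)
print(results, "->", best)
```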