Physics of Language Models: Part 3.1
If you show a fact to an LLM just once in pre-training, it’ll memorize the surface form but not the fact itself.
But if you (synthetically) rephrase the text several times, it’ll actually learn the fact.
arxiv.org/abs/2309.14316
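A minimal sketch of what “rephrase several times” means in practice. In the real pipeline the paraphrases come from an LLM; the fixed templates and facts below are invented, runnable stand-ins so the augmentation loop itself is concrete:

```python
"""Sketch: one fact, many surface forms for the pre-training corpus."""

FACTS = [
    {"s": "Marie Curie", "o": "two Nobel Prizes"},
    {"s": "Linus Pauling", "o": "two Nobel Prizes"},
]

# Each template is a different surface form of the same underlying fact.
# In practice an LLM would generate these rewrites, not fixed templates.
TEMPLATES = [
    "{s} won {o}.",
    "It was {s} who won {o}.",
    "{o} were awarded to {s}.",
    "Few people know that {s} received {o}.",
]

def augment(facts, templates):
    """Render every fact through every template -> pre-training lines."""
    return [t.format(**f) for f in facts for t in templates]

if __name__ == "__main__":
    for line in augment(FACTS, TEMPLATES):
        print(line)
```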
Baguettotron & Monad (h/t @dorialexander.bsky.social) were proof of this concept.
Their pretraining data was nothing other than Wikipedia’s most vital articles, synthetically rephrased (with LLMs) many times.
huggingface.co/PleIAs/Bague...
Not every token is equal
Pre-training scaling laws predict improvement in “loss”, effectively the model’s ability to compress its training data.
But lower loss alone doesn’t guarantee real-world performance. Better loss on a more helpful dataset, though, is 100% going to lead to better real-world performance.
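The loss-as-compression link is literal: a model’s mean cross-entropy (in nats per token) converts directly into a bits-per-byte compression rate via arithmetic coding. A quick sketch, with made-up numbers rather than measurements from any real model:

```python
"""Sketch: converting cross-entropy loss into a compression rate."""
import math

def bits_per_byte(loss_nats_per_token: float, tokens: int, text_bytes: int) -> float:
    """A mean loss L (nats/token) means each token costs about L/ln(2)
    bits under an arithmetic coder driven by the model."""
    total_bits = loss_nats_per_token / math.log(2) * tokens
    return total_bits / text_bytes

# Hypothetical run: 1,000 tokens covering 4,000 bytes of text at loss 2.0.
print(f"{bits_per_byte(2.0, 1_000, 4_000):.3f} bits/byte")  # ~0.721
```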
so is pre-training scaling dead?
Hell no. It just wasn’t the lowest-hanging fruit: scaling up to 10T or 100T is really f***ing expensive, while rephrasing data is cheap.
And if you can get the same performance from a 100B model as from a 1T one, the former is going to be a lot easier to work with.
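For a rough sense of “expensive”, here’s back-of-envelope arithmetic using the standard FLOPs ≈ 6 × params × tokens approximation for dense transformer pre-training. All figures are illustrative, not quotes from any real training run:

```python
"""Sketch: why scaling params and tokens gets expensive fast."""

def train_flops(params: float, tokens: float) -> float:
    # ~6 FLOPs per parameter per training token (forward + backward).
    return 6 * params * tokens

configs = {
    "100B params, 10T tokens": (100e9, 10e12),
    "1T params, 10T tokens":   (1e12, 10e12),
    "1T params, 100T tokens":  (1e12, 100e12),
}
for name, (p, d) in configs.items():
    print(f"{name}: {train_flops(p, d):.0e} FLOPs")
# 100B/10T -> 6e+24; 1T/10T -> 6e+25; 1T/100T -> 6e+26
```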
e.g. with Monad, they can fully retrain it in about 8 hours.
That’s the sort of turnaround that lets you experiment a ton with which kinds of synthetic rephrasing help the most.
Tiny models are excellent test beds.
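What that experiment loop might look like, as a sketch. `pretrain` and `evaluate` are placeholders standing in for a real training run and a factual-recall eval, and the strategy names are invented examples:

```python
"""Sketch: sweeping rephrasing strategies when a full retrain is cheap."""

STRATEGIES = ["plain", "qa_rewrites", "style_diverse", "multi_lingual"]

def pretrain(strategy: str) -> str:
    # Placeholder: in reality, build the rephrased corpus for this
    # strategy and retrain the tiny model from scratch (~8h for Monad).
    return f"model[{strategy}]"

def evaluate(model: str) -> float:
    # Placeholder: run a factual-recall benchmark; here, a dummy score.
    return hash(model) % 100 / 100

results = {s: evaluate(pretrain(s)) for s in STRATEGIES}
best = max(results, key=results.get)
print(results, "->", best)
```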