Bluesky Thread

from the original GSM8K paper, they project that it would require a 10 quadrillion parameter model to get an 80% on GSM8K

Gemma 3 4B got 89%
for each test problem. Unsurprisingly, we see that the 175B model significantly outperforms the smaller models. Assuming a log-linear trend, we can naively extrapolate these results to estimate that a model with 10^16 parameters would be required to reach an 80% solve rate, when using the full GSM8K training set. It is even harder to extrapolate along the data dimension, since performance does not appear to follow a log-linear trend. Nevertheless, it appears likely that the 175B model would require at least two additional orders of magnitude of training data to reach an 80% solve rate.
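
for the curious: the extrapolation in that passage is just a straight-line fit of solve rate against log10(parameters). a rough sketch of the procedure below — the numbers are illustrative stand-ins, not the paper's actual results, chosen only so the same fit lands near the ~10^16 figure quoted above:

    import numpy as np

    # Illustrative points only (NOT the GSM8K paper's real numbers), picked so
    # the log-linear extrapolation reproduces the ballpark ~10^16 figure.
    params = np.array([1.3e9, 13e9, 175e9])     # model size, in parameters
    solve_rate = np.array([0.12, 0.22, 0.33])   # fraction of test problems solved

    # "Log-linear trend": model solve rate as a straight line in log10(params).
    slope, intercept = np.polyfit(np.log10(params), solve_rate, deg=1)

    # Extrapolate: what model size would this trend predict for an 80% solve rate?
    needed_log10_params = (0.80 - intercept) / slope
    print(f"~10^{needed_log10_params:.0f} parameters")   # -> ~10^16 here

the punchline of the thread is that a 4B model beat that target by a wide margin, i.e. the naive size extrapolation was off by about six orders of magnitude.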
question: what got us here?

i argue it was data.

obvs everything played a part, but there are papers showing drops of as much as multiple orders of magnitude in data requirements

also: we used to get freaked out by synthetic data, we’ve gotten over that now

bsky.app/profile/timk...
Tim Kellogg @timkellogg.me
data is more important than algorithms
[image: a man saying “yes, you are all wrong”, surrounded by like a thousand people]
i think you could also argue that it wouldn’t be possible to jump straight to these levels. not simply difficult, but outright impossible

so much of what we do these days in training leverages other LLMs
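
a minimal sketch of what "leveraging other LLMs" often looks like in practice: a stronger teacher model writes synthetic problems and worked solutions, which become fine-tuning data for a smaller student model. the teacher_llm helper here is hypothetical, a stand-in for whatever model you'd actually call:

    import json

    def teacher_llm(prompt: str) -> str:
        # Hypothetical stand-in: in practice this would call a strong existing
        # model and return its text completion for `prompt`.
        return "Placeholder problem and step-by-step solution text."

    PROMPT = (
        "Write one grade-school math word problem, then a step-by-step solution "
        "ending with the final numeric answer on its own line."
    )

    # The "LLMs train LLMs" loop in miniature: the teacher generates synthetic
    # examples, saved as fine-tuning data for a smaller student model.
    with open("synthetic_math.jsonl", "w") as f:
        for _ in range(1000):
            f.write(json.dumps({"text": teacher_llm(PROMPT)}) + "\n")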
bsky.app/profile/dori...
Alexander Doria @dorialexander.bsky.social
Model collapse was actually a remarkable piece of memetic warfare: copyright holders are falling into the trap and in deep denial over the turn to synth.