from the original GSM8K paper: they project that it would take a 10 quadrillion parameter model to get 80% on GSM8K
Gemma 3 4B got 89%
question: what got us here?
i argue it was data.
obvs everything played a part, but there’s papers showing drops of multiple orders of magnitude in data requirements
also: we used to get freaked out by synthetic data, we’ve gotten over that now
bsky.app/profile/timk...
data is more important than algorithms
i think you could also argue that it wouldn’t be possible to jump straight to these levels. not simply difficult, but outright impossible
so much of what we do these days in training leverages other LLMs