Bluesky Thread

Scaling Laws for Precision


Yes, Llama models are harder to quantize. They’re “overtrained” on more data, so quantization removes a lot of critical information.

arxiv.org/abs/2411.04330
Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for b...
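A rough way to picture that claim (a sketch, not the paper’s fitted law — every constant below is a placeholder): the loss hit from post-training quantization grows with how many tokens you trained on relative to model size, and falls off as the post-quantization bit width goes up. So an “overtrained” Llama-style run takes a bigger hit at the same bit width than a Chinchilla-style run of the same size.

```python
# Illustrative sketch of the qualitative finding: post-training quantization
# degradation grows with the tokens-to-parameters ratio D/N and decays as
# post-training precision P_post increases. C_T, gamma_D, gamma_N, gamma_post
# are placeholders, not the paper's fitted constants.
import math

def ptq_loss_degradation(n_params, n_tokens, p_post_bits,
                         C_T=1.0, gamma_D=1.0, gamma_N=1.0, gamma_post=1.0):
    """Illustrative delta-loss from quantizing weights after training."""
    return C_T * (n_tokens ** gamma_D / n_params ** gamma_N) * math.exp(-p_post_bits / gamma_post)

# Same 1B model, quantized to 4 bits after training: the run trained on far
# more tokens takes the larger hit, which is the point about Llama above.
print(ptq_loss_degradation(1e9, 2e12, 4))   # heavily overtrained: ~2T tokens
print(ptq_loss_degradation(1e9, 2e10, 4))   # Chinchilla-style: ~20B tokens
```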
This whole paper is a whole lot of validation for how I assumed things work

e.g. a 1B model trained in fp4 has the same effective parameter count as a 256M model trained in bf16
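A minimal sketch of that “effective parameter count” idea, assuming the exponential-saturation form the paper describes; the sensitivity gamma below is a placeholder, not the paper’s fitted value.

```python
# Sketch: training the weights in P bits makes an N-parameter model behave
# like roughly N * (1 - exp(-P / gamma)) parameters in the loss scaling law,
# so lower training precision shrinks effective capacity. gamma is a
# placeholder, not the paper's fit.
import math

def effective_params(n_params: float, weight_bits: float, gamma: float = 2.0) -> float:
    """Illustrative effective parameter count under low-precision training."""
    return n_params * (1.0 - math.exp(-weight_bits / gamma))

# The same 1B model at different training precisions: effective capacity
# drops as the bit width drops.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights -> {effective_params(1.0e9, bits):.3e} effective params")
```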
22 likes 3 reposts
