Bluesky Thread

Scaling Laws for Precision


Yes, Llama models are harder to quantize. They’re “overtrained” on more data, so quantization removes a lot of critical information.

arxiv.org/abs/2411.04330
Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for b...
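A rough way to picture that claim (a sketch, not the paper’s fitted law — every constant below is a placeholder): the loss hit from post-training quantization grows with how many tokens you trained on relative to model size, and falls off as the post-quantization bit width goes up. So an “overtrained” Llama-style run takes a bigger hit at the same bit width than a Chinchilla-style run of the same size.

```python
# Illustrative sketch of the qualitative finding: post-training quantization
# degradation grows with the tokens-to-parameters ratio D/N and decays as
# post-training precision P_post increases. C_T, gamma_D, gamma_N, gamma_post
# are placeholders, not the paper's fitted constants.
import math

def ptq_loss_degradation(n_params, n_tokens, p_post_bits,
                         C_T=1.0, gamma_D=1.0, gamma_N=1.0, gamma_post=1.0):
    """Illustrative delta-loss from quantizing weights after training."""
    return C_T * (n_tokens ** gamma_D / n_params ** gamma_N) * math.exp(-p_post_bits / gamma_post)

# Same 1B model, quantized to 4 bits after training: the run trained on far
# more tokens takes the larger hit, which is the point about Llama above.
print(ptq_loss_degradation(1e9, 2e12, 4))   # heavily overtrained: ~2T tokens
print(ptq_loss_degradation(1e9, 2e10, 4))   # Chinchilla-style: ~20B tokens
```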
This whole paper is a whole lot of validation for how I assumed things work

e.g. a 1B model trained in fp4 has the same effective parameter count as a 256M model trained in bf16
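A minimal sketch of that “effective parameter count” idea, assuming the exponential-saturation form the paper describes; the sensitivity gamma below is a placeholder, not the paper’s fitted value.

```python
# Sketch: training the weights in P bits makes an N-parameter model behave
# like roughly N * (1 - exp(-P / gamma)) parameters in the loss scaling law,
# so lower training precision shrinks effective capacity. gamma is a
# placeholder, not the paper's fit.
import math

def effective_params(n_params: float, weight_bits: float, gamma: float = 2.0) -> float:
    """Illustrative effective parameter count under low-precision training."""
    return n_params * (1.0 - math.exp(-weight_bits / gamma))

# The same 1B model at different training precisions: effective capacity
# drops as the bit width drops.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights -> {effective_params(1.0e9, bits):.3e} effective params")
```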
22 likes 3 reposts
