Bluesky Thread: Scaling Laws for Precision
November 12, 2024

yes, llama models are harder to quantize. They're "overtrained" on more data, so quantization removes a lot of critical information. arxiv.org/abs/2411.04330

arxiv.org: Scaling Laws for Precision
"Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise 'precision-aware' scaling laws for b..."

This whole paper is a whole lot of validation for how I assumed things work, e.g. a 1B model trained in fp4 has the same effective parameter count as a 256M model trained in bf16.
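For a rough sense of what "effective parameter count" means here: the paper's precision-aware loss fit discounts a model's raw parameter count by a saturating function of weight precision. Below is a minimal sketch, assuming a form like N_eff = N * (1 - exp(-P / gamma)) with a hypothetical gamma (not the paper's fitted constant), so the printed numbers are illustrative only and will not match the exact 1B-fp4 vs 256M-bf16 figure from the thread.

```python
import math

# Sketch of the "effective parameter count" idea: weight precision P (in bits)
# discounts the raw parameter count N through a saturating factor.
# GAMMA is a hypothetical placeholder, not the paper's fitted value.
GAMMA = 5.0

def effective_params(n_params: float, precision_bits: float, gamma: float = GAMMA) -> float:
    """Effective parameters of an n_params-weight model trained at precision_bits."""
    return n_params * (1.0 - math.exp(-precision_bits / gamma))

if __name__ == "__main__":
    for bits, label in [(4, "fp4"), (8, "fp8"), (16, "bf16")]:
        n_eff = effective_params(1e9, bits)
        print(f"1B model trained in {label}: ~{n_eff / 1e6:.0f}M effective parameters")
```

The point of the functional form is that lowering training precision behaves like shrinking the model, which is why a heavily "overtrained" model (lots of data per parameter) has more to lose when it is quantized.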