Bluesky Thread

Is 32B-4bit equal to 16B-8bit? Depends on the task

October 15, 2025 View original thread

Is 32B-4bit equal to 16B-8bit? Depends on the task

* math: precision matters
* knowledge: effective param count is more important
* 4B-8bit threshold — for bigger prefer quant, smaller prefer more params
* parallel TTC only works above 4B-8bit

arxiv.org/abs/2510.10964

A scatter plot titled “AIME25 — Total Memory vs. Accuracy (Qwen3)” compares model accuracy (%) against total memory usage (weights + KV cache, in GB) for various Qwen3 model sizes and quantization levels.

Axes:
• X-axis: Total Memory (Weight + KV Cache) [GB] (log scale, ranging roughly from 1 to 100)
• Y-axis: Accuracy (%), ranging from 0 to 75

Legend:
• Colors: model sizes —
• 0.6B (yellow)
• 1.7B (orange)
• 4B (salmon)
• 8B (pink)
• 14B (purple)
• 32B (blue)
• Shapes: precision levels —
• Circle: 16-bit
• Triangle: 8-bit
• Square: 4-bit
• Marker size: context length —
• Small: 2k tokens
• Large: 30k tokens

Main trend:
Larger models (rightward and darker colors) achieve higher accuracy but require significantly more memory. Smaller models (left, yellow/orange) stay below 30% accuracy. Compression (8-bit or 4-bit) lowers memory usage but can reduce accuracy slightly.

Inset zoom (upper center):
A close-up box highlights the 8B (8-bit) and 14B (4-bit) models showing their proximity in accuracy despite differing memory footprints.

Overall, the chart demonstrates scaling behavior for Qwen3 models—accuracy grows with total memory and model size, with diminishing returns beyond the 14B range.

31 8

The study is compelling because of how thorough they were

1700 experiments varying
* model size (0.6B-32B)
* weight precision (4-bit, 8, 16)
* serial TTC budget (2k tokens -> 30k)
* parallel TTC (maj@k, up to k=16)
KV cache compression (eviction, quantization, StreamingLLM, HQQ)

4 1

all experiments were with Qwen3 family

Qwen remains the absolute biggest help for science. Is anyone else producing a huge number of model sizes?

More like this