Quantization Trades Precision for Accessibility
Quantization, the reduction of numerical precision of model weights and activations, is the single most impactful technique for making large models usable on smaller hardware: it trades a controlled loss of accuracy for dramatic cost and latency improvements.
"Quantizing the models reduces their size by approximately 60%, resulting in a decrease in the cost of hosting the LLM and an improvement in latency. But at what accuracy compromise? Well, for the selected benchmark, Llama2-13B has shown better results than Llama2-7B, although it is almost 50% of the size." TensorOps, Understanding the Cost of LLMs
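To make the trade concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization, the simplest form of the technique. The helper names (`quantize_int8`, `dequantize`) are illustrative, not any particular library's API; real systems typically use per-channel scales and calibration, but the round-trip below shows where the size savings and the accuracy loss both come from.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)  # 0.25: INT8 stores a quarter of the FP32 bytes
print(float(np.max(np.abs(w - w_hat))) <= scale)  # True: error bounded by one quantization step
```

The accuracy cost is the rounding error, at most half a quantization step per weight; the benefit is a 4x reduction versus FP32 (2x versus BF16) before any storage overhead for the scales.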
The default precision for training large language models is BF16, which uses 16 bits (2 bytes) per parameter. A 70-billion-parameter model at BF16 therefore requires roughly 140GB of memory just for the weights, more than a single 80GB H100 can hold. Quantizing to INT8 or FP8 halves that, and INT4 cuts it to a quarter. The practical consequence is the difference between needing eight GPUs and needing two, or between running inference on a datacenter server and running it on a laptop.
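The arithmetic behind those figures is simple enough to check directly (weight-only footprint, using 1 GB = 1e9 bytes; the function name is illustrative):

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight-only memory footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n = 70e9  # a 70-billion-parameter model
for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(n, bits):.0f} GB")
# BF16: 140 GB
# INT8: 70 GB
# INT4: 35 GB
```

Note this counts weights only; activations, KV cache, and framework overhead add to the real serving footprint.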
The precision landscape has become remarkably complex. Nvidia's Hopper architecture supports FP8, INT8, and INT4 natively. Microsoft's Maia 100 chip was designed with MXInt8 and MXFP4 formats from the OCP standard. The MI300X benchmarks reveal that FP8 performance depends enormously on software maturity: the H100 achieves roughly 1,280 TFLOP/s in FP8, while the MI300X reaches only about 990 TFLOP/s despite a higher theoretical peak, because AMD's software stack for lower precision is less mature.
The deeper insight is that quantization is not merely a compression technique but an accessibility revolution. The llama.cpp project demonstrated that a 13B-parameter model could be compressed to 8GB in quantized form, small enough to run on consumer hardware. As one observer noted, "It's awfully poignant seeing these models in dense, quantized, single-file form. It doesn't seem like it should be possible to compress so much knowledge in 8GB." Quantization is what transforms AI from a datacenter-only resource into something that runs at the edge.
Takeaway: Quantization is the great equalizer of AI deployment: it determines who can run which models, and every halving of precision roughly doubles the number of people who can afford to participate.
See also: Inference Cost Dominates Training at Scale | The Memory Wall Limits Everything | Scaling Laws Open New Dimensions When Old Ones Stall