Infrastructure
Quantization
Reducing the numerical precision of model weights (e.g. FP16 to INT8 or INT4) to shrink size and speed up inference, usually with only a small loss in accuracy.
Quantization is a model-compression technique that lowers the numerical precision of a model’s weights and activations, for example by storing numbers as 8-bit or 4-bit integers (INT8, INT4) instead of 16- or 32-bit floating-point values (FP16, FP32).
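In practice, each number is mapped from a wide continuous range onto a smaller set of discrete values using a scaling factor (and, in asymmetric schemes, a zero-point offset). Relative to FP16, INT8 roughly halves model size and INT4 cuts it to about a quarter; relative to FP32 the savings are about 4x and 8x. The smaller footprint reduces memory use and speeds up inference, mainly because fewer bytes have to move between memory and the compute units, and also because integer arithmetic uses less energy than floating-point math on hardware that supports it. Two main approaches exist: post-training quantization (PTQ), applied to an already-trained model, and quantization-aware training (QAT), which simulates the precision loss during training so the model learns to compensate and usually retains more accuracy at low bit widths.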
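The mapping through a scaling factor can be illustrated with a minimal sketch of symmetric, per-tensor INT8 quantization. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, chosen only for illustration; production methods typically use per-channel or per-group scales and more careful calibration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # A single scale for the whole tensor; fall back to 1.0 if every value is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Divide by the scale, round to the nearest integer, and clamp to the INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats by multiplying each code by the scale."""
    return [c * scale for c in codes]


# Illustrative weights (assumed values, not taken from any real model).
weights = [0.02, -0.51, 0.33, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per value is bounded by half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Only the integer codes and the one scale need to be stored, which is where the size reduction comes from; the cost is the rounding error measured by `max_error`.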
As of 2025–2026, quantization is central to running large models on modest hardware. Formats such as GGUF (used by llama.cpp) and post-training methods like GPTQ and AWQ let models with tens of billions of parameters fit on a single consumer GPU or local AI accelerator, democratizing access to open-weight models.
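A back-of-the-envelope calculation shows why lower bit widths make such models fit. The sketch below assumes a hypothetical 30-billion-parameter model and counts weight storage only; it ignores the small overhead of scales and zero-points as well as the memory needed for activations and the KV cache. The helper name `weight_memory_gib` is an assumption for illustration.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical 30B-parameter model at three precisions.
NUM_PARAMS = 30e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights take roughly 56 GiB at FP16, 28 GiB at INT8, and 14 GiB at INT4, so only the 4-bit version leaves headroom on a 24 GB consumer GPU.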