— How to make models faster, smaller and maybe better

Quantization is the process of mapping a large set of continuous input values to a smaller, discrete set of output levels, effectively converting high-precision numbers (decimals) into low-precision numbers (integers) to save space and speed up the math. In practice, it usually means turning floating-point numbers (like 3.14) into integers (like 3), which makes computation faster and uses less memory.


We use it to make models smaller and faster by shrinking the precision of their weights and activations. There is a trade-off: you lose a bit of accuracy, but the model runs faster and takes up less space.

<aside> 💡

Imagine you have a high-definition photograph of a sunset. Every pixel can be one of 16 million colors. This is FP32 (Floating Point 32). Now, imagine you have to recreate that sunset using only a standard 64-pack of LEGO bricks.


You cannot represent every exact shade of orange. Instead, you must "bin" similar colors together. If a pixel is "Burnt Orange" or "Sunset Gold," you might map both to a single "Standard Orange" LEGO brick.

</aside>


To understand the flow, let's trace how a single neuron's data is transformed during quantization.


The input is typically a high-precision tensor of FP32 values (e.g., [0.1223, -0.5481, 0.9923]).

Inside the "Quantization Engine," the data undergoes a linear transformation (strictly, an affine map: a scale plus a shift). Since hardware like CPUs and GPUs can add and multiply integers much faster than decimals, we transform the FP32 values into INT8 (integers from -128 to 127), in four steps:

  1. Scaling: We find the maximum and minimum values in the input data and derive a scale factor $S$ that maps that range onto the integer range.
  2. Shifting: We calculate a "Zero Point" $Z$ to ensure the value $0.0$ in floating point maps exactly to an integer.
  3. Rounding: We "snap" the scaled decimals to the nearest whole integer.
  4. Clipping: Any value that still falls outside the -128 to 127 range is clamped to the nearest boundary.
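
Here is a minimal sketch of those four steps in NumPy. The function name `quantize_int8` and the min-max calibration recipe are illustrative choices, not any particular library's API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric min-max quantization of an FP32 tensor to INT8 (sketch)."""
    qmin, qmax = -128, 127

    # 1. Scaling: derive the scale factor S from the observed value range.
    #    (Assumes the tensor is not constant, i.e. r_max > r_min.)
    r_min, r_max = float(x.min()), float(x.max())
    scale = (r_max - r_min) / (qmax - qmin)

    # 2. Shifting: choose a zero-point Z so that 0.0 maps to an exact integer.
    zero_point = int(round(qmin - r_min / scale))

    # 3. Rounding: snap the scaled decimals to the nearest whole integer.
    q = np.round(x / scale + zero_point)

    # 4. Clipping: force anything outside [-128, 127] onto the boundary.
    return np.clip(q, qmin, qmax).astype(np.int8), scale, zero_point

q, scale, zero_point = quantize_int8(np.array([0.1223, -0.5481, 0.9923], dtype=np.float32))
print(q, scale, zero_point)  # [-17 -128 127], S ≈ 0.00604, Z = -37
```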

https://aifordevelopers.io/scale-and-zero-points-in-quantization/

$$ q = \text{round}\left( \frac{r}{S} + Z \right) $$

$$ \begin{aligned}
& r && \text{The original high-precision real value (FP32).} \\
& q && \text{The resulting low-precision quantized integer (e.g., INT8).} \\
& S && \text{The Scale factor (a positive FP32 number) that shrinks the range.} \\
& Z && \text{The Zero-point (an integer) representing the real-world } 0. \\
& \text{round} && \text{The function that snaps a decimal to the nearest whole number.}
\end{aligned} $$
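
Plugging the example tensor from earlier into these definitions makes the mechanics concrete (using the min-max recipe from the list above; values rounded for readability):

$$ S = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}} = \frac{0.9923 - (-0.5481)}{127 - (-128)} \approx 0.00604, \qquad Z = \text{round}\!\left(q_{\min} - \frac{r_{\min}}{S}\right) = -37 $$

$$ \text{round}\!\left(\frac{[0.1223,\ -0.5481,\ 0.9923]}{S} + Z\right) = [-17,\ -128,\ 127] $$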

The output is an INT8 tensor. When this tensor needs to interact with other layers, it can be "dequantized" back to a floating-point approximation, or the entire math operation (multiplication/addition) can happen purely in integer space using the hardware's native integer instructions.
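
Dequantization simply inverts the affine mapping from the formula above: $r \approx S\,(q - Z)$. A one-line companion to the earlier sketch (same illustrative names):

```python
import numpy as np

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Invert the mapping q = round(r/S + Z): r ≈ S * (q - Z).
    # The result is only an approximation of the original FP32 tensor,
    # since rounding and clipping discarded information.
    return scale * (q.astype(np.float32) - zero_point)
```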


Why quantize?