You might think quantizing a model from FP16 to INT8 is as simple as converting all the weights to 8-bit format. That’s the trap — and it’s exactly why so many quantization attempts obliterate model quality.
The real problem isn’t “reduced precision.” It’s outliers: those rare activation values that are 100× larger than the rest. When you try to fit them into the same quantization range, everything else gets crushed. Think of it like measuring a skyscraper and a house with the same ruler: tick marks coarse enough to span the skyscraper can’t resolve the house at all.
What actually happens
- FP16 weights get scaled to fit into [-127, 127] and stored as INT8.
- Outlier activations hijack the scaling factor.
- Normal activations — which make up 99.99% of the data — collapse into just a few quantized values.
The math is brutal:
Quantization range: [-127, 127] → 255 discrete values
One outlier = 50 → scale = 50 / 127 ≈ 0.39
Normal value = 0.3 → round(0.3 / 0.39) = 1
Normal values land on only ~3 of the 255 levels → ~99% of the range goes unused
You’re using a bathroom scale to weigh an ant and an elephant together.
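To see the collapse concretely, here is a minimal NumPy sketch of the same arithmetic: synthetic activations, one injected outlier, symmetric per-tensor quantization. The specific values are illustrative, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.uniform(-0.5, 0.5, size=10_000)   # "normal" activations
acts[0] = 50.0                               # one outlier hijacks the scale

# Symmetric per-tensor quantization to [-127, 127]
scale = np.abs(acts).max() / 127             # 50 / 127 ≈ 0.39
q = np.clip(np.round(acts / scale), -127, 127).astype(np.int8)
deq = q.astype(np.float32) * scale

print(f"scale = {scale:.3f}")
print(f"integer levels actually used: {np.unique(q).size} of 255")   # ~4, one of them the outlier
print(f"mean abs error on normal values: {np.abs(deq[1:] - acts[1:]).mean():.3f}")
```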
And here’s the key insight: outliers are dimension-specific, not token-specific. Feature dimension #2145 might always produce huge activations (±40 to ±60), while dimension #891 stays tiny (±0.2 to ±0.5). This pattern holds across all prompts and batches — it’s baked into the model’s structure.
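Because the pattern is structural, you can spot these dimensions with a tiny calibration pass: just look at the per-dimension maximum absolute activation. Here is a sketch in NumPy with made-up shapes and dimension indices; the 6.0 magnitude threshold mirrors the cutoff used in the LLM.int8() paper.

```python
import numpy as np

def find_outlier_dims(activations: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """activations: (num_tokens, hidden_dim). Returns indices of outlier feature dims."""
    per_dim_max = np.abs(activations).max(axis=0)   # max |activation| per feature dimension
    return np.flatnonzero(per_dim_max > threshold)

# Toy example: column 3 plays the role of the "±40 to ±60" dimension described above.
rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=(128, 8))
x[:, 3] *= 100.0                                    # systematic outlier dimension
print(find_outlier_dims(x))                         # -> [3]
```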
Why naive quantization fails
- One uniform scale per tensor.
- All weights treated equally.
- Result: model perplexity jumps from 12.3 → 2,847.
The fix: mixed-precision quantization (LLM.int8())
Instead of crushing everything under a single scale, LLM.int8() recognizes and isolates the outliers. The recipe is simple (a code sketch follows the list):
- Identify the top 0.5% of outlier feature dimensions.
- Keep those in FP16.
- Quantize the remaining 99.5% to INT8.
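Here is what that decomposition looks like in plain NumPy. This is a sketch of the idea, not the bitsandbytes kernel; the function and variable names are mine, not library API.

```python
import numpy as np

def quantize_rowwise(x: np.ndarray):
    """Symmetric row-wise INT8 quantization; returns (int8 values, per-row scales)."""
    scales = np.abs(x).max(axis=1, keepdims=True) / 127 + 1e-12
    return np.round(x / scales).astype(np.int8), scales

def mixed_precision_matmul(x: np.ndarray, w: np.ndarray, outlier_dims: np.ndarray):
    """x: (tokens, hidden), w: (hidden, out). Outlier feature dims stay in high precision."""
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[outlier_dims] = True

    # High-precision path: the few outlier feature dimensions.
    hi = x[:, mask] @ w[mask, :]

    # INT8 path: everything else, quantized per row of x and per output column of w.
    xq, sx = quantize_rowwise(x[:, ~mask])
    wq, sw = quantize_rowwise(w[~mask, :].T)
    lo = (xq.astype(np.int32) @ wq.astype(np.int32).T) * (sx * sw.T)

    return hi + lo
```

Because the outlier slice is only ~0.5% of columns, almost all of the work stays in INT8.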
The results are dramatic:
- Model perplexity: 12.3 → 12.6 (basically unchanged)
- Weight memory traffic: 140 GB → 70 GB per forward pass
- Throughput: up 1.8×
- Model quality: ≈ 98% preserved
- A 70B model that once needed 4× A100s now fits on 2× A100s
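A quick sanity check on those memory numbers, counting weights only and assuming 40 GB A100s (the post doesn’t say which variant), ignoring activations and KV cache:

```python
# Back-of-the-envelope: bytes of weight storage only.
params = 70e9
fp16_gb = params * 2 / 1e9    # 140.0 GB at 2 bytes/param
int8_gb = params * 1 / 1e9    #  70.0 GB at 1 byte/param (outlier columns add a little FP16)
print(fp16_gb / 40, int8_gb / 40)   # 3.5 and 1.75 -> roughly 4 GPUs vs 2 GPUs
```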
The difference isn’t about “using 8 bits instead of 16.” It’s about respecting outliers — and designing your quantization around them.
Further insights
Modern quantization methods go even further. They use per-channel scaling and dynamic calibration to ensure each dimension or layer gets its own range, minimizing information loss. Some implementations (like bitsandbytes) integrate this logic directly into matrix multiplication kernels — avoiding costly dequantization and improving efficiency end-to-end.
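To make the per-channel point concrete, here is a toy comparison of one scale per tensor versus one scale per output column; the shapes and magnitudes are invented for illustration, and this is not the bitsandbytes implementation.

```python
import numpy as np

def quantize_per_tensor(w):
    s = np.abs(w).max() / 127                       # one scale for the whole matrix
    return np.round(w / s) * s                      # quantize, then dequantize

def quantize_per_channel(w):
    s = np.abs(w).max(axis=0, keepdims=True) / 127  # one scale per output column
    return np.round(w / s) * s

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(512, 512))
w[:, 7] *= 50.0                                     # one column with much larger weights

err_tensor = np.abs(quantize_per_tensor(w) - w).mean()
err_channel = np.abs(quantize_per_channel(w) - w).mean()
print(err_tensor, err_channel)                      # per-channel error is far smaller
```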
In short: quantization isn’t just a compression trick — it’s a test of how well you understand your model’s activation landscape. Handle outliers correctly, and you get performance and speed. Ignore them, and your model collapses under its own extremes.
References
- Dettmers et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
- Tim Dettmers’ original blog post explaining outlier-aware quantization.
- Community benchmarks on bitsandbytes and mixed-precision inference for large models.
Author’s note: This post was refined with help from large language models to clarify intuition and keep the math honest.