
Why Naive Quantization Destroys Model Quality (and How llm.int8 Fixes It)

Quantization isn’t just about smaller numbers — it’s about how you handle the rare, dangerous ones.

You might think quantizing a model from FP16 to INT8 is as simple as converting all the weights to 8-bit format. That’s the trap — and it’s exactly why so many quantization attempts obliterate model quality.

The real problem isn’t “reduced precision.” It’s outliers: those rare activation values that are 100× larger than the rest. When you try to fit them into the same quantization range, everything else gets crushed. Think of it like measuring a skyscraper and a house with the same ruler.

What actually happens

The math is brutal:

Quantization range: [-127, 127] → 255 discrete levels
One outlier = 50 → scale = 50 / 127 ≈ 0.39
Normal value = 0.3 → round(0.3 / 0.39) = 1
Every normal value lands on one of just three levels (-1, 0, +1) → over 98% of the representable range goes unused
You’re using a bathroom scale to weigh an ant and an elephant together.
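
To make the arithmetic concrete, here is a small NumPy sketch of naive absmax quantization. The activation vector is made up to mirror the numbers above (ordinary values around 0.3, one outlier at 50); only the mechanics are meant to be taken literally.

import numpy as np

# Toy activation row: mostly small values plus one outlier (hypothetical data).
x = np.array([0.3, -0.4, 0.25, 0.1, -0.2, 50.0], dtype=np.float32)

# Naive absmax quantization: one scale for the entire tensor.
scale = np.abs(x).max() / 127.0            # 50 / 127 ≈ 0.39
q = np.round(x / scale).astype(np.int8)    # -> [  1  -1   1   0  -1 127]
x_hat = q.astype(np.float32) * scale       # dequantize

print("scale:", scale)
print("quantized:", q)
print("reconstructed:", x_hat)             # small values come back as 0.0 or ±0.39
print("error on the small values:", np.abs(x_hat[:-1] - x[:-1]))

The outlier is represented perfectly; everything else has been flattened onto three levels.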

And here’s the key insight: outliers are dimension-specific, not token-specific. Feature dimension #2145 might always produce huge activations (±40 to ±60), while dimension #891 stays tiny (±0.2 to ±0.5). This pattern holds across all prompts and batches — it’s baked into the model’s structure.
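
A quick way to see this structure is to look at per-dimension statistics. The sketch below uses synthetic activations (the shapes, the choice of dimension 2145, and the injected outlier are all hypothetical), but the measurement itself, the maximum absolute activation per feature dimension compared against a magnitude threshold, is how outlier dimensions are commonly flagged; 6.0 is the default threshold used in the LLM.int8() paper.

import numpy as np

# Hypothetical activations: (tokens, hidden_dim), mostly small, with one
# feature dimension scaled up to mimic a persistent outlier dimension.
rng = np.random.default_rng(0)
acts = rng.normal(scale=0.3, size=(512, 4096)).astype(np.float32)
acts[:, 2145] *= 150.0   # dimension 2145 now produces much larger activations than the rest

# Per-dimension statistic: max |activation| over every token in the batch.
per_dim_max = np.abs(acts).max(axis=0)

# Flag dimensions whose magnitude exceeds the threshold.
threshold = 6.0
outlier_dims = np.where(per_dim_max > threshold)[0]
print("outlier dimensions:", outlier_dims)                       # -> [2145]
print("fraction of dimensions:", outlier_dims.size / acts.shape[1])

Because the outliers are tied to fixed dimensions, the same set shows up batch after batch, which is what makes it practical to treat them separately.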

Why naive quantization fails

Because one per-tensor scale has to stretch far enough to cover the outlier dimensions, nearly the entire 8-bit range is reserved for values that almost never occur. The ordinary activations that carry most of the signal get rounded onto a handful of levels, which is exactly the failure mode the arithmetic above demonstrates.

The fix: mixed precision quantization (llm.int8)

Instead of crushing everything under a single scale, llm.int8 recognizes and isolates the outliers. The recipe is simple (a code sketch follows the list):

  1. Identify the outlier feature dimensions by an activation-magnitude threshold; they are a tiny fraction of all dimensions, typically well under 1%.
  2. Keep those in FP16.
  3. Quantize the rest, the overwhelming majority of values, to INT8.
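
Here is a minimal NumPy sketch of that decomposition, a readable reference rather than the fused CUDA kernels that bitsandbytes actually ships. The function name, shapes, and test data are invented for illustration; the 6.0 threshold is the paper's default for flagging outlier columns.

import numpy as np

def mixed_precision_matmul(x, w, threshold=6.0):
    """Compute x @ w with outlier feature columns in FP16 and the rest in INT8."""
    # 1. Flag outlier feature dimensions by activation magnitude.
    outlier = np.abs(x).max(axis=0) > threshold          # (hidden,) boolean mask
    regular = ~outlier

    # 2. Outlier part: plain FP16 matmul on the few outlier columns.
    out_fp16 = x[:, outlier].astype(np.float16) @ w[outlier, :].astype(np.float16)

    # 3. Regular part: symmetric INT8 (row-wise scales for x, column-wise for w),
    #    accumulated in INT32 and rescaled back to floating point.
    x_r, w_r = x[:, regular], w[regular, :]
    sx = np.abs(x_r).max(axis=1, keepdims=True) / 127.0 + 1e-12
    sw = np.abs(w_r).max(axis=0, keepdims=True) / 127.0 + 1e-12
    xq = np.round(x_r / sx).astype(np.int8)
    wq = np.round(w_r / sw).astype(np.int8)
    out_int8 = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)

    # 4. Sum the two partial results.
    return out_int8.astype(np.float32) + out_fp16.astype(np.float32)

# Hypothetical usage: a single outlier dimension injected by hand.
rng = np.random.default_rng(0)
x = rng.normal(scale=0.3, size=(4, 64)).astype(np.float32)
x[:, 7] += 50.0
w = rng.normal(scale=0.05, size=(64, 32)).astype(np.float32)
print("max |error| vs FP32 matmul:",
      np.abs(mixed_precision_matmul(x, w) - x @ w).max())

The FP16 path carries the few columns that would otherwise dictate the scale, so the INT8 path can use a tight range for everything else.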

The results are dramatic: in the original paper (Dettmers et al., 2022), this mixed-precision scheme matches full FP16 quality on models up to 175B parameters, while naive 8-bit quantization degrades sharply once models are large enough (around 6.7B parameters) for outlier features to emerge.

The difference isn’t about “using 8 bits instead of 16.” It’s about respecting outliers — and designing your quantization around them.

Further insights

Modern quantization methods go even further. They use per-channel scaling and dynamic calibration to ensure each dimension or layer gets its own range, minimizing information loss. Some implementations (like bitsandbytes) integrate this logic directly into matrix multiplication kernels — avoiding costly dequantization and improving efficiency end-to-end.
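
As a rough illustration of the per-channel idea, the sketch below quantizes a synthetic weight matrix whose columns have very different ranges, once with a single per-tensor scale and once with one scale per column, and compares the average reconstruction error. All numbers are made up; this is not a benchmark of any particular library.

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical weights: eight output channels with wildly different magnitudes.
channel_scales = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 8.0], dtype=np.float32)
w = rng.normal(size=(256, 8)).astype(np.float32) * channel_scales

def quant_dequant(w, scale):
    # Symmetric INT8 round-trip with the given scale(s).
    return np.clip(np.round(w / scale), -127, 127) * scale

# Per-tensor: one scale for the whole matrix (the largest channel dictates it).
s_tensor = np.abs(w).max() / 127.0
err_tensor = np.abs(quant_dequant(w, s_tensor) - w).mean()

# Per-channel: one scale per output column.
s_channel = np.abs(w).max(axis=0, keepdims=True) / 127.0
err_channel = np.abs(quant_dequant(w, s_channel) - w).mean()

print(f"mean |error|, per-tensor scale:  {err_tensor:.6f}")
print(f"mean |error|, per-channel scale: {err_channel:.6f}")   # noticeably smaller

The small channels stop paying for the large one's range, which is the same principle as isolating outlier dimensions, applied at a finer granularity.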

In short: quantization isn’t just a compression trick — it’s a test of how well you understand your model’s activation landscape. Handle outliers correctly, and you get performance and speed. Ignore them, and your model collapses under its own extremes.

References

Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Advances in Neural Information Processing Systems (NeurIPS 2022). arXiv:2208.07339.

Author’s note: This post was refined with help from large language models to clarify intuition and keep the math honest.