You might think quantizing a model from FP16 to INT8 is as simple as converting all the weights to 8-bit format. That’s the trap — and it’s exactly why so many quantization attempts obliterate model quality.
The real problem isn’t “reduced precision.” It’s outliers: those rare activation values that are 100× larger than the rest. When you try to fit them into the same quantization range, everything else gets crushed. Think of it like measuring a skyscraper and a house with the same ruler: tick marks coarse enough to span the skyscraper can’t resolve the house at all.
What actually happens
- FP16 weights get scaled to fit into [-127, 127] and stored as INT8.
- Outlier activations hijack the scaling factor.
- Normal activations — which make up 99.99% of the data — collapse into just a few quantized values.
The math is brutal:
Quantization range: [-127, 127] → 255 discrete values
One outlier = 50 → scale = 50 / 127 ≈ 0.39
Normal value = 0.3 → round(0.3 / 0.39) = 1
Normal values land on only ~3 of the 255 levels → ~99% of the range goes unused
You’re using a bathroom scale to weigh an ant and an elephant together.
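To see the collapse concretely, here is a minimal NumPy sketch of the same arithmetic: synthetic activations, one injected outlier, symmetric per-tensor quantization. The specific values are illustrative, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.uniform(-0.5, 0.5, size=10_000)   # "normal" activations
acts[0] = 50.0                               # one outlier hijacks the scale

# Symmetric per-tensor quantization to [-127, 127]
scale = np.abs(acts).max() / 127             # 50 / 127 ≈ 0.39
q = np.clip(np.round(acts / scale), -127, 127).astype(np.int8)
deq = q.astype(np.float32) * scale

print(f"scale = {scale:.3f}")
print(f"integer levels actually used: {np.unique(q).size} of 255")   # ~4, one of them the outlier
print(f"mean abs error on normal values: {np.abs(deq[1:] - acts[1:]).mean():.3f}")
```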
And here’s the key insight: outliers are dimension-specific, not token-specific. Feature dimension #2145 might always produce huge activations (±40 to ±60), while dimension #891 stays tiny (±0.2 to ±0.5). This pattern holds across all prompts and batches — it’s baked into the model’s structure.
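Because the pattern is structural, you can spot these dimensions with a tiny calibration pass: just look at the per-dimension maximum absolute activation. Here is a sketch in NumPy with made-up shapes and dimension indices; the 6.0 magnitude threshold mirrors the cutoff used in the LLM.int8() paper.

```python
import numpy as np

def find_outlier_dims(activations: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """activations: (num_tokens, hidden_dim). Returns indices of outlier feature dims."""
    per_dim_max = np.abs(activations).max(axis=0)   # max |activation| per feature dimension
    return np.flatnonzero(per_dim_max > threshold)

# Toy example: column 3 plays the role of the "±40 to ±60" dimension described above.
rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=(128, 8))
x[:, 3] *= 100.0                                    # systematic outlier dimension
print(find_outlier_dims(x))                         # -> [3]
```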
Why naive quantization fails
- One uniform scale per tensor.
- All weights treated equally.
- Result: model perplexity jumps from 12.3 → 2,847.
The fix: mixed-precision quantization (LLM.int8())
Instead of crushing everything under a single scale, LLM.int8() recognizes and isolates the outliers. The recipe is simple (a code sketch follows the list):
- Identify the top 0.5% of outlier feature dimensions.
- Keep those in FP16.
- Quantize the remaining 99.5% to INT8.
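Here is what that decomposition looks like in plain NumPy. This is a sketch of the idea, not the bitsandbytes kernel; the function and variable names are mine, not library API.

```python
import numpy as np

def quantize_rowwise(x: np.ndarray):
    """Symmetric row-wise INT8 quantization; returns (int8 values, per-row scales)."""
    scales = np.abs(x).max(axis=1, keepdims=True) / 127 + 1e-12
    return np.round(x / scales).astype(np.int8), scales

def mixed_precision_matmul(x: np.ndarray, w: np.ndarray, outlier_dims: np.ndarray):
    """x: (tokens, hidden), w: (hidden, out). Outlier feature dims stay in high precision."""
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[outlier_dims] = True

    # High-precision path: the few outlier feature dimensions.
    hi = x[:, mask] @ w[mask, :]

    # INT8 path: everything else, quantized per row of x and per output column of w.
    xq, sx = quantize_rowwise(x[:, ~mask])
    wq, sw = quantize_rowwise(w[~mask, :].T)
    lo = (xq.astype(np.int32) @ wq.astype(np.int32).T) * (sx * sw.T)

    return hi + lo
```

Because the outlier slice is only ~0.5% of columns, almost all of the work stays in INT8.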
The results are dramatic:
- Model perplexity: 12.3 → 12.6 (basically unchanged)
- Weight memory traffic: 140 GB → 70 GB per forward pass
- Throughput: up 1.8×
- Model quality: ≈ 98% preserved
- A 70B model that once needed 4× A100s now fits on 2× A100s
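A quick sanity check on those memory numbers, counting weights only and assuming 40 GB A100s (the post doesn’t say which variant), ignoring activations and KV cache:

```python
# Back-of-the-envelope: bytes of weight storage only.
params = 70e9
fp16_gb = params * 2 / 1e9    # 140.0 GB at 2 bytes/param
int8_gb = params * 1 / 1e9    #  70.0 GB at 1 byte/param (outlier columns add a little FP16)
print(fp16_gb / 40, int8_gb / 40)   # 3.5 and 1.75 -> roughly 4 GPUs vs 2 GPUs
```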
The difference isn’t about “using 8 bits instead of 16.” It’s about respecting outliers — and designing your quantization around them.
Further insights
Modern quantization methods go even further. They use per-channel scaling and dynamic calibration to ensure each dimension or layer gets its own range, minimizing information loss. Some implementations (like bitsandbytes) integrate this logic directly into matrix multiplication kernels — avoiding costly dequantization and improving efficiency end-to-end.
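To make the per-channel point concrete, here is a toy comparison of one scale per tensor versus one scale per output column; the shapes and magnitudes are invented for illustration, and this is not the bitsandbytes implementation.

```python
import numpy as np

def quantize_per_tensor(w):
    s = np.abs(w).max() / 127                       # one scale for the whole matrix
    return np.round(w / s) * s                      # quantize, then dequantize

def quantize_per_channel(w):
    s = np.abs(w).max(axis=0, keepdims=True) / 127  # one scale per output column
    return np.round(w / s) * s

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(512, 512))
w[:, 7] *= 50.0                                     # one column with much larger weights

err_tensor = np.abs(quantize_per_tensor(w) - w).mean()
err_channel = np.abs(quantize_per_channel(w) - w).mean()
print(err_tensor, err_channel)                      # per-channel error is far smaller
```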
In short: quantization isn’t just a compression trick — it’s a test of how well you understand your model’s activation landscape. Handle outliers correctly, and you get performance and speed. Ignore them, and your model collapses under its own extremes.
References
- Dettmers et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
- Tim Dettmers’ original blog post explaining outlier-aware quantization.
- Community benchmarks on bitsandbytes and mixed-precision inference for large models.
Author’s note: This post was refined with help from large language models to clarify intuition and keep the math honest.