Benchmark Overview
Quantisation saves VRAM and increases throughput, but at what cost to output quality? Marketing claims of “lossless quantisation” rarely hold across all tasks. We measured actual quality degradation on reasoning, coding, creative writing, and factual accuracy benchmarks comparing FP16, INT8, and INT4 precision on dedicated GPU hosting.
Test Configuration
Model: Llama 3 70B. Precisions: FP16 (baseline), INT8 (GPTQ), INT4 (AWQ). GPU: RTX 6000 Pro 96 GB via vLLM. Benchmarks: MMLU (knowledge), HumanEval (coding), MT-Bench (conversation quality), GSM8K (mathematical reasoning), TruthfulQA (factual accuracy). All benchmarks run with greedy decoding (temperature=0) for reproducibility. See token benchmarks for speed data.
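As a deployment sketch, the configurations above can be served with vLLM's CLI. The FP16 model ID below is the public Meta checkpoint; the AWQ checkpoint name is a placeholder for whichever quantised build you use:

```shell
# FP16 baseline — 70B weights alone are ~140 GB, so shard across GPUs
vllm serve meta-llama/Meta-Llama-3-70B-Instruct --dtype float16 --tensor-parallel-size 2

# INT4 AWQ variant (checkpoint name is illustrative, not a real repo)
vllm serve your-org/Llama-3-70B-Instruct-AWQ --quantization awq
```

Greedy decoding is then requested per-call (temperature=0), matching the benchmark setup.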
Quality Benchmark Results
| Benchmark | FP16 (Baseline) | INT8 | INT8 Loss (relative) | INT4 (AWQ) | INT4 Loss (relative) |
|---|---|---|---|---|---|
| MMLU (5-shot) | 79.2% | 79.0% | -0.3% | 78.1% | -1.4% |
| HumanEval (pass@1) | 72.0% | 71.3% | -1.0% | 69.5% | -3.5% |
| MT-Bench (avg) | 8.42 | 8.38 | -0.5% | 8.25 | -2.0% |
| GSM8K | 82.5% | 81.8% | -0.8% | 79.2% | -4.0% |
| TruthfulQA | 58.3% | 57.9% | -0.7% | 56.8% | -2.6% |
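The loss columns are relative to the FP16 baseline, not absolute score differences. A quick sanity check that recomputes them from the raw scores in the table:

```python
# Raw benchmark scores from the table above (MT-Bench is a 1-10 rating).
fp16 = {"MMLU": 79.2, "HumanEval": 72.0, "MT-Bench": 8.42, "GSM8K": 82.5, "TruthfulQA": 58.3}
int8 = {"MMLU": 79.0, "HumanEval": 71.3, "MT-Bench": 8.38, "GSM8K": 81.8, "TruthfulQA": 57.9}
int4 = {"MMLU": 78.1, "HumanEval": 69.5, "MT-Bench": 8.25, "GSM8K": 79.2, "TruthfulQA": 56.8}

def rel_loss(base: float, quant: float) -> float:
    """Relative degradation versus the FP16 baseline, as a percentage."""
    return round(100 * (base - quant) / base, 1)

for task in fp16:
    print(f"{task}: INT8 -{rel_loss(fp16[task], int8[task])}%, "
          f"INT4 -{rel_loss(fp16[task], int4[task])}%")
```

Running this reproduces every loss column entry, e.g. GSM8K at INT4: 100 × (82.5 − 79.2) / 82.5 = 4.0%.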
Task-Specific Quality Analysis
INT8 quantisation is nearly lossless across all tasks: the largest observed degradation is 1.0% relative, on HumanEval (coding). For practical purposes, INT8 is indistinguishable from FP16 in production applications.
INT4 shows meaningful degradation on mathematical reasoning (GSM8K: -4.0%) and coding (HumanEval: -3.5%). General knowledge and conversation quality suffer less (MMLU: -1.4%, MT-Bench: -2.0%). The pattern is consistent: tasks requiring precise logical chains degrade more than tasks relying on broad knowledge retrieval. See GPU selection for VRAM-precision trade-offs.
Perplexity Measurements
| Precision | WikiText Perplexity | Code Perplexity | Conversation Perplexity |
|---|---|---|---|
| FP16 | 5.42 | 3.18 | 6.85 |
| INT8 | 5.45 | 3.22 | 6.90 |
| INT4 (AWQ) | 5.68 | 3.45 | 7.15 |
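Perplexity here is the standard definition: the exponential of the mean per-token negative log-likelihood, so lower is better and small deltas mean the quantised model assigns nearly the same probabilities as the baseline. A toy illustration (the token probabilities are made up for demonstration):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over a token sequence."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A slightly less confident model (lower per-token probabilities)
# yields a slightly higher perplexity, as in the INT4 rows above.
baseline_probs  = [0.30, 0.25, 0.40, 0.20]
quantised_probs = [0.28, 0.24, 0.38, 0.19]
print(perplexity(baseline_probs))
print(perplexity(quantised_probs))
```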
When Quality Loss Matters
Accept INT4: General chatbots, customer support, content generation, summarisation, and information retrieval. The 1-2% relative quality loss on knowledge and conversation tasks is unnoticeable to users. Deploy via the vLLM production guide and explore LLM hosting patterns.
Prefer INT8 or FP16: Mathematical reasoning pipelines, code generation tools, medical or legal applications where precision matters, and benchmark-critical deployments. Use multi-GPU clusters for FP16 70B inference on private AI hosting.
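The multi-GPU requirement for FP16 follows from weight storage alone (KV cache and activations add further overhead). A back-of-envelope check, assuming ~70B parameters:

```python
PARAMS = 70e9  # approximate Llama 3 70B weight count
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for name, b in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * b / 1e9
    saving = 1 - b / BYTES_PER_PARAM["FP16"]
    print(f"{name}: ~{weights_gb:.0f} GB weights, {saving:.0%} saved vs FP16")
```

FP16 weights come to ~140 GB, exceeding a single 96 GB card, while INT4 weights (~35 GB) fit comfortably with room for KV cache: the 75% VRAM saving cited below.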
Recommendations
Default to INT4 AWQ for production inference. The VRAM savings (75% reduction) and throughput gains (1.5-2x) far outweigh the marginal quality loss for most applications. Use INT8 or FP16 only for precision-critical tasks. Deploy on GigaGPU dedicated servers. Visit the benchmarks section and infrastructure blog for more data.