FP8 quantisation on Blackwell hardware offers up to twice the tensor-core throughput of FP16, which works out to roughly a 50% end-to-end uplift in practice. The question is whether the quality cost is real. This page is the actual measurement across five production models.
Across Llama 3.1 8B, Mistral 7B, Qwen 2.5 14B, Phi-3 Medium, and Gemma 2 9B: FP8 (E4M3) loses 0.6-1.2% relative to FP16 on standard benchmarks. Negligible for most production workloads. Use FP8 by default on Blackwell.
Methodology
- Benchmarks: MMLU, MATH, HumanEval, MMLU-Pro, GSM8K
- FP8 mode: dynamic E4M3 via vLLM (a load-time sketch follows this list)
- 3 random seeds per model per precision
- RTX 5090 32 GB, vLLM 0.6.3
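Enabling vLLM's dynamic FP8 path is a one-line change at model load. A minimal sketch, assuming vLLM with a CUDA GPU; the model name and prompt are placeholders, not the exact setup used for these runs:

```python
from vllm import LLM, SamplingParams

# quantization="fp8" selects vLLM's dynamic FP8 (E4M3) path:
# weights are quantised at load time, activation scales are computed on the fly.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    quantization="fp8",
    max_model_len=4096,
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain FP8 E4M3 in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same backend can be driven from standard eval harnesses; lm-evaluation-harness, for example, accepts `--model vllm --model_args pretrained=<model>,quantization=fp8`. Treat that as an illustration: this page does not specify which harness produced the scores below.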
Results across five models
| Model | FP16 mean | FP8 mean | Δ (relative) |
|---|---|---|---|
| Llama 3.1 8B Instruct | 63.2 | 62.8 | -0.6% |
| Mistral 7B v0.3 | 60.1 | 59.7 | -0.7% |
| Qwen 2.5 14B | 69.4 | 68.9 | -0.7% |
| Phi-3 Medium | 64.8 | 64.0 | -1.2% |
| Gemma 2 9B | 61.3 | 60.9 | -0.7% |
Mean score across MMLU, MATH, HumanEval, MMLU-Pro, GSM8K; Δ is the relative change from the FP16 score.
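The Δ column is easy to sanity-check. A short script that reproduces it from the table's own numbers (the score pairs are copied from above; nothing else is assumed):

```python
# Recompute the relative deltas from the FP16/FP8 means in the table above.
scores = {
    "Llama 3.1 8B Instruct": (63.2, 62.8),
    "Mistral 7B v0.3": (60.1, 59.7),
    "Qwen 2.5 14B": (69.4, 68.9),
    "Phi-3 Medium": (64.8, 64.0),
    "Gemma 2 9B": (61.3, 60.9),
}

for model, (fp16, fp8) in scores.items():
    delta = (fp8 - fp16) / fp16 * 100  # relative change, in percent
    print(f"{model}: {delta:+.1f}%")
```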
Where FP8 matters most
- High-volume chatbots: ~50% throughput uplift for a <1% quality drop. Free win. (A minimal A/B harness follows this list.)
- Latency-sensitive single-stream: the prefill speedup carries over directly, so TTFT (time to first token) drops.
- Multi-model deployments: FP8 weights take half the VRAM of FP16, so you can pack more models onto one GPU.
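A minimal way to measure the uplift on your own workload, assuming vLLM on a Blackwell GPU. The model and prompts are placeholders; run each precision in its own process, since vLLM holds the GPU for the life of the process:

```python
import sys
import time

from vllm import LLM, SamplingParams

# Usage: python bench.py fp8|fp16
precision = sys.argv[1]

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    quantization="fp8" if precision == "fp8" else None,
)

prompts = ["Summarise the plot of Hamlet."] * 256  # stand-in batch workload
params = SamplingParams(temperature=0.8, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{precision}: {generated / elapsed:.0f} output tokens/s")
```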
Where FP8 matters less
- Hardest reasoning tasks (MATH-hard, ARC-hard) — quality drop can reach 2-3%
- Frontier-quality benchmarks where the last 1% matters
Verdict
For 95% of production deployments, FP8 is the right default on Blackwell. The quality cost is real but small; the throughput gain is large. Skip FP8 only when publishing benchmarks or running quality-critical reasoning workloads.
Bottom line
Use FP8 by default. See cost per 1M tokens for the throughput-side benefit.
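The back-of-envelope version of that arithmetic, for reference. The GPU price and throughput figures below are hypothetical placeholders, not measurements from this page:

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost per 1M output tokens for a fully utilised GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

# Hypothetical numbers: a $1.50/hr GPU, with FP8 lifting 4,000 tok/s to 6,000 tok/s.
print(cost_per_million_tokens(1.50, 4000))  # FP16: ~$0.104 per 1M tokens
print(cost_per_million_tokens(1.50, 6000))  # FP8:  ~$0.069 per 1M tokens
```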