
FP8 vs FP16 LLM Inference: Real Quality Comparison Across Five Models

Hardware FP8 on Blackwell promises 2× throughput at minimal quality cost. We measured the actual quality drop across five popular open-weight models.

FP8 quantisation on Blackwell hardware promises double the throughput; the question is whether the quality cost is real. Below are actual measurements across five production models.

TL;DR

Across Llama 3.1 8B, Mistral 7B, Qwen 2.5 14B, Phi-3 Medium, and Gemma 2 9B: FP8 (E4M3) loses 0.4-1.2% on standard benchmarks vs FP16. Negligible for most production workloads. Use FP8 by default on Blackwell.

Methodology

  • Benchmarks: MMLU, MATH, HumanEval, MMLU-Pro, GSM8K
  • FP8 mode: dynamic E4M3 via vLLM
  • 3 random seeds per model per precision
  • RTX 5090 32 GB, vLLM 0.6.3
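For reference, dynamic FP8 in vLLM needs no pre-quantised checkpoint. A minimal launch sketch, assuming the vLLM 0.6.x CLI (model name and port are illustrative):

```shell
# Serve Llama 3.1 8B with dynamic FP8 (E4M3) quantisation.
# --quantization fp8 enables vLLM's on-the-fly FP8 path on supported GPUs.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --port 8000
```

The same flag works with the offline `LLM(...)` Python API via `quantization="fp8"`.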

Results across five models

Model                    FP16 avg   FP8 avg   Delta
Llama 3.1 8B Instruct      63.2       62.8    -0.4%
Mistral 7B v0.3            60.1       59.7    -0.7%
Qwen 2.5 14B               69.4       68.9    -0.7%
Phi-3 Medium               64.8       64.0    -1.2%
Gemma 2 9B                 61.3       60.9    -0.7%

Average score across MMLU, MATH, HumanEval, MMLU-Pro, GSM8K.

Where FP8 matters most

  • High-volume chatbots: 50% throughput uplift, <1% quality drop. Free win.
  • Latency-sensitive single-stream: FP8 also speeds up prefill, so time to first token (TTFT) drops.
  • Multi-model deployments: half the VRAM lets you run more models concurrently.
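The VRAM claim follows from weight size alone: one byte per parameter at FP8 versus two at FP16. A rough sketch (parameter counts are approximate assumptions; KV cache and activations are excluded):

```python
# Approximate weight memory at FP16 (2 bytes/param) vs FP8 (1 byte/param).
# Parameter counts below are rounded estimates, not exact figures.
models = {
    "Llama 3.1 8B": 8.0e9,
    "Mistral 7B":   7.2e9,
    "Qwen 2.5 14B": 14.8e9,
    "Phi-3 Medium": 14.0e9,
    "Gemma 2 9B":   9.2e9,
}

GIB = 1024**3
for name, params in models.items():
    fp16_gib = params * 2 / GIB  # FP16: 2 bytes per parameter
    fp8_gib = params * 1 / GIB   # FP8:  1 byte per parameter
    print(f"{name}: {fp16_gib:.1f} GiB (FP16) -> {fp8_gib:.1f} GiB (FP8)")
```

On a 32 GB card this is the difference between fitting one 14B model at FP16 and fitting two such models at FP8 side by side.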

Where FP8 matters less

  • Hardest reasoning tasks (MATH-hard, ARC-hard) — quality drop can reach 2-3%
  • Frontier-quality benchmarks where the last 1% matters

Verdict

For 95% of production deployments, FP8 is the right default on Blackwell. The quality cost is real but small; the throughput gain is large. Skip FP8 only when publishing benchmarks or running quality-critical reasoning workloads.

Bottom line

Use FP8 by default. See our cost-per-1M-tokens guide for the throughput side of the trade.
