vLLM's `--kv-cache-dtype fp8_e5m2` halves KV cache memory at the cost of some numerical precision in attention. The internet has opinions; here are real measurements on standard benchmarks.
FP8 KV cache vs FP16 KV cache, on Llama 3.1 8B FP8 weights: MMLU drop ~0.3%, HumanEval drop ~0.5%, GSM8K drop ~0.8%, MT-Bench drop ~0.1 (out of 10). Memory savings: ~50% on KV cache. For production, FP8 KV cache is essentially free quality-wise; the memory saving is decisive.
Setup
- vLLM 0.6.4, Llama 3.1 8B FP8 weights
- Compared `fp16` KV cache vs `fp8_e5m2` KV cache
- Standard benchmarks: MMLU, HumanEval, GSM8K, MT-Bench
- Same temperature (0 for deterministic), same prompts, 5 runs averaged
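The two configurations differ only in the KV cache dtype flag. A representative pair of launch commands (the exact model identifier is an assumption; the post only says "Llama 3.1 8B FP8 weights"):

```shell
# Baseline: KV cache dtype follows the model dtype ("auto" is vLLM's default)
vllm serve meta-llama/Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype auto

# FP8 KV cache: same model, half the KV cache memory per token
vllm serve meta-llama/Llama-3.1-8B-Instruct-FP8 --kv-cache-dtype fp8_e5m2
```

Everything downstream (prompts, temperature 0, 5 runs) is held constant between the two servers.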
Results
| Benchmark | FP16 KV | FP8 KV | Drop |
|---|---|---|---|
| MMLU | 68.5% | 68.2% | ~0.3% |
| HumanEval | 66.5% | 66.0% | ~0.5% |
| GSM8K | 78.4% | 77.6% | ~0.8% |
| MT-Bench (out of 10) | 7.85 | 7.75 | ~0.1 |
| Multi-turn coherence (1-5) | 4.12 | 4.10 | ~0.02 |
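Benchmarks like MMLU and GSM8K can be reproduced through lm-evaluation-harness's vLLM backend. The post doesn't say which harness was used, so this invocation is a sketch under that assumption; `kv_cache_dtype` in `--model_args` is forwarded to vLLM's engine arguments:

```shell
pip install lm-eval vllm

# FP8 KV cache evaluation run (swap kv_cache_dtype=auto for the FP16 baseline)
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct-FP8,kv_cache_dtype=fp8_e5m2 \
  --tasks mmlu,gsm8k \
  --batch_size auto
```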
Interpretation
- MMLU and MT-Bench: drops within noise of single-run variance. Effectively no impact.
- HumanEval: ~0.5% drop — one or two test cases out of 164 flip. Measurable but small.
- GSM8K (math): ~0.8% drop. The largest impact, consistent with attention precision mattering more for arithmetic.
- Multi-turn coherence: no measurable impact at this scale.
The pattern: FP8 KV cache costs well under 1% on standard benchmarks, with the worst case (~0.8% on GSM8K) still small. The ~50% memory saving, which translates to roughly doubling serving capacity at the same context length, is decisive in production.
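The 50% figure follows directly from the cache layout: each cached token stores a K and a V vector per layer per KV head, so halving the element width halves the whole cache. A back-of-the-envelope calculation using Llama 3.1 8B's GQA geometry (32 layers, 8 KV heads, head dim 128 — taken from the public model config, not stated in the post):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    # Two tensors (K and V) per layer, one head_dim-sized vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B grouped-query attention: 32 layers, 8 KV heads, head_dim 128
fp16 = kv_cache_bytes_per_token(32, 8, 128, 2)  # 131072 bytes = 128 KiB/token
fp8 = kv_cache_bytes_per_token(32, 8, 128, 1)   # 65536 bytes  =  64 KiB/token

print(fp16 // 1024, fp8 // 1024)  # prints: 128 64
# At a fixed GPU memory budget, halving per-token cost doubles the number
# of cached tokens, hence roughly double the concurrent sequences.
```

The same arithmetic is why the saving is exactly ~50% regardless of context length or batch size: the dtype factor multiplies every cached element.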
Verdict
For production deployments, FP8 KV cache should be always-on. The quality cost is under 1% on standard benchmarks, and the memory saving roughly doubles serving capacity. The one workload where FP16 KV cache is worth keeping is research benchmarking, where every fraction of a percent matters.
Bottom line
FP8 KV cache is essentially free quality-wise. Turn it on.