FP8 KV cache halves the memory cost of every token’s attention state. On the RTX 5060 Ti 16GB servers we host, this is one of the highest-value flags you can enable. Blackwell’s tensor cores support FP8 natively, so there is no software-emulation overhead.
What It Does
vLLM stores attention K and V tensors for every token in the active sequence plus the prefix cache. Default is FP16 (2 bytes/scalar). --kv-cache-dtype fp8 stores these as FP8 (1 byte/scalar), cutting KV memory in half.
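The saving is easy to quantify. A minimal sketch of the per-token KV footprint, using Llama 3.1 8B’s published architecture (32 transformer layers, 8 KV heads via GQA, head dimension 128) as the worked example:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_scalar):
    # K and V each hold num_kv_heads * head_dim scalars per layer,
    # hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_scalar

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
fp16_cost = kv_bytes_per_token(32, 8, 128, 2)  # FP16: 131072 B = 128 KiB/token
fp8_cost  = kv_bytes_per_token(32, 8, 128, 1)  # FP8:   65536 B =  64 KiB/token
```

At 128 KiB per token in FP16, a 32k-token context alone costs 4 GiB of KV; FP8 halves that to 2 GiB, which is exactly where the doubled context lengths below come from.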
Blackwell supports both FP8 formats. vLLM defaults to E4M3 for the KV cache – its extra mantissa bit gives higher precision, which matters more for KV values than exponent range. E5M2 trades that precision for a wider dynamic range and is available, but usually unnecessary for KV.
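To make the trade-off concrete, here is a pure-Python sketch of the two formats’ largest finite values, following the OCP FP8 conventions (E4M3 uses bias 7 and reserves only the all-ones pattern for NaN, so its top exponent stays usable; E5M2 is IEEE-style with bias 15):

```python
def fp8_max_normal(exp_bits, man_bits, bias, ieee_style):
    # Largest finite normal = top_fraction * 2^e_max.
    if ieee_style:
        # Top exponent field is reserved for inf/NaN (E5M2).
        e_max = (2**exp_bits - 2) - bias
        frac = 2 - 2**-man_bits
    else:
        # Top exponent field still encodes values; only the
        # all-ones mantissa there is NaN (E4M3).
        e_max = (2**exp_bits - 1) - bias
        frac = 2 - 2 * 2**-man_bits
    return frac * 2**e_max

e4m3_max = fp8_max_normal(4, 3, bias=7, ieee_style=False)  # 448.0
e5m2_max = fp8_max_normal(5, 2, bias=15, ieee_style=True)  # 57344.0
```

E5M2 reaches roughly 128x further in magnitude, but its 2-bit mantissa doubles the relative rounding step – the wrong trade for KV values, whose magnitudes are modest.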
Enabling FP8 KV Cache
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.92
```
On Blackwell this takes the fast path – KV reads and writes use native FP8 tensor-core instructions, with no emulation cost.
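FP8 KV is transparent to clients – requests against the OpenAI-compatible endpoint are unchanged. A stdlib-only sketch (the `build_payload`/`chat` helper names are ours, and `localhost:8000` assumes vLLM’s default port):

```python
import json
from urllib import request

def build_payload(prompt, model="meta-llama/Llama-3.1-8B-Instruct", max_tokens=256):
    # Standard OpenAI chat-completions request body.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8000/v1"):
    # Assumed default local endpoint; no client-side changes
    # are needed when the server stores KV in FP8.
    body = json.dumps(build_payload(prompt)).encode()
    req = request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```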
Quality Impact
Measured MMLU delta on Llama 3.1 8B-Instruct across 500 questions:
| Config | Weights | KV | MMLU | Delta vs baseline |
|---|---|---|---|---|
| Baseline | FP16 | FP16 | 68.4 | – |
| FP8 weights, FP16 KV | FP8 E4M3 | FP16 | 68.3 | -0.1 |
| FP8 weights, FP8 KV | FP8 E4M3 | FP8 E4M3 | 68.0 | -0.4 |
| AWQ INT4, FP8 KV | INT4 | FP8 E4M3 | 67.5 | -0.9 |
For most real workloads the quality hit is invisible. Long-context retrieval tasks (needle-in-a-haystack style) can show slightly higher error at extreme lengths (>64k tokens) because attention scores accumulate FP8 rounding across many keys – in those cases keep KV in FP16.
Capacity Gains
| Model | Weights | FP16 KV max_len | FP8 KV max_len | Gain |
|---|---|---|---|---|
| Llama 3.1 8B FP8 | 8.0 GB | 32,768 | 65,536 | 2.0x |
| Llama 3.1 8B AWQ | 5.5 GB | 49,152 | 98,304 | 2.0x |
| Qwen 2.5 14B AWQ | 9.0 GB | 16,384 | 32,768 | 2.0x |
| Mistral Nemo 12B FP8 | 12.5 GB | 8,192 | 24,576 | 3.0x |
Halving bytes per token doubles what a fixed KV pool can hold, so 2.0x is the expected gain. Mistral Nemo’s 3.0x is an artefact of how tight its FP16 baseline is: with 12.5 GB of weights only a sliver of VRAM remains for KV, and the quoted max_len values are rounded to supported sizes, so small absolute shifts in the pool swing the ratio. The general point stands for any large quantised model on tight hardware: the less memory left after weights, the more each saved KV byte matters.
Compatibility
- vLLM 0.5+: fully supported on Blackwell with `--kv-cache-dtype fp8`
- Speculative decoding: works with FP8 KV
- Prefix caching: works – caches blocks as FP8 too
- Chunked prefill: orthogonal, no interaction
- FlashAttention: Blackwell fast path supports FP8 K, V natively
Recommendation: enable --kv-cache-dtype fp8 by default on every deployment except those serving ultra-long retrieval-critical contexts.
Double Your Context on Blackwell 16GB
FP8 KV cache for ~1% quality cost. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: FP8 Llama deployment, context budget, 128k context, prefix caching, chunked prefill.