The RTX 5060 Ti 16GB inherits Blackwell’s 5th-gen tensor cores with native FP8 support. For AI serving on our dedicated GPU hosting this is one of the most practical features at the mid-tier price point. Here is what FP8 actually delivers.
What FP8 Is
FP8 is an 8-bit floating-point format. Two variants in widespread use:
- E4M3: 4 exponent bits, 3 mantissa. Better precision, smaller range. Used for weights and activations in forward pass.
- E5M2: 5 exponent bits, 2 mantissa. Wider range, less precision. Used for gradients in training.
Blackwell tensor cores accelerate both natively. On the 5060 Ti the spec sheet puts FP8 at ~400 TFLOPS, twice the FP16 rate.
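The E4M3/E5M2 trade-off can be made concrete by deriving each format's largest finite value from its bit layout. This is an illustrative sketch (not from any library), assuming the OCP FP8 conventions: E5M2 reserves its all-ones exponent for inf/NaN like IEEE formats, while the common "FN" flavour of E4M3 keeps that exponent for numbers and reserves only the all-ones mantissa pattern for NaN.

```python
# Max finite value of an FP8 variant, derived from exponent/mantissa bits.
def fp8_max(exp_bits, man_bits, ieee_like):
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # E5M2: all-ones exponent encodes inf/NaN, so the top usable
        # exponent is one below it; every mantissa pattern is a number.
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    else:
        # E4M3 ("FN"): the top exponent is usable, but the all-ones
        # mantissa at that exponent is reserved for NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    return max_mantissa * 2 ** max_exp

print(fp8_max(4, 3, ieee_like=False))  # E4M3 -> 448.0
print(fp8_max(5, 2, ieee_like=True))   # E5M2 -> 57344.0
```

The ~128x gap in range (448 vs 57344) is why E5M2 handles gradient spikes in training while E4M3's extra mantissa bit serves weights and activations better.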
Memory Savings
FP8 weights are half the size of FP16. Concrete examples:
| Model | FP16 weights | FP8 weights | KV-cache headroom (FP8) | KV-cache headroom (FP16) |
|---|---|---|---|---|
| Mistral 7B | 14 GB | 7 GB | 9 GB | 2 GB |
| Llama 3 8B | 16 GB | 8 GB | 8 GB | 0 GB |
| Qwen 2.5 14B | 28 GB | 14 GB | 2 GB | does not fit |
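The table is back-of-envelope arithmetic you can reproduce: weight size is roughly parameter count times bytes per parameter, and headroom is what remains of the 16 GB after weights (runtime overhead of a GB or two is glossed over here).

```python
# Rough VRAM math for an N-billion-parameter model on a 16 GB card.
# Billions of params * bytes per param ~ GB of weights.
def weight_gb(params_billion, bytes_per_param):
    return params_billion * bytes_per_param

for name, params in [("Mistral 7B", 7), ("Llama 3 8B", 8), ("Qwen 2.5 14B", 14)]:
    fp16 = weight_gb(params, 2)  # FP16 = 2 bytes/param
    fp8 = weight_gb(params, 1)   # FP8  = 1 byte/param
    print(f"{name}: FP16 {fp16} GB, FP8 {fp8} GB, "
          f"headroom at FP8 ~ {16 - fp8} GB")
```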
Speed
With FP8 running natively on the tensor cores, throughput roughly matches or beats FP16 while using half the memory. On Mistral 7B:
- FP16 decode on 5060 Ti: ~65 t/s (tight VRAM limits batching)
- FP8 decode on 5060 Ti: ~110 t/s
The ~70% gain on Mistral 7B comes not from faster raw matmuls but from three effects stacking: halved memory traffic per token, the native tensor-core path, and the extra KV cache that allows more concurrent batching.
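The memory-traffic effect dominates because single-stream decode is bandwidth bound: every generated token reads the full weight set, so an upper bound on per-sequence tokens/s is memory bandwidth divided by weight size. A quick roofline sketch, assuming the 5060 Ti's nominal 448 GB/s (batching then multiplies these per-sequence bounds, which is how the measured aggregate numbers above exceed them):

```python
# Bandwidth-bound decode estimate: each token reads all weights once,
# so tokens/s per sequence <= bandwidth / weight bytes.
BANDWIDTH_GBS = 448           # nominal 5060 Ti memory bandwidth (assumption)
WEIGHTS_GB = {"FP16": 14, "FP8": 7}  # Mistral 7B weight footprint

for fmt, gb in WEIGHTS_GB.items():
    bound = BANDWIDTH_GBS / gb
    print(f"{fmt}: ~{bound:.0f} t/s per sequence (roofline upper bound)")
```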
Quality
FP8 quality degradation on perplexity benchmarks is typically 0.3-0.8% versus FP16. On downstream evals (MMLU, HumanEval) the two are often indistinguishable. For most production workloads users cannot tell the difference.
Exceptions where quality loss is visible:
- Very long-context reasoning with accumulated error
- Precision-critical math or finance workloads
- Models not specifically trained/fine-tuned with FP8 in mind
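Some intuition for why the hit is so small: E4M3 rounding error per element is bounded by about half a unit in the last place, i.e. at most ~6% relative, and it is roughly zero-mean, so the large dot products inside a transformer average most of it away. A minimal simulation below (not a real perplexity measurement); `round_e4m3` is a hypothetical helper that mimics E4M3 rounding in the normal range, ignoring saturation and subnormals.

```python
import math
import random

def round_e4m3(x):
    # Simulated E4M3 rounding: keep 3 mantissa bits (plus the implicit
    # leading 1) by snapping x to the representable grid in its binade.
    if x == 0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - 3)        # spacing of representable values near x
    return round(x / step) * step

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(10_000)]
rel = [abs(round_e4m3(x) - x) / abs(x) for x in xs if x != 0]
print(f"max relative error : {max(rel):.3%}")    # bounded by 1/16
print(f"mean relative error: {sum(rel) / len(rel):.3%}")
```

The mean error is far below the worst case, which matches the observation that per-tensor-scaled FP8 weights barely move perplexity.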
Checkpoints
Pre-quantised FP8 checkpoints are widely available:
- neuralmagic/Llama-3.1-8B-Instruct-FP8
- neuralmagic/Llama-3.3-70B-Instruct-FP8 (needs a larger card)
- neuralmagic/Mistral-7B-Instruct-FP8
- neuralmagic/Qwen2.5-14B-Instruct-FP8
- Diffusion: FLUX Schnell in FP8
Deployment
vLLM:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model neuralmagic/Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92
```
TGI:
```bash
text-generation-launcher \
  --model-id neuralmagic/Mistral-7B-Instruct-FP8 \
  --quantize fp8
```
For more format options see quantisation formats compared.
See also: FP8 Llama deployment, 5th-gen tensor cores.