FP8 is the best default precision for Llama 3 8B on the RTX 5060 Ti 16GB. Native Blackwell FP8 tensor cores deliver near-FP16 quality at half the memory and nearly double the throughput. Here is the full deployment walkthrough on our hosting.
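The "half the memory" claim is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, assuming the published ~8.03B parameter count for Llama 3.1 8B:

```python
# Back-of-envelope weight memory for Llama 3 8B at different precisions.
# ~8.03B is the published Llama 3.1 8B parameter count.
PARAMS = 8.03e9

def weight_gb(bytes_per_param: float) -> float:
    """Weight memory in GB (decimal) for a given precision."""
    return PARAMS * bytes_per_param / 1e9

fp16 = weight_gb(2.0)   # ~16.1 GB -- does not fit a 16 GB card once overhead is added
fp8  = weight_gb(1.0)   # ~8.0 GB  -- leaves roughly half the card for KV cache
print(f"FP16 weights: {fp16:.1f} GB, FP8 weights: {fp8:.1f} GB")
```

FP16 weights alone already exceed the 16 GB card; FP8 halves that and frees the remainder for KV cache and activations.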
Checkpoint
Recommended FP8 checkpoints for Llama:
- neuralmagic/Llama-3.1-8B-Instruct-FP8 – static FP8 quantisation, best compatibility
- neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic – dynamic activation quantisation, slightly higher quality
- neuralmagic/Meta-Llama-3.3-70B-Instruct-FP8 – 70B FP8 (needs a bigger card)
For Llama 3 8B, the dynamic variant typically performs best on the 5060 Ti.
Launch
```shell
python -m vllm.entrypoints.openai.api_server \
  --model neuralmagic/Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --served-model-name llama-3.1-8b
```
Model load takes ~30-60 seconds on fast NVMe, longer on the first run while the checkpoint downloads.
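Once the server is up, it speaks the OpenAI-compatible chat completions API. A minimal stdlib client sketch, assuming vLLM's default port 8000 and the `--served-model-name llama-3.1-8b` flag from the launch command above (prompt, `max_tokens`, and `temperature` values are illustrative):

```python
import json
import urllib.request

# vLLM serves an OpenAI-compatible API on port 8000 by default.
URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    """Request body matching the server launched with --served-model-name llama-3.1-8b."""
    return {
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }

def chat(prompt: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI SDK pointed at `http://localhost:8000/v1` works the same way.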
Verify
Check startup logs for confirmation:
```
[INFO] Loading fp8 checkpoint
[INFO] Using Blackwell FP8 tensor cores
Weight dtype: torch.float8_e4m3fn
```
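Exact log wording varies across vLLM versions, so a scripted check should be treated as a heuristic. A minimal sketch keyed to the sample lines above:

```python
import re

def fp8_active(log_text: str) -> bool:
    """Heuristic startup-log check: FP8 weights loaded and no FP16 fallback.

    Log wording varies across vLLM versions; adjust patterns to your logs.
    """
    has_fp8 = bool(re.search(r"float8_e4m3fn|fp8 checkpoint", log_text, re.IGNORECASE))
    fell_back = "Weight dtype: torch.float16" in log_text
    return has_fp8 and not fell_back

good = "[INFO] Loading fp8 checkpoint\nWeight dtype: torch.float8_e4m3fn"
bad = "Weight dtype: torch.float16"
```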
If you see a fallback to FP16 (torch.float16), the FP8 path is not active. Check:
- vLLM version – need 0.6.0+ for solid Blackwell FP8 support
- CUDA version – 12.4+
- Driver – 565+
- --quantization fp8 flag explicitly present
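The version minimums in the checklist can be compared programmatically once you have the installed version strings (e.g. from `vllm --version`, `nvcc --version`, and `nvidia-smi`). A small comparison helper, as a sketch:

```python
def version_tuple(v: str) -> tuple:
    """'0.6.1' -> (0, 6, 1); ignores local suffixes like '+cu124'."""
    core = v.split("+")[0]
    return tuple(int(p) for p in core.split(".") if p.isdigit())

# Minimums from the checklist above
MIN_VLLM   = (0, 6, 0)   # vLLM 0.6.0+
MIN_CUDA   = (12, 4)     # CUDA 12.4+
MIN_DRIVER = (565,)      # driver 565+

def meets(installed: str, minimum: tuple) -> bool:
    """True if the installed version string satisfies the minimum."""
    return version_tuple(installed) >= minimum
```

For example, `meets("0.6.2", MIN_VLLM)` passes while `meets("12.1", MIN_CUDA)` fails.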
Tune
- --max-num-seqs 24 for a ~14-user concurrency target
- --max-num-batched-tokens 8192 for prefill efficiency
- --enable-chunked-prefill if mixing short chat with long RAG prompts
- --kv-cache-dtype fp8 to double KV cache capacity
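The "double KV cache capacity" claim from --kv-cache-dtype fp8 follows directly from the cache geometry. A sketch using Llama 3 8B's published architecture (32 layers, 8 KV heads via GQA, head dimension 128); the ~6 GiB free-for-cache figure is an illustrative assumption, not a measurement:

```python
# KV cache bytes per token for Llama 3 8B: 32 layers, 8 KV heads (GQA), head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # Factor of 2 covers the K and V caches in every layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem

fp16_kv = kv_bytes_per_token(2)   # 131072 bytes = 128 KiB per token
fp8_kv  = kv_bytes_per_token(1)   # 65536 bytes  =  64 KiB per token

def tokens_in_cache(cache_gib: float, bytes_per_elem: int) -> int:
    """How many tokens fit in a KV cache budget of cache_gib GiB."""
    return int(cache_gib * 2**30 // kv_bytes_per_token(bytes_per_elem))

# With an assumed ~6 GiB left for cache after FP8 weights:
# fp16 KV -> ~49k tokens, fp8 KV -> ~98k tokens
```

Halving bytes per element exactly doubles the token budget, which is what makes fp8 KV cache attractive at high concurrency.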
See the full vLLM batching tuning guide.
Troubleshooting
- OOM on load: reduce --max-model-len or --gpu-memory-utilization
- Slow decode: check that Blackwell tensor cores are active; verify persistence mode is on
- Tokeniser issues: some FP8 checkpoints use custom tokenisers – --trust-remote-code may be needed
FP8 Llama Preconfigured
Blackwell 16GB with FP8 Llama ready to serve. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: FP8 deep dive, Llama 3 8B fit.