Llama 3.1 8B is the model most often served on a 16 GB card. All numbers below were measured on an RTX 5060 Ti 16GB from our dedicated GPU hosting, running vLLM 0.6.4 on CUDA 12.6 with driver 560.
Setup
- GPU: RTX 5060 Ti 16GB (Blackwell, 448 GB/s, 180 W)
- Host: Ryzen 9 7950X, 64 GB DDR5-5600, Gen4 NVMe
- vLLM 0.6.4, CUDA 12.6, PyTorch 2.5, FlashAttention 2.6
- Model: meta-llama/Llama-3.1-8B-Instruct
- Benchmark tool: vLLM's `benchmark_throughput.py` plus a custom latency script (sketched below)
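The latency script is roughly the following shape (a minimal sketch, not the exact script; the synthetic prompt and the FP8 settings mirror the batch-1 decode run below):

```python
# Minimal decode-throughput measurement with vLLM's offline API (0.6.x).
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",        # FP8 E4M3 weights
    kv_cache_dtype="fp8",      # FP8 KV cache
    max_model_len=4096,
)

prompt = "word " * 128                         # roughly a 128-token prompt
params = SamplingParams(max_tokens=512, ignore_eos=True)

start = time.perf_counter()
result = llm.generate([prompt], params)[0]
elapsed = time.perf_counter() - start          # includes the (short) prefill

n_out = len(result.outputs[0].token_ids)
print(f"{n_out} tokens in {elapsed:.2f}s = {n_out / elapsed:.1f} t/s")
```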
Decode Throughput
Steady-state output tokens/sec at 128 input + 512 output, batch 1:
| Precision | Weights | t/s (batch 1) | VRAM used |
|---|---|---|---|
| FP16 | 16.0 GB | does not fit | n/a |
| FP8 E4M3 | 8.0 GB | 108 | 11.4 GB |
| FP8 + FP8 KV | 8.0 GB | 112 | 9.8 GB |
| AWQ INT4 (Marlin) | 5.5 GB | 135 | 8.2 GB |
| GPTQ INT4 (Marlin) | 5.6 GB | 132 | 8.3 GB |
| GGUF Q4_K_M (llama.cpp) | 4.9 GB | 95 | 7.6 GB |
| EXL2 4.0 bpw (TabbyAPI) | 4.8 GB | 145 | 7.2 GB |
AWQ and EXL2 beat FP8 at batch 1 because batch-1 decode is memory-bandwidth-bound: every output token streams the full weight set from VRAM, so the 4-bit formats (Marlin INT4 kernels, EXL2 4.0 bpw) move roughly half the bytes per token that FP8 does. At larger batch sizes the GEMMs turn compute-bound and FP8 catches up, then passes INT4.
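For reference, the engine configs behind the vLLM rows look roughly like this (a sketch; the AWQ and GPTQ repo IDs are placeholders for whichever INT4 exports you serve, and each variant has to be loaded on its own since each fills VRAM):

```python
# Engine configs behind the vLLM rows above (sketch; INT4 repo IDs are placeholders).
from vllm import LLM

CONFIGS = {
    "fp8":       dict(model="meta-llama/Llama-3.1-8B-Instruct",
                      quantization="fp8"),
    "fp8+fp8kv": dict(model="meta-llama/Llama-3.1-8B-Instruct",
                      quantization="fp8", kv_cache_dtype="fp8"),
    "awq-int4":  dict(model="your-org/Llama-3.1-8B-Instruct-AWQ",   # placeholder repo
                      quantization="awq"),   # vLLM picks Marlin kernels when supported
    "gptq-int4": dict(model="your-org/Llama-3.1-8B-Instruct-GPTQ",  # placeholder repo
                      quantization="gptq"),  # likewise upgraded to gptq_marlin
}

llm = LLM(**CONFIGS["fp8+fp8kv"])  # load one variant at a time
```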
Prefill Throughput
Input tokens processed per second, single sequence:
| Precision | Prefill t/s |
|---|---|
| FP8 | 6,800 |
| FP8 + FP8 KV | 6,900 |
| AWQ INT4 | 4,200 |
| GGUF Q4_K_M | 3,100 |
| EXL2 4.0 bpw | 5,200 |
FP8 dominates prefill because prefill is compute-bound: Blackwell's FP8 tensor cores run large GEMMs at full rate, while the 4-bit formats pay a dequantization step before every matmul.
Concurrency Scaling
FP8 + FP8 KV, 256 in / 512 out, `--max-num-seqs` auto:
| Concurrent users | Total t/s | Per-user t/s | p99 TTFT |
|---|---|---|---|
| 1 | 112 | 112 | 180 ms |
| 2 | 205 | 103 | 220 ms |
| 4 | 355 | 89 | 310 ms |
| 8 | 510 | 64 | 480 ms |
| 16 | 640 | 40 | 780 ms |
| 32 | 720 | 22 | 1,450 ms |
Aggregate throughput plateaus around 700-720 t/s at batch 32, which is where the card's 448 GB/s memory bandwidth becomes the ceiling.
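The sweep was driven with concurrent requests against the OpenAI-compatible endpoint. A minimal load generator along these lines reproduces the aggregate numbers (a sketch, assuming a `vllm serve` instance on localhost:8000; p99 TTFT needs streaming responses and is omitted here):

```python
# Concurrency sweep sketch against a running vLLM OpenAI-compatible server.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def one_request(_):
    resp = client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        prompt="word " * 256,          # roughly 256 input tokens
        max_tokens=512,
    )
    return resp.usage.completion_tokens

for users in (1, 2, 4, 8, 16, 32):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        tokens = list(pool.map(one_request, range(users)))
    wall = time.perf_counter() - t0
    print(f"{users:>2} users: {sum(tokens) / wall:.0f} t/s aggregate")
```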
Time to First Token
Batch 1, FP8:
- 256-token prompt: 120 ms TTFT
- 2,048-token prompt: 380 ms TTFT
- 8,192-token prompt: 1,550 ms TTFT
- 32,768-token prompt: 7,100 ms TTFT (use chunked prefill; see the sketch after this list)
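Chunked prefill splits a long prompt into pieces so decode traffic keeps flowing while it runs; in vLLM it is a single engine flag. A sketch (the 2,048-token chunk size is a starting point to tune, not a measured optimum):

```python
# Long-context engine config with chunked prefill (sketch).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",
    max_model_len=32768,              # full 32k context
    enable_chunked_prefill=True,      # break long prefills into chunks
    max_num_batched_tokens=2048,      # chunk size; keeps decode latency steady
)
```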
Verdict
Llama 3.1 8B FP8 + FP8 KV at 32k context is the default serving config for this card: ~112 t/s single-user, ~700 t/s aggregate at 32 concurrent. For maximum single-user speed, use AWQ or EXL2 4.0 bpw. For long-context work, stick with FP8 + FP8 KV. Add speculative decoding for interactive chat and prefix caching for fixed system prompts; a combined config sketch follows.
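As a single engine config, the verdict translates to roughly the following (a sketch: the Llama 3.2 1B draft model and the memory-utilization value are suggestions, not benchmarked above, and some vLLM releases restrict which of these features combine, so enable them incrementally):

```python
# Recommended serving config from the verdict, as a vLLM engine sketch.
# Draft model and gpu_memory_utilization are illustrative choices.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",                  # FP8 KV cache so 32k context fits
    max_model_len=32768,
    gpu_memory_utilization=0.92,           # illustrative; leave some headroom
    enable_prefix_caching=True,            # reuse KV across fixed system prompts
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # assumed draft model
    num_speculative_tokens=4,              # typical starting point
)
```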
Llama 3.1 8B on Blackwell 16GB
112 t/s solo, 700 t/s aggregate. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: deployment guide, vs RTX 3090.