Every byte of GPU VRAM on the RTX 5060 Ti 16GB at our dedicated GPU hosting goes to one of four things: model weights, KV cache, activations, or prefix cache. Understanding that budget determines how long a context you can serve and how many concurrent users you can support.
Components
Of the 16 GB total, after CUDA runtime and cuBLAS/cuDNN handles, you have ~15.3 GB usable. vLLM’s --gpu-memory-utilization 0.90 reserves ~14.1 GB for vLLM itself. That splits into:
- Model weights: fixed cost, depends on quantisation
- Activations / workspace: ~300-800 MB for the forward pass
- KV cache: the rest – shared across all active sequences plus prefix cache
The KV Cache Formula
KV cache bytes per token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
For Llama 3.1 8B (32 layers, 8 KV heads, 128 head_dim):
- FP16 KV: 2 * 32 * 8 * 128 * 2 = 131,072 bytes per token = 128 KiB/token
- FP8 KV: 64 KiB/token
For Qwen 2.5 14B (48 layers, 8 KV heads, 128 head_dim):
- FP16 KV: 192 KiB/token
- FP8 KV: 96 KiB/token
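The per-token formula is easy to script. A minimal sketch using the model parameters quoted above (the helper name is our own, not a vLLM API):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: 2 (K and V) * layers * KV heads * head_dim * dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128
print(kv_bytes_per_token(32, 8, 128, 2))  # FP16: 131072 bytes = 128 KiB
print(kv_bytes_per_token(32, 8, 128, 1))  # FP8:   65536 bytes =  64 KiB

# Qwen 2.5 14B: 48 layers, 8 KV heads, head_dim 128
print(kv_bytes_per_token(48, 8, 128, 2))  # FP16: 196608 bytes = 192 KiB
```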
Concrete Budgets on 16 GB
Assuming --gpu-memory-utilization 0.90 (14.1 GB to vLLM) and 0.5 GB activations:
| Model + Quant | Weights | KV Budget | KV dtype | Tokens (1 seq) | Comfortable max_model_len |
|---|---|---|---|---|---|
| Llama 3.1 8B FP16 | 16 GB | — | — | Does not fit | — |
| Llama 3.1 8B FP8 | 8.0 GB | 5.6 GB | FP16 | ~44,800 | 32,768 |
| Llama 3.1 8B FP8 | 8.0 GB | 5.6 GB | FP8 | ~89,600 | 65,536 |
| Llama 3.1 8B AWQ | 5.5 GB | 8.1 GB | FP16 | ~64,800 | 49,152 |
| Qwen 2.5 14B AWQ | 9.0 GB | 4.6 GB | FP16 | ~24,500 | 16,384 |
| Qwen 2.5 14B AWQ | 9.0 GB | 4.6 GB | FP8 | ~49,000 | 32,768 |
| Mistral 7B FP8 | 7.2 GB | 6.4 GB | FP16 | ~51,200 | 32,768 |
| Gemma 9B FP8 | 9.5 GB | 4.1 GB | FP16 | ~32,800 | 16,384 |
Set --max-model-len below the single-seq maximum to leave room for concurrency. A rule of thumb: max_model_len * expected_concurrency should be under the KV budget in tokens.
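That rule of thumb can be checked in a few lines. A sketch using the Llama 3.1 8B FP8-weights row from the table (decimal GB; exact rounding conventions vary, so the totals land slightly below the table's rounded figures):

```python
def kv_token_budget(kv_budget_gb: float, bytes_per_token: int) -> int:
    """How many KV-cache tokens fit in the given budget (decimal GB)."""
    return int(kv_budget_gb * 1e9 // bytes_per_token)

def fits(max_model_len: int, concurrency: int,
         kv_budget_gb: float, bytes_per_token: int) -> bool:
    """Rule of thumb: max_model_len * expected concurrency must stay under the token budget."""
    return max_model_len * concurrency <= kv_token_budget(kv_budget_gb, bytes_per_token)

# 5.6 GB KV budget, FP8 KV at 65,536 bytes/token
print(kv_token_budget(5.6, 65536))   # ~85k tokens total
print(fits(32768, 2, 5.6, 65536))    # two full 32k sequences fit
print(fits(32768, 4, 5.6, 65536))    # four do not
```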
Stretching the Budget
- FP8 KV cache: doubles the number of tokens you can hold for roughly 1% quality loss. Always worth it at 16 GB.
- AWQ/GPTQ INT4: cuts weights in half vs FP8 for 7-14B models.
- Lower --max-num-seqs: fewer concurrent sequences means longer per-sequence context.
- GQA/MQA architectures: prefer models with fewer KV heads (Llama 3 has 8, vs Llama 2’s 32) – massively smaller KV.
- Prefix caching: shared prefixes only cost KV once across users.
- Chunked prefill: doesn’t change budget but lets you use it more smoothly.
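Putting the levers together: a small sketch (our own helper, not a vLLM API) that estimates a safe --max-model-len from the weight footprint, KV dtype, and --max-num-seqs, by splitting the leftover KV budget evenly across sequences. Defaults assume the FP8 Llama 3.1 8B numbers used throughout this article:

```python
def safe_max_model_len(total_gb: float = 14.1,        # vLLM's share at 0.90 utilisation
                       weights_gb: float = 8.0,       # FP8 Llama 3.1 8B weights
                       activations_gb: float = 0.5,
                       kv_bytes_per_token: int = 65536,  # FP8 KV, Llama 3.1 8B
                       max_num_seqs: int = 2) -> int:
    """Divide the leftover KV budget evenly across concurrent sequences."""
    kv_budget_bytes = (total_gb - weights_gb - activations_gb) * 1e9
    return int(kv_budget_bytes // kv_bytes_per_token // max_num_seqs)

print(safe_max_model_len(max_num_seqs=1))  # one sequence gets the whole budget (~85k)
print(safe_max_model_len(max_num_seqs=4))  # four sequences get a quarter each (~21k)
```

Halving --max-num-seqs doubles the per-sequence context, which is the trade the Lower --max-num-seqs bullet above describes.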
For ultra-long context (128k), see our 128k context deployment guide. For model-size trade-offs, see the max model size guide.
16GB of Usable VRAM
Plan your context, concurrency, and model choice carefully on our UK dedicated hosting.
Order the RTX 5060 Ti 16GB