The 16 GB of VRAM on the RTX 5060 Ti 16GB caps which models you can host on our dedicated hosting. Here are the ceilings by precision, with concrete examples.
FP16
Weight size in bytes ≈ 2 × parameter count, so 16 GB hosts up to ~7-8B parameters at FP16 with room left for KV cache. (A quick calculator sketch follows this list.)
- Phi-3-mini 3.8B: 8 GB – easy, huge KV room
- Mistral 7B: 14 GB – tight, FP8 preferred
- Llama 3 8B: 16 GB – does not fit with KV cache, use FP8
- Qwen 7B: 14 GB – tight, FP8 preferred
- Gemma 2 9B: 18 GB – does not fit
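As a sanity check, the sketch below multiplies approximate parameter counts by a flat bytes-per-parameter figure for each precision used in this article. The counts and the GB = params × bytes / 1e9 shorthand are back-of-envelope assumptions; real checkpoints add embedding and quantization overhead, so expect roughly 10% drift.

```python
# Rough weight footprints by precision; a flat bytes-per-param figure
# ignores per-tensor overhead (embeddings, AWQ/GPTQ scales), so treat
# results as approximate (~±10%).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate dense-model weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]

for name, params in [("Phi-3-mini", 3.8), ("Mistral 7B", 7.2),
                     ("Llama 3 8B", 8.0), ("Gemma 2 9B", 9.2)]:
    print(f"{name}: {weight_gb(params, 'fp16'):.0f} GB at FP16")
```

The same function reproduces the FP8 and INT4 ceilings below: pass "fp8" or "int4" instead of "fp16".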
FP8
FP8 halves weight size relative to FP16. 16 GB hosts models up to ~14-15B at FP8:
- Llama 3 8B: 8 GB – comfortable
- Mistral 7B: 7 GB – very comfortable
- Gemma 2 9B: 9 GB – comfortable
- Mistral Nemo 12B: 12 GB – fits, tight KV
- Qwen 14B: 14 GB – tight but works
- Phi-3 medium 14B: 14 GB – tight
INT4 (AWQ/GPTQ)
INT4 quarters weight size relative to FP16. 16 GB hosts dense models up to ~30B at INT4:
- Qwen 14B: 8 GB – very comfortable
- Codestral 22B: 13 GB – tight but works
- Gemma 27B: 16 GB – barely fits, short context only
- 30B dense: edge cases only, context must be minimal
- Mixtral 8x7B: does not fit (47B total)
- Llama 3 70B: does not fit at any workable precision
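If you serve with vLLM (an assumption; any engine with AWQ support works), loading a 22-27B AWQ model on 16 GB means capping context hard. The model ID below is a hypothetical placeholder for a community AWQ quant, and the limits are illustrative starting points, not tested settings:

```python
from vllm import LLM

llm = LLM(
    model="your-org/gemma-2-27b-it-awq",  # hypothetical AWQ checkpoint ID
    quantization="awq",                    # weights are 4-bit AWQ
    max_model_len=4096,                    # short context only: weights leave little KV room
    gpu_memory_utilization=0.95,           # claim nearly all of the 16 GB
)
print(llm.generate(["ping"])[0].outputs[0].text)
```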
KV Cache
Whatever VRAM remains after weights is your KV cache capacity. For each model, the production sweet spot balances weight footprint against enough KV for the target concurrency; a worked example follows the table.
| Model / Precision | Weights | KV for 10 users at 8k | Fit |
|---|---|---|---|
| Llama 3 8B FP8 | 8 GB | ~5 GB | Comfortable |
| Qwen 14B AWQ | 8 GB | ~8 GB | Comfortable |
| Mistral Nemo 12B AWQ | 7 GB | ~7 GB (FP8 KV) | Comfortable at 32k |
| Codestral 22B AWQ | 13 GB | ~2 GB | Very tight |
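To see where the Llama 3 8B row comes from: KV cache per token is 2 (K and V) × layers × KV heads × head dim × bytes per element. The sketch below plugs in Llama 3 8B's published architecture (32 layers, 8 KV heads via GQA, head dim 128) and assumes an FP8 KV cache at 1 byte per element; the other rows scale the same way.

```python
def kv_gb(layers: int, kv_heads: int, head_dim: int,
          bytes_per_elem: int, tokens: int) -> float:
    """KV cache size in GB: the 2x covers the K and V tensors at every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

tokens = 10 * 8192                               # 10 concurrent users at 8k context
print(f"{kv_gb(32, 8, 128, 1, tokens):.1f} GB")  # Llama 3 8B, FP8 KV -> ~5.4 GB
```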
Picking
For production, FP8 is the best general default. AWQ INT4 gives more headroom when concurrency matters more than raw quality. FP16 only makes sense for sub-8B models where precision is critical.
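As a concrete starting point, here is a minimal launch sketch for that FP8 default, again assuming vLLM; the memory and context settings are illustrative defaults to tune, not benchmarked values:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",            # on-the-fly FP8 weight quantization
    kv_cache_dtype="fp8",          # FP8 KV roughly doubles token capacity vs FP16 KV
    max_model_len=8192,
    gpu_memory_utilization=0.95,
)
out = llm.generate(["Why FP8?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```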
Know Your Ceiling
The RTX 5060 Ti 16GB handles the 7-15B class well at FP8 on our UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: context budget, FP8 KV cache.