Qwen 2.5 14B, quantized to INT4 via AWQ or GPTQ, is the largest model that serves cleanly on 16 GB. Here are the full benchmarks on the RTX 5060 Ti 16GB in our hosting environment.
Setup
- Model: Qwen/Qwen2.5-14B-Instruct-AWQ
- 48 layers, 8 KV heads (GQA), 128 head dim
- vLLM 0.6.4, Marlin AWQ kernels, FlashAttention 2.6
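A minimal sketch of loading this configuration with vLLM's offline Python API, assuming the settings above; the `gpu_memory_utilization` value and the test prompt are placeholders to tune for your own setup:

```python
# Sketch: load Qwen2.5-14B-Instruct-AWQ in vLLM with the settings from this benchmark.
# gpu_memory_utilization and the prompt are assumptions; adjust for your deployment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",
    quantization="awq",            # vLLM selects the Marlin AWQ kernel on supported GPUs
    kv_cache_dtype="fp8",          # FP8 KV cache; needed to reach 32k on 16 GB
    max_model_len=32768,
    gpu_memory_utilization=0.92,   # assumed value; leave headroom for activations
)

params = SamplingParams(max_tokens=512, temperature=0.7)
out = llm.generate(["Explain grouped-query attention in two sentences."], params)
print(out[0].outputs[0].text)      # raw prompt, no chat template; fine for a smoke test
```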
Decode Throughput
128 in / 512 out, batch 1:
| Precision | VRAM (weights) | Decode t/s | Max context |
|---|---|---|---|
| FP16 | 28 GB | Does not fit | — |
| FP8 | 14 GB | Does not fit with KV | — |
| AWQ INT4 (Marlin) | 9.0 GB | 68 | 16,384 |
| AWQ INT4 + FP8 KV | 9.0 GB | 70 | 32,768 |
| GPTQ INT4 | 9.2 GB | 65 | 16,384 |
| GGUF Q4_K_M | 8.8 GB | 55 | 16,384 |
| EXL2 4.0 bpw | 8.2 GB | 75 | 16,384 |
14B decode on 16 GB is memory-bandwidth bound at around 70 t/s, well below the 112 t/s we measured for 8B, but MMLU is meaningfully higher (~74 vs ~68).
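A rough way to reproduce the decode measurement against an OpenAI-compatible vLLM endpoint; the base URL, API key, and prompt below are assumptions for your own deployment:

```python
# Rough decode-throughput measurement: stream 512 tokens and time from first to last token.
# Endpoint, API key, and prompt are assumptions; run several iterations and average in practice.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Write a short story about a GPU benchmark."}],
    max_tokens=512,
    stream=True,
)

first, tokens = None, 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1                      # one streamed chunk is roughly one token
        if first is None:
            first = time.perf_counter()  # time of first token, excludes prefill
last = time.perf_counter()

print(f"decode: {(tokens - 1) / (last - first):.1f} t/s (prefill excluded)")
```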
Prefill Throughput
- AWQ INT4: 2,100 input t/s
- GPTQ INT4: 2,000 input t/s
- GGUF Q4_K_M: 1,600 input t/s
- EXL2 4.0 bpw: 2,600 input t/s
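Prefill (input) throughput is prompt tokens divided by time to first token. A rough sketch of estimating it over the same streaming endpoint; the endpoint and the placeholder prompt are assumptions:

```python
# Rough prefill estimate: prompt_tokens / time-to-first-token on a long prompt.
# base_url, model name, and the placeholder prompt are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
long_prompt = "lorem ipsum dolor sit amet " * 700  # placeholder, a few thousand tokens

t0 = time.perf_counter()
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=1,
    stream=True,
    stream_options={"include_usage": True},   # final chunk then carries token counts
)

ttft, prompt_tokens = None, None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - t0        # prefill time, to first generated token
    if chunk.usage:
        prompt_tokens = chunk.usage.prompt_tokens

print(f"prefill: {prompt_tokens / ttft:.0f} input t/s")
```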
Concurrency
| Users | Total t/s (AWQ + FP8 KV) | Per-user t/s | p99 TTFT |
|---|---|---|---|
| 1 | 70 | 70 | 340 ms |
| 2 | 125 | 62 | 420 ms |
| 4 | 205 | 51 | 560 ms |
| 8 | 280 | 35 | 850 ms |
| 16 | 320 | 20 | 1,600 ms |
Scaling flattens out by batch 16 because the KV cache budget is tight. For production concurrency beyond 8 users, move up to an RTX 5080 16 GB or an RTX 3090 24 GB.
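The concurrency numbers come from a parallel load generator; an asyncio sketch of the same idea is below. Endpoint, model name, prompt, and user count are assumptions, and with only a handful of requests the worst observed TTFT stands in for p99:

```python
# Sketch of a concurrent load test: N parallel streaming requests, report total t/s and worst TTFT.
# Endpoint, model name, prompt, and user count are assumptions; real runs should vary prompts.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request():
    t0 = time.perf_counter()
    stream = await client.chat.completions.create(
        model="Qwen/Qwen2.5-14B-Instruct-AWQ",
        messages=[{"role": "user", "content": "Summarise the history of GPUs."}],
        max_tokens=512,
        stream=True,
    )
    ttft, tokens = None, 0
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1
            if ttft is None:
                ttft = time.perf_counter() - t0
    return ttft, tokens

async def main(users: int = 8):
    t0 = time.perf_counter()
    results = await asyncio.gather(*(one_request() for _ in range(users)))
    elapsed = time.perf_counter() - t0
    total_tokens = sum(t for _, t in results)
    worst_ttft = max(t for t, _ in results if t is not None)
    print(f"{users} users: {total_tokens / elapsed:.0f} total t/s, worst TTFT {worst_ttft * 1000:.0f} ms")

asyncio.run(main())
```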
Context Length
Qwen 2.5 14B supports up to 128k context (YaRN extended). Practical budgets on 16 GB:
- AWQ + FP16 KV: 16k max
- AWQ + FP8 KV: 32k max
- AWQ + FP8 KV + `--max-num-seqs 1`: 64k possible
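These budgets follow from KV-cache arithmetic: per token, KV size is 2 (K and V) × layers × KV heads × head dim × bytes per element. A quick back-of-the-envelope check using the Setup numbers:

```python
# Back-of-the-envelope KV-cache sizing for Qwen2.5-14B: 48 layers, 8 KV heads (GQA), 128 head dim.
layers, kv_heads, head_dim = 48, 8, 128

def kv_gib(tokens: int, bytes_per_elem: int) -> float:
    """KV cache size in GiB: K and V for every layer, KV head, and position."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

weights_gib = 9.0  # AWQ INT4 weight footprint from the table above
for ctx, dtype, nbytes in [(16_384, "FP16", 2), (32_768, "FP8", 1), (65_536, "FP8", 1)]:
    kv = kv_gib(ctx, nbytes)
    print(f"{ctx:>6} ctx, {dtype} KV: {kv:.1f} GiB KV + {weights_gib:.1f} GiB weights = {weights_gib + kv:.1f} GiB")
```

At FP16 that is roughly 0.2 MB per token, so 16k of context costs about 3 GB on top of the 9 GB of weights; FP8 halves the per-token cost, which is why 32k fits comfortably and 64k only works with a single in-flight sequence.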
Verdict
Qwen 2.5 14B AWQ + FP8 KV at 32k is the sweet spot: a strong model (beats Llama 3 8B on reasoning, code, and multilingual tasks), solid 70 t/s decode, and reasonable concurrency up to 8 users.