Qwen 2.5 14B punches above its weight on reasoning benchmarks while staying single-card friendly. The RTX 5060 Ti 16GB in our fleet is a strong match via AWQ quantisation.
Fit
| Precision | Weights | KV Cache Room |
|---|---|---|
| FP16 | ~28 GB | Does not fit |
| FP8 | ~14 GB | ~2 GB – tight |
| AWQ INT4 | ~8 GB | ~8 GB – comfortable |
| GPTQ INT4 | ~8 GB | ~8 GB – comfortable |
AWQ is the practical production choice. FP8 technically fits but leaves little KV cache room for concurrency.
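The headroom figures above follow from KV-cache arithmetic. A minimal sketch, assuming the architecture values published in Qwen2.5-14B's config (48 layers, 8 KV heads under grouped-query attention, head dim 128) and an FP16 KV cache:

```python
# Rough KV-cache sizing for Qwen2.5-14B. Layer/head counts are
# assumptions taken from the model's published config.
layers = 48
kv_heads = 8          # grouped-query attention
head_dim = 128
bytes_per_elem = 2    # FP16 KV cache

def kv_bytes_per_token():
    # K and V tensors, per layer, per KV head
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def tokens_that_fit(free_gb):
    # How many cached context tokens fit in the leftover VRAM
    return int(free_gb * 1024**3 / kv_bytes_per_token())

print(kv_bytes_per_token())   # ~192 KiB per token
print(tokens_that_fit(8))     # ~43k tokens in the AWQ headroom
```

At ~192 KiB per token, the ~8 GB left after AWQ weights holds roughly 43k cached tokens, i.e. two to three full 16k contexts in flight; the ~2 GB left after FP8 holds barely one 8k context, which is why FP8 is marked tight.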
Deployment
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching
```
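The server above speaks the OpenAI chat-completions protocol. A minimal client sketch, assuming vLLM's default bind of port 8000 on localhost (the network call itself is left commented so the snippet stands alone):

```python
# Minimal client for the OpenAI-compatible endpoint vLLM exposes.
# Host and port are assumptions (vLLM defaults to 0.0.0.0:8000).
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt, max_tokens=256):
    payload = {
        "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# req = build_request("Explain AWQ quantisation in two sentences.")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI-compatible SDK works the same way; only the base URL and model name change.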
Performance
| Metric | AWQ |
|---|---|
| Batch 1 decode | ~44 t/s |
| Batch 4 aggregate | ~155 t/s |
| Batch 8 aggregate | ~240 t/s |
| Batch 16 aggregate | ~380 t/s |
| TTFT 1k prompt | ~280 ms |
Reasonable for 6-10 concurrent users at chat SLAs. For higher concurrency on 14B, step up to an RTX 5080 or RTX 3090.
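The 6-10 user estimate falls out of the aggregate figures: divide by batch size to get the decode speed each user actually sees. A quick check using the table's numbers:

```python
# Per-user decode speed implied by the aggregate throughput table
# (batch size -> aggregate tokens/s, values from the table above).
aggregate = {1: 44, 4: 155, 8: 240, 16: 380}

per_user = {b: t / b for b, t in aggregate.items()}

# At batch 8 each user still sees ~30 t/s, well above typical
# chat reading speed; at batch 16 it drops to ~24 t/s.
for b, t in sorted(per_user.items()):
    print(f"batch {b:>2}: {t:.1f} t/s per user")
```

Anywhere above roughly 15-20 t/s per user feels instant in chat, which is why batch 8 is comfortable and batch 16 is the practical ceiling.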
vs Smaller Alternatives
| Model | MMLU | Speed on 5060 Ti |
|---|---|---|
| Mistral 7B FP8 | ~66 | ~110 t/s |
| Qwen 2.5 7B AWQ | ~71 | ~100 t/s |
| Llama 3 8B FP8 | ~70 | ~105 t/s |
| Qwen 2.5 14B AWQ | ~77 | ~44 t/s |
The 14B buys ~6 MMLU points at roughly half the per-user speed. Pick 14B when answer quality matters more than concurrency.
Qwen 14B on Single Card
The step up in reasoning from 7B. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: monthly cost, Qwen Coder 14B.