The 14B slot in Qwen 2.5 lands between the 7B and 32B in capability and, once quantised (FP8 or AWQ INT4), fits comfortably on a 16 GB RTX 5080 on our dedicated hosting. It punches well above its size on reasoning and coding benchmarks while remaining single-GPU friendly.
VRAM Footprint
| Precision | Weights | Total with KV cache (8k ctx) |
|---|---|---|
| FP16 | ~28 GB | Does not fit on 16 GB |
| FP8 | ~14 GB | Tight on 16 GB |
| AWQ INT4 | ~8 GB | Comfortable with batching |
| GPTQ INT4 | ~8 GB | Comfortable with batching |
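The weight column is just parameter count times bytes per parameter; the KV cache adds roughly 1.5 GB per 8k-token sequence on top. Below is a back-of-the-envelope sketch in Python; the layer count, KV heads, and head dimension are assumed from the published Qwen2.5-14B config, and real deployments also need headroom for activations and quantisation scales.

```python
# Back-of-the-envelope VRAM estimate for Qwen2.5-14B.
# Architecture values assumed from the published model config
# (48 layers, 8 KV heads, head_dim 128) -- treat them as approximate.

PARAMS = 14.7e9          # total parameters, approximate
LAYERS = 48
KV_HEADS = 8             # grouped-query attention
HEAD_DIM = 128
KV_BYTES = 2             # KV cache kept in FP16 by default

def weight_gib(bits_per_param: float) -> float:
    """Weight memory in GiB for a given precision."""
    return PARAMS * bits_per_param / 8 / 2**30

def kv_cache_gib(context_len: int, batch: int = 1) -> float:
    """KV cache in GiB: K and V tensors per layer, per token."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES
    return per_token * context_len * batch / 2**30

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    total = weight_gib(bits) + kv_cache_gib(8192)
    print(f"{name}: weights ~{weight_gib(bits):.1f} GiB, "
          f"8k-ctx KV ~{kv_cache_gib(8192):.2f} GiB -> ~{total:.1f} GiB total")
```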
Setup
AWQ is the sweet spot – good quality, good speed, plenty of room for concurrent sequences:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching
```
For FP8 (Blackwell native):
```bash
--model neuralmagic/Qwen2.5-14B-Instruct-FP8 --quantization fp8
```
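Either launch exposes an OpenAI-compatible endpoint, by default on port 8000. Here is a minimal smoke test with the `openai` Python client; the base URL and dummy API key are assumptions for a local single-card deployment.

```python
# Quick smoke test against the vLLM OpenAI-compatible server.
# Assumes the server from the command above is listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",   # must match the --model flag exactly
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=64,
    temperature=0.7,
)
print(resp.choices[0].message.content)
print("completion tokens:", resp.usage.completion_tokens)
```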
Performance
On an RTX 5080 (16 GB):
| Scenario | Throughput |
|---|---|
| AWQ INT4, batch 1 | ~85 t/s |
| AWQ INT4, batch 8 | ~420 t/s aggregate |
| AWQ INT4, batch 32 | ~780 t/s aggregate |
| FP8, batch 1 | ~65 t/s |
| FP8, batch 8 | ~320 t/s aggregate |
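If you want to sanity-check throughput on your own box, a rough harness is enough: fire a batch of concurrent requests and divide total completion tokens by wall-clock time. The sketch below is not the harness behind the table above; the batch size, prompt, and token limit are arbitrary placeholders.

```python
# Rough aggregate-throughput check against the running server.
# Fires BATCH concurrent requests and divides completion tokens by wall time.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "Qwen/Qwen2.5-14B-Instruct-AWQ"
BATCH = 8          # concurrent requests, roughly comparable to the table rows
MAX_TOKENS = 256   # generation length per request

def one_request(i: int) -> int:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Explain topic #{i} in detail."}],
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=BATCH) as pool:
    tokens = sum(pool.map(one_request, range(BATCH)))
elapsed = time.perf_counter() - start
print(f"{tokens} tokens in {elapsed:.1f}s -> ~{tokens / elapsed:.0f} t/s aggregate")
```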
Qwen 2.5 14B on a Single Blackwell Card
RTX 5080 UK dedicated servers preconfigured for Qwen AWQ or FP8.
Browse GPU Servers
Vs Alternatives
Qwen 2.5 14B beats Llama 3 8B on most reasoning tasks at the cost of about 30% more VRAM. It trails Qwen 2.5 32B meaningfully, but it fits on a single 16 GB card where the 32B needs more capacity. For the next step up see Qwen Coder 32B, and for the flagship see Qwen 2.5 72B deployment.
See also B70 vs RTX 5080 for LLM serving – the B70’s 32 GB lets you run the 14B at FP16 without quantising.