Alibaba’s Qwen 2.5 is the most complete open-weight family on the market, spanning dense models from 0.5B to 72B with strong multilingual and reasoning performance. On the Blackwell RTX 5060 Ti 16GB you can run every variant up to and including the 14B in AWQ int4, and that 14B is the genuine highlight: it delivers 70 tokens per second on a single card with multilingual reasoning quality that approaches Llama 3 70B. This post sizes each variant on Gigagpu UK hosting.
## Contents
- Qwen 2.5 family
- VRAM and precision
- Throughput table
- Qwen 2.5 14B AWQ highlight
- Use cases by variant
- Deployment
## Qwen 2.5 family
| Variant | Params | Context | MMLU | Multilingual | Code (HE) |
|---|---|---|---|---|---|
| Qwen2.5 0.5B | 0.5B | 32k | 47.5 | Good | 30.5 |
| Qwen2.5 1.5B | 1.5B | 32k | 60.9 | Strong | 37.2 |
| Qwen2.5 3B | 3B | 32k | 65.6 | Strong | 48.2 |
| Qwen2.5 7B | 7B | 128k | 74.2 | Excellent | 57.9 |
| Qwen2.5 14B | 14B | 128k | 79.7 | Excellent | 66.7 |
## VRAM and precision
| Variant | Precision | Weights | KV (8k) | Total VRAM |
|---|---|---|---|---|
| Qwen2.5 0.5B | FP16 | 1.1 GB | 0.1 GB | 1.5 GB |
| Qwen2.5 1.5B | FP16 | 3.1 GB | 0.2 GB | 3.7 GB |
| Qwen2.5 3B | FP8 | 3.1 GB | 0.3 GB | 3.8 GB |
| Qwen2.5 7B | FP8 | 7.6 GB | 0.9 GB | 9.0 GB |
| Qwen2.5 7B | BF16 | 15.1 GB | 0.9 GB | 16.0 GB (OOM at 8k) |
| Qwen2.5 14B | AWQ int4 | 8.4 GB | 1.8 GB | 10.6 GB |
| Qwen2.5 14B | GPTQ int4 | 8.2 GB | 1.8 GB | 10.4 GB |
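The footprints above can be roughly reproduced with back-of-envelope arithmetic: weights are parameter count times bits per weight, and the KV cache scales with layer count, KV-head count, head dimension, and sequence length. A minimal sketch, using Qwen2.5-7B's published config (28 layers, 4 KV heads under GQA, head dimension 128); note this is a lower bound, since serving stacks like vLLM preallocate KV blocks and so reserve more in practice:

```python
def weights_gb(n_params_b: float, bits_per_weight: int) -> float:
    """Raw weight footprint in GB (decimal) for a model of n_params_b billion params."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, dtype_bytes: int = 2) -> float:
    """KV-cache footprint in GB for one sequence; factor 2 covers K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes / 1e9

# Qwen2.5-7B at BF16 (16 bits/weight, 7.6B params): ~15.2 GB of weights,
# matching the table's 15.1 GB within rounding.
print(weights_gb(7.6, 16))

# One 8k-token sequence of FP16 KV cache: ~0.47 GB as a floor.
print(kv_cache_gb(28, 4, 128, 8192))
```

The same arithmetic explains why BF16 7B is a non-starter on a 16 GB card while FP8 and int4 leave comfortable headroom.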
## Throughput table
Prompt 256 tokens, output 256 tokens, vLLM 0.6 on the 5060 Ti 16GB.
| Variant | BS=1 (t/s) | BS=8 aggregate (t/s) | BS=16 aggregate (t/s) | TTFT |
|---|---|---|---|---|
| Qwen2.5 0.5B FP16 | 420 | 1,900 | 3,100 | 9 ms |
| Qwen2.5 1.5B FP16 | 280 | 1,420 | 2,300 | 14 ms |
| Qwen2.5 3B FP8 | 210 | 1,080 | 1,780 | 19 ms |
| Qwen2.5 7B FP8 | 118 | 690 | 1,050 | 34 ms |
| Qwen2.5 14B AWQ | 70 | 310 | 520 | 62 ms |
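Single-request latency follows directly from this table: time to first token plus output tokens divided by the decode rate. A quick sanity check (the function name is ours; the numbers are the 14B AWQ row with a 256-token completion):

```python
def request_latency_s(ttft_ms: float, out_tokens: int, decode_tps: float) -> float:
    """End-to-end latency: prefill (TTFT) plus decode time for out_tokens."""
    return ttft_ms / 1000 + out_tokens / decode_tps

# Qwen2.5 14B AWQ, 256-token reply: ~3.7 s end to end at batch size 1.
print(round(request_latency_s(62, 256, 70), 2))
```

At batch size 16 the per-request decode rate drops (520 / 16 ≈ 32 t/s), so the same reply takes roughly 8 s per user while aggregate throughput climbs.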
## Qwen 2.5 14B AWQ highlight
Qwen 2.5 14B is the sweet spot for anyone who needs reasoning quality above Llama 3 8B without jumping to 70B-class hardware. Quantised to AWQ int4 it occupies roughly 10.6 GB, leaving over 4 GB of KV-cache headroom on the 16 GB card: enough for 32k-token conversations. Benchmark highlights:
- MMLU 79.7 – within 2 points of Llama 3 70B.
- GSM8K 83.5 – strong grade-school maths reasoning.
- HumanEval 66.7 – better coding than Llama 3 8B (59.1).
- C-Eval 82.0 – best-in-class Chinese comprehension under 30B.
- MGSM 64 – multilingual maths across 10 languages.
## Use cases by variant
- 0.5B / 1.5B – edge-adjacent routing, ultra-cheap classification, local agents.
- 3B – structured extraction, function-calling, form-filling.
- 7B – general chat, RAG generator, code assistance.
- 14B AWQ – multilingual assistants, reasoning-heavy RAG, complex tool use.
## Deployment

```bash
# Qwen2.5 14B AWQ
docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:v0.6.3 \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq_marlin \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.88 \
  --enable-prefix-caching
```
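Once the container is up, it exposes vLLM's OpenAI-compatible API on port 8000. A minimal stdlib-only client sketch (the prompt and base URL are placeholders; swap in your host):

```python
import json
import urllib.request

def build_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a chat-completions request for the vLLM server started above."""
    payload = {
        "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = urllib.request.urlopen(build_request("Summarise Qwen 2.5 in one sentence."))
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI SDK pointed at `http://localhost:8000/v1` works the same way, since vLLM implements the chat-completions schema.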
## Run the full Qwen 2.5 family on one Blackwell card

0.5B to 14B AWQ, 128k context, native multilingual. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: Qwen 14B benchmark, Qwen VL benchmark, Llama 3 8B benchmark, vLLM setup, prefix caching.