Qwen 2.5 72B is one of the top-tier open-weights models in 2026, competitive with Llama 3 70B on English-language reasoning and clearly ahead on Chinese tasks. Self-hosting it on our dedicated GPU hosting takes the same class of hardware as Llama 3 70B, with a few format-specific details to get right.
## VRAM Requirements

| Precision | Weights | Recommended VRAM |
|---|---|---|
| FP16 | ~144 GB | Multi-GPU only |
| FP8 | ~72 GB | Comfortable on a single 96 GB card |
| AWQ INT4 | ~42 GB | 48 GB, or two 24 GB cards |
| GPTQ INT4 | ~42 GB | 48 GB, or two 24 GB cards |
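The table's figures can be sanity-checked with back-of-envelope math. A minimal sketch, assuming the shape values from the published Qwen2.5-72B config (~72.7B parameters, 80 layers, 8 KV heads under GQA, head dim 128) and decimal GB; note the quantized checkpoints land a few GB above the raw weight math because of quantization scales and unquantized embedding layers:

```python
# Back-of-envelope VRAM math for Qwen 2.5 72B. Figures are decimal GB.
# Shape values assumed from the published Qwen2.5-72B config:
# ~72.7B parameters, 80 layers, 8 KV heads (GQA), head_dim 128.

PARAMS = 72.7e9
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def weights_gb(bytes_per_param: float) -> float:
    """Weight memory alone, before KV cache and runtime overhead."""
    return PARAMS * bytes_per_param / 1e9

def kv_cache_gb(context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x kv_heads x head_dim per token."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens / 1e9

print(f"FP16 weights: {weights_gb(2):.0f} GB")    # ~145 GB -> multi-GPU only
print(f"INT4 weights: {weights_gb(0.5):.0f} GB")  # ~36 GB before quant overhead
print(f"KV cache @16k ctx: {kv_cache_gb(16384):.1f} GB")
```

GQA is what keeps the KV cache this small: with only 8 KV heads, a full 16k context costs about 5.4 GB rather than the ~43 GB it would with all 64 attention heads cached.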
## GPU Options
- RTX 6000 Pro 96GB: single-card FP8 with real headroom. Best single-GPU option.
- Two RTX 5090s: tensor-parallel INT4 with decent concurrency.
- Two RTX 3090s: budget tensor-parallel INT4.
## Deployment
On a 6000 Pro via vLLM:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 \
  --quantization gptq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.93 \
  --enable-prefix-caching
```
On two 5090s via tensor parallel:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```
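Either command exposes an OpenAI-compatible API (port 8000 by default in vLLM). A minimal client sketch, assuming that default port and the AWQ model name from the command above; the `model` field must match whatever you passed to `--model`:

```python
import json
import urllib.request

# Assumption: vLLM's default bind of localhost:8000.
BASE_URL = "http://localhost:8000/v1"

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": "Qwen/Qwen2.5-72B-Instruct-AWQ",  # must match --model exactly
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def send(payload: dict) -> dict:
    """POST the payload to the running server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_payload("Summarise grouped-query attention in one line.")
# reply = send(payload)  # requires the vLLM server above to be running
```

Any OpenAI-SDK-based tooling works the same way: point its base URL at the server and set the model name accordingly.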
## Performance
| Configuration | Batch 1 (t/s) | Batch 16 (t/s) |
|---|---|---|
| 6000 Pro, AWQ INT4 | ~42 | ~480 |
| 6000 Pro, FP8 | ~36 | ~420 |
| 2× 5090 TP, AWQ | ~28 | ~400 |
| 2× 3090 TP, AWQ | ~22 | ~320 |
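Note that the batch-16 column is aggregate throughput, which divides across concurrent streams. A quick sanity calculation on the 6000 Pro AWQ row:

```python
# Aggregate batch throughput divides across concurrent streams,
# so per-user speed drops as the batch fills.

def per_stream_tps(aggregate_tps: float, batch: int) -> float:
    return aggregate_tps / batch

def seconds_for(tokens: int, tps: float) -> float:
    return tokens / tps

# 6000 Pro, AWQ INT4: ~42 t/s at batch 1, ~480 t/s aggregate at batch 16
print(per_stream_tps(480, 16))        # 30.0 t/s per user at full batch
print(round(seconds_for(1000, 42)))   # ~24 s for a 1,000-token reply at batch 1
```

So even fully loaded, each of 16 concurrent users still sees roughly 30 t/s, well above comfortable reading speed.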
## Host Qwen 2.5 72B on UK Dedicated
Single-card 96GB or tensor-parallel 5090 servers – we provision either same day.
Browse GPU Servers

For smaller Qwen variants see Qwen 2.5 14B on 5080 and Qwen Coder 32B. For the head-to-head comparison, see Llama 3.3 70B on 6000 Pro.