
Qwen 2.5 72B Self-Hosted Deployment

Qwen 2.5 72B is arguably the best open-weights 72B class model for English and Chinese. Here is what hosting it on a dedicated GPU actually requires.

Qwen 2.5 72B is one of the top-tier open-weights models in 2026, competitive with Llama 3 70B on English reasoning and clearly ahead on Chinese tasks. Self-hosting it on our dedicated GPU servers takes the same class of hardware as Llama 3 70B, with a few format-specific details to get right.

Contents

  • VRAM Requirements
  • GPU Options
  • Deployment
  • Performance

VRAM Requirements

| Precision | Weights | Recommended Total |
|---|---|---|
| FP16 | ~144 GB | Multi-GPU only |
| FP8 | ~72 GB | 96 GB single card (comfortable) |
| AWQ INT4 | ~42 GB | 48 GB or two 24 GB cards |
| GPTQ INT4 | ~42 GB | 48 GB or two 24 GB cards |
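The weights column follows from simple arithmetic: parameter count times bytes per parameter. A quick sketch of that rule of thumb (note the quantized checkpoints on the Hub come out nearer ~42 GB than the raw 36 GB, because AWQ/GPTQ also store group scales and zero-points):

```shell
# Back-of-envelope weight footprint for a 72B-parameter model.
# bytes-per-param is scaled x10 so INT4 (0.5 B/param) works in integer math.
params_b=72                       # billions of parameters
for fmt in "FP16 20" "FP8 10" "INT4 5"; do
  set -- $fmt
  name=$1; bpp10=$2               # format name, bytes-per-param x10
  gb=$(( params_b * bpp10 / 10 ))
  echo "$name: ~${gb} GB weights"
done
```

On top of the weights you still need room for the KV cache and activations, which is why the recommended totals in the table are higher than the raw weight sizes.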

GPU Options

  • RTX 6000 Pro 96GB: single-card FP8 with real headroom. Best single-GPU option.
  • Two RTX 5090s: tensor-parallel INT4 with decent concurrency.
  • Two RTX 3090s: budget tensor-parallel INT4.

Deployment

On a 6000 Pro via vLLM:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 \
  --quantization gptq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.93 \
  --enable-prefix-caching

On two 5090s via tensor parallel:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 8192
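Either launch command exposes an OpenAI-compatible API (vLLM defaults to port 8000). A minimal smoke test with curl, assuming the second configuration above; the "model" field must match whatever you passed as --model:

```shell
# Minimal chat-completions request against a local vLLM server.
# Assumes the AWQ launch command above; adjust host/port/model to your setup.
BODY='{
  "model": "Qwen/Qwen2.5-72B-Instruct-AWQ",
  "messages": [{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
  "max_tokens": 128
}'
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$BODY" || echo "server not reachable"
```

Any OpenAI SDK pointed at http://localhost:8000/v1 works the same way, so existing client code ports over without changes.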

Performance

| Configuration | Batch 1 t/s | Batch 16 t/s |
|---|---|---|
| 6000 Pro, AWQ INT4 | ~42 | ~480 |
| 6000 Pro, FP8 | ~36 | ~420 |
| 2× 5090 TP, AWQ | ~28 | ~400 |
| 2× 3090 TP, AWQ | ~22 | ~320 |
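Keep in mind the batch-16 figures are aggregate throughput across all concurrent streams, not what each user sees. Dividing by the concurrency gives the per-stream rate, sketched here with the 6000 Pro AWQ numbers from the table:

```shell
# Aggregate batch throughput vs per-stream rate (6000 Pro, AWQ INT4 row).
agg=480      # aggregate t/s at batch 16
batch=16     # concurrent streams
per_stream=$(( agg / batch ))
echo "~${per_stream} t/s per concurrent user"   # ~30 t/s
```

~30 t/s per user is still well above comfortable reading speed, which is why batched serving is the right mode for multi-user workloads.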

Host Qwen 2.5 72B on UK Dedicated

Single-card 96GB or tensor-parallel 5090 servers – we provision either same day.

Browse GPU Servers

For smaller Qwen variants see Qwen 2.5 14B on 5080 and Qwen Coder 32B. For the head-to-head comparison see Llama 3.3 70B on 6000 Pro.



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
