Qwen 2.5 72B Benchmark Overview
Qwen 2.5 72B is one of the largest open-weight language models, rivalling proprietary offerings in reasoning, multilingual support, and code generation. Running a 72B-parameter model requires serious hardware, so choosing the right dedicated GPU server is essential to balancing performance with cost. In this benchmark we test inference speed across six GPUs using vLLM.
At FP16 precision, Qwen 2.5 72B needs approximately 140 GB of VRAM, placing it out of reach for any single consumer GPU. All single-GPU results here use GPTQ 4-bit quantisation (roughly 36 GB), with multi-GPU configurations noted where applicable. For our full benchmark methodology, see the tokens per second benchmark hub.
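The VRAM figures above follow directly from the parameter count. A minimal sketch of the weights-only arithmetic (KV cache and activation overhead add more on top, so real deployments need headroom beyond these numbers):

```python
# Rough weights-only VRAM estimate for a model at a given precision.
# Illustrative arithmetic only: KV cache, activations, and framework
# overhead are not included.

def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Return weight memory in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

print(f"FP16: {weight_vram_gb(72, 16):.0f} GB")  # ~144 GB, matching the ~140 GB figure above
print(f"INT8: {weight_vram_gb(72, 8):.0f} GB")
print(f"INT4: {weight_vram_gb(72, 4):.0f} GB")   # ~36 GB, the GPTQ 4-bit footprint
```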
Tokens/sec Results by GPU
The following table shows Qwen 2.5 72B output speed in tokens per second using INT4 (GPTQ 4-bit) quantisation, which is the most practical precision for single-GPU deployments of this model.
| GPU | VRAM | Qwen 2.5 72B INT4 (tok/s) | Notes |
|---|---|---|---|
| RTX 3050 | 6 GB | N/A | Insufficient VRAM |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM |
| RTX 4060 Ti | 16 GB | N/A | Insufficient VRAM even at INT4 |
| RTX 3090 | 24 GB | N/A | Needs offloading; impractical |
| RTX 5080 | 16 GB | N/A | Insufficient VRAM |
| RTX 5090 | 32 GB | ~4 tok/s | Requires partial CPU offload; impractical for production |
| 2x RTX 3090 | 48 GB | 8 tok/s | Tensor parallel across 2 GPUs |
| 2x RTX 5090 | 64 GB | 18 tok/s | Comfortable fit with headroom |
A 72B model at INT4 requires approximately 36 GB of VRAM, which means at minimum a dual-GPU setup. The RTX 5090 pair delivers the best single-node performance for this model class.
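For reference, a dual-GPU run like the ones in the table can be launched with vLLM's tensor-parallel support. This is a sketch based on the public vLLM CLI, not a verbatim copy of our benchmark harness; the model name is the GPTQ 4-bit repo on Hugging Face, and the context length is an illustrative choice:

```shell
# Serve Qwen 2.5 72B (GPTQ 4-bit) split across two GPUs with vLLM.
# --tensor-parallel-size 2 shards the weights across both cards;
# --max-model-len caps the context to keep KV cache within VRAM headroom.
vllm serve Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --quantization gptq \
  --max-model-len 8192
```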
Quantisation Impact on Speed
Since no single consumer GPU can run Qwen 2.5 72B at FP16, quantisation is mandatory. Below we compare INT4 and INT8 on dual-GPU configurations. For a deep dive into quantisation trade-offs, see our LLaMA 3 8B FP16 vs INT8 vs INT4 analysis.
| Configuration | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|
| 2x RTX 3090 (48 GB) | 5 tok/s | 8 tok/s |
| 2x RTX 5090 (64 GB) | 13 tok/s | 18 tok/s |
INT4 provides a 40–60% speed improvement over INT8 on these dual-GPU setups (60% on the dual 3090s, roughly 38% on the dual 5090s). For tasks demanding the highest accuracy, INT8 is preferable, but most production deployments will find INT4 more than adequate.
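The relative gains can be read straight off the table above:

```python
# INT4 speedup over INT8, computed from the benchmark table.
results = {
    "2x RTX 3090": {"int8": 5, "int4": 8},
    "2x RTX 5090": {"int8": 13, "int4": 18},
}

for config, toks in results.items():
    speedup = (toks["int4"] - toks["int8"]) / toks["int8"] * 100
    print(f"{config}: INT4 is {speedup:.0f}% faster than INT8")
```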
Cost Efficiency Analysis
Running a 72B model is expensive. Here we show tokens per second per pound of monthly dedicated hosting cost for the viable configurations.
| Configuration | INT4 tok/s | Approx. Monthly Cost | tok/s per Pound |
|---|---|---|---|
| 2x RTX 3090 | 8 | ~£210 | 0.038 |
| 2x RTX 5090 | 18 | ~£480 | 0.038 |
Interestingly, both configurations offer similar cost efficiency. The dual RTX 3090 setup is the budget-friendly option, while the dual RTX 5090 is for teams that need lower latency. If a 72B model is overkill, consider the Qwen 2.5 7B benchmark for a more accessible option.
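The cost-efficiency figures in the table are simply throughput divided by monthly cost; a quick check (costs are the approximate values quoted above):

```python
# tok/s per pound of monthly hosting cost, from the table above.
configs = {
    "2x RTX 3090": {"tok_s": 8, "monthly_gbp": 210},
    "2x RTX 5090": {"tok_s": 18, "monthly_gbp": 480},
}

for name, c in configs.items():
    efficiency = c["tok_s"] / c["monthly_gbp"]
    print(f"{name}: {efficiency:.4f} tok/s per pound/month")
```

Both land within a whisker of each other (~0.038 at three decimal places), which is why the choice comes down to latency needs rather than value for money.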
GPU Recommendations
- Minimum viable: 2x RTX 3090 — delivers 8 tok/s at INT4, suitable for batch processing and low-traffic APIs.
- Recommended: 2x RTX 5090 — 18 tok/s is practical for real-time chatbot applications with moderate concurrency.
- Alternative: Consider the best GPU for Qwen guide if you want to explore enterprise GPU options like the RTX 6000 Pro.
For comparisons with other large models, check the Mistral Large benchmark or our LLaMA 3 70B results. Browse all results in the Benchmarks category.
Conclusion
Qwen 2.5 72B is a powerhouse model that demands multi-GPU hardware. With INT4 quantisation and a dual RTX 5090 setup, you can achieve responsive inference speeds suitable for production workloads. For teams working within tighter budgets, the dual RTX 3090 configuration provides a solid foundation for development and batch processing.
Multi-GPU Servers for Large Language Models
Deploy Qwen 2.5 72B on dual-GPU dedicated servers with full root access and NVMe storage.
Browse GPU Servers