
Qwen 2.5 72B Tokens/sec by GPU

Benchmark data for Qwen 2.5 72B inference across consumer and professional GPUs, with quantisation comparisons and cost-per-token analysis for dedicated GPU hosting.

Qwen 2.5 72B Benchmark Overview

Qwen 2.5 72B is one of the largest open-weight language models, rivalling proprietary offerings in reasoning, multilingual support, and code generation. Running a 72B-parameter model requires serious hardware, so choosing the right dedicated GPU server is essential to balancing performance with cost. In this benchmark we test inference speed across six GPUs using vLLM.

At FP16 precision, Qwen 2.5 72B needs approximately 140 GB of VRAM, placing it out of reach for any single consumer GPU. All single-GPU results here use GPTQ 4-bit quantisation (roughly 36 GB), with multi-GPU configurations noted where applicable. For our full benchmark methodology, see the tokens per second benchmark hub.
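These VRAM figures follow from simple arithmetic on the parameter count. A minimal sketch (the helper below is illustrative and counts weights only; KV cache, activations, and quantisation scales add real-world overhead on top):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough VRAM needed for model weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Qwen 2.5 72B weight footprint at common precisions
print(f"FP16: {weight_vram_gb(72, 16):.0f} GB")  # 144 GB -- the "approximately 140 GB" class
print(f"INT8: {weight_vram_gb(72, 8):.0f} GB")   # 72 GB
print(f"INT4: {weight_vram_gb(72, 4):.0f} GB")   # 36 GB -- splits across two 24 GB or 32 GB cards
```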

Tokens/sec Results by GPU

The following table shows Qwen 2.5 72B output speed in tokens per second using INT4 (GPTQ 4-bit) quantisation, which is the most practical precision for single-GPU deployments of this model.

| GPU | VRAM | Qwen 2.5 72B INT4 (tok/s) | Notes |
|---|---|---|---|
| RTX 3050 | 6 GB | N/A | Insufficient VRAM |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM |
| RTX 4060 Ti | 16 GB | N/A | Insufficient VRAM even at INT4 |
| RTX 3090 | 24 GB | N/A | Needs offloading; impractical |
| RTX 5080 | 16 GB | N/A | Insufficient VRAM |
| RTX 5090 | 32 GB | N/A | Requires partial offload; ~4 tok/s |
| 2x RTX 3090 | 48 GB | 8 | Tensor parallel across 2 GPUs |
| 2x RTX 5090 | 64 GB | 18 | Comfortable fit with headroom |

A 72B model at INT4 requires approximately 36 GB of VRAM for the weights alone, which means a dual-GPU setup at minimum among the consumer cards tested. The RTX 5090 pair delivers the best single-node performance for this model class.
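For reference, a dual-GPU vLLM launch for this kind of setup might look like the sketch below. The Hugging Face model ID, context length, and memory-utilisation value are assumptions to adapt to your own hardware, not a tested configuration:

```shell
# Serve Qwen 2.5 72B (GPTQ 4-bit) across two GPUs with tensor parallelism.
# Model ID and flag values are illustrative -- tune for your VRAM budget.
vllm serve Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92
```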

Quantisation Impact on Speed

Since no single consumer GPU can run Qwen 2.5 72B at FP16, quantisation is mandatory. Below we compare INT4 and INT8 on dual-GPU configurations. For a deep dive into quantisation trade-offs, see our LLaMA 3 8B FP16 vs INT8 vs INT4 analysis.

| Configuration | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|
| 2x RTX 3090 (48 GB) | 5 | 8 |
| 2x RTX 5090 (64 GB) | 13 | 18 |

INT4 provides roughly a 40-60% speed improvement over INT8 on these dual-GPU setups (60% on the RTX 3090 pair, closer to 40% on the RTX 5090 pair). For tasks demanding the highest accuracy, INT8 is preferable, but most production deployments will find INT4 more than adequate.
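The improvement figures are straightforward to derive from the table above (a minimal check using the benchmark numbers as given):

```python
# INT4 vs INT8 throughput gain, from the dual-GPU results above
results = {
    "2x RTX 3090": {"int8": 5, "int4": 8},
    "2x RTX 5090": {"int8": 13, "int4": 18},
}

for config, r in results.items():
    gain_pct = (r["int4"] / r["int8"] - 1) * 100
    print(f"{config}: INT4 is {gain_pct:.0f}% faster than INT8")
```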

Cost Efficiency Analysis

Running a 72B model is expensive. Here we show tokens per second per pound of monthly dedicated hosting cost for the viable configurations.

| Configuration | INT4 tok/s | Approx. Monthly Cost | tok/s per Pound |
|---|---|---|---|
| 2x RTX 3090 | 8 | ~£210 | 0.038 |
| 2x RTX 5090 | 18 | ~£480 | 0.038 |

Interestingly, both configurations offer similar cost efficiency. The dual RTX 3090 setup is the budget-friendly option, while the dual RTX 5090 is for teams that need lower latency. If a 72B model is overkill, consider the Qwen 2.5 7B benchmark for a more accessible option.
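Both efficiency figures can be reproduced from the table, and converting them to cost per million tokens makes the comparison concrete. A sketch, assuming 24/7 saturation (an upper bound on monthly throughput that real workloads will not reach):

```python
# Cost efficiency from the table above: (name, tok/s, approx monthly cost in GBP)
setups = [("2x RTX 3090", 8, 210), ("2x RTX 5090", 18, 480)]
SECONDS_PER_MONTH = 86400 * 30

for name, tok_s, monthly_gbp in setups:
    per_pound = tok_s / monthly_gbp                # ~0.038 for both configurations
    tokens_per_month = tok_s * SECONDS_PER_MONTH   # assumes full saturation
    gbp_per_million = monthly_gbp / (tokens_per_month / 1e6)
    print(f"{name}: {per_pound:.3f} tok/s per pound, ~£{gbp_per_million:.2f} per million tokens")
```

On these numbers the two setups land within a couple of percent of each other per token, which supports the similar-cost-efficiency conclusion above.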

GPU Recommendations

  • Minimum viable: 2x RTX 3090 — delivers 8 tok/s at INT4, suitable for batch processing and low-traffic APIs.
  • Recommended: 2x RTX 5090 — 18 tok/s is practical for real-time chatbot applications with moderate concurrency.
  • Alternative: Consider the best GPU for Qwen guide if you want to explore enterprise GPU options such as the RTX 6000 Pro.

For comparisons with other large models, check the Mistral Large benchmark or our LLaMA 3 70B results. Browse all results in the Benchmarks category.

Conclusion

Qwen 2.5 72B is a powerhouse model that demands multi-GPU hardware. With INT4 quantisation and a dual RTX 5090 setup, you can achieve responsive inference speeds suitable for production workloads. For teams working within tighter budgets, the dual RTX 3090 configuration provides a solid foundation for development and batch processing.

Multi-GPU Servers for Large Language Models

Deploy Qwen 2.5 72B on dual-GPU dedicated servers with full root access and NVMe storage.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
