Qwen 2.5 72B Benchmark Overview
Qwen 2.5 72B is one of the largest open-weight language models, rivalling proprietary offerings in reasoning, multilingual support, and code generation. Running a 72B-parameter model requires serious hardware, so choosing the right dedicated GPU server is essential to balancing performance with cost. In this benchmark we test inference speed across six GPUs using vLLM.
At FP16 precision, Qwen 2.5 72B needs approximately 140 GB of VRAM, placing it out of reach for any single consumer GPU. All single-GPU results here use GPTQ 4-bit quantisation (roughly 36 GB), with multi-GPU configurations noted where applicable. For our full benchmark methodology, see the tokens per second benchmark hub.
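The VRAM figures above follow directly from the parameter count. A minimal sketch of the weights-only arithmetic (KV cache and activation overhead add more on top, so real deployments need headroom beyond these numbers):

```python
# Rough weights-only VRAM estimate for a model at a given precision.
# Illustrative arithmetic only: KV cache, activations, and framework
# overhead are not included.

def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Return weight memory in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

print(f"FP16: {weight_vram_gb(72, 16):.0f} GB")  # ~144 GB, matching the ~140 GB figure above
print(f"INT8: {weight_vram_gb(72, 8):.0f} GB")
print(f"INT4: {weight_vram_gb(72, 4):.0f} GB")   # ~36 GB, the GPTQ 4-bit footprint
```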
Tokens/sec Results by GPU
The following table shows Qwen 2.5 72B output speed in tokens per second using INT4 (GPTQ 4-bit) quantisation, which is the most practical precision for single-GPU deployments of this model.
| GPU | VRAM | Qwen 2.5 72B INT4 (tok/s) | Notes |
|---|---|---|---|
| RTX 3050 | 6 GB | N/A | Insufficient VRAM |
| RTX 4060 | 8 GB | N/A | Insufficient VRAM |
| RTX 4060 Ti | 16 GB | N/A | Insufficient VRAM even at INT4 |
| RTX 3090 | 24 GB | N/A | Needs offloading; impractical |
| RTX 5080 | 16 GB | N/A | Insufficient VRAM |
| RTX 5090 | 32 GB | ~4 tok/s | Requires partial CPU offload; impractical for production |
| 2x RTX 3090 | 48 GB | 8 tok/s | Tensor parallel across 2 GPUs |
| 2x RTX 5090 | 64 GB | 18 tok/s | Comfortable fit with headroom |
A 72B model at INT4 requires approximately 36 GB of VRAM, which means at minimum a dual-GPU setup. The RTX 5090 pair delivers the best single-node performance for this model class.
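For reference, a dual-GPU run like the ones in the table can be launched with vLLM's tensor-parallel support. This is a sketch based on the public vLLM CLI, not a verbatim copy of our benchmark harness; the model name is the GPTQ 4-bit repo on Hugging Face, and the context length is an illustrative choice:

```shell
# Serve Qwen 2.5 72B (GPTQ 4-bit) split across two GPUs with vLLM.
# --tensor-parallel-size 2 shards the weights across both cards;
# --max-model-len caps the context to keep KV cache within VRAM headroom.
vllm serve Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --quantization gptq \
  --max-model-len 8192
```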
Quantisation Impact on Speed
Since no single consumer GPU can run Qwen 2.5 72B at FP16, quantisation is mandatory. Below we compare INT4 and INT8 on dual-GPU configurations. For a deep dive into quantisation trade-offs, see our LLaMA 3 8B FP16 vs INT8 vs INT4 analysis.
| Configuration | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|
| 2x RTX 3090 (48 GB) | 5 tok/s | 8 tok/s |
| 2x RTX 5090 (64 GB) | 13 tok/s | 18 tok/s |
INT4 provides a 40–60% speed improvement over INT8 on these dual-GPU setups (60% on the dual 3090s, roughly 38% on the dual 5090s). For tasks demanding the highest accuracy, INT8 is preferable, but most production deployments will find INT4 more than adequate.
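The relative gains can be read straight off the table above:

```python
# INT4 speedup over INT8, computed from the benchmark table.
results = {
    "2x RTX 3090": {"int8": 5, "int4": 8},
    "2x RTX 5090": {"int8": 13, "int4": 18},
}

for config, toks in results.items():
    speedup = (toks["int4"] - toks["int8"]) / toks["int8"] * 100
    print(f"{config}: INT4 is {speedup:.0f}% faster than INT8")
```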
Cost Efficiency Analysis
Running a 72B model is expensive. Here we show tokens per second per pound of monthly dedicated hosting cost for the viable configurations.
| Configuration | INT4 tok/s | Approx. Monthly Cost | tok/s per Pound |
|---|---|---|---|
| 2x RTX 3090 | 8 | ~£210 | 0.038 |
| 2x RTX 5090 | 18 | ~£480 | 0.038 |
Interestingly, both configurations offer similar cost efficiency. The dual RTX 3090 setup is the budget-friendly option, while the dual RTX 5090 is for teams that need lower latency. If a 72B model is overkill, consider the Qwen 2.5 7B benchmark for a more accessible option.
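The cost-efficiency figures in the table are simply throughput divided by monthly cost; a quick check (costs are the approximate values quoted above):

```python
# tok/s per pound of monthly hosting cost, from the table above.
configs = {
    "2x RTX 3090": {"tok_s": 8, "monthly_gbp": 210},
    "2x RTX 5090": {"tok_s": 18, "monthly_gbp": 480},
}

for name, c in configs.items():
    efficiency = c["tok_s"] / c["monthly_gbp"]
    print(f"{name}: {efficiency:.4f} tok/s per pound/month")
```

Both land within a whisker of each other (~0.038 at three decimal places), which is why the choice comes down to latency needs rather than value for money.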
GPU Recommendations
- Minimum viable: 2x RTX 3090 — delivers 8 tok/s at INT4, suitable for batch processing and low-traffic APIs.
- Recommended: 2x RTX 5090 — 18 tok/s is practical for real-time chatbot applications with moderate concurrency.
- Alternative: Consider the best GPU for Qwen guide if you want to explore enterprise GPU options like the RTX 6000 Pro.
For comparisons with other large models, check the Mistral Large benchmark or our LLaMA 3 70B results. Browse all results in the Benchmarks category.
Conclusion
Qwen 2.5 72B is a powerhouse model that demands multi-GPU hardware. With INT4 quantisation and a dual RTX 5090 setup, you can achieve responsive inference speeds suitable for production workloads. For teams working within tighter budgets, the dual RTX 3090 configuration provides a solid foundation for development and batch processing.
Multi-GPU Servers for Large Language Models
Deploy Qwen 2.5 72B on dual-GPU dedicated servers with full root access and NVMe storage.
Browse GPU Servers