Qwen 2.5 Quantisation Overview
Qwen 2.5 is available in sizes from 0.5B to 72B parameters, and quantisation is the key to running these models cost-effectively on a dedicated GPU server. This guide benchmarks the 7B and 72B variants across GPTQ, AWQ, and GGUF formats to help you pick the right combination for your hardware and quality requirements.
For baseline hosting options see our Qwen hosting page. For VRAM requirements at different context lengths, check the Qwen 2.5 context length VRAM guide.
Qwen 2.5 7B Speed by Format
Benchmarks run on GigaGPU servers with 512 input tokens and 256 output tokens. GPTQ and AWQ runs use vLLM; GGUF runs use llama.cpp with full GPU offload.
| GPU | FP16 (tok/s) | GPTQ 4-bit (tok/s) | AWQ 4-bit (tok/s) | GGUF Q4_K_M (tok/s) |
|---|---|---|---|---|
| RTX 4060 (8 GB) | N/A | 26 | 24 | 20 |
| RTX 4060 Ti (16 GB) | 32 | 46 | 43 | 36 |
| RTX 3090 (24 GB) | 43 | 60 | 57 | 48 |
| RTX 5080 (16 GB) | 68 | 90 | 86 | 72 |
| RTX 5090 (32 GB) | 95 | 128 | 122 | 100 |
Performance patterns mirror other 7B models: GPTQ leads AWQ by roughly 5-8%, and GGUF trails GPTQ by 20-25%. See our LLaMA 3 8B speed comparison for a direct cross-model benchmark.
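As a rough reproduction of the 7B setup above, the sketch below loads a 4-bit GPTQ build of Qwen 2.5 7B in vLLM and generates 256 output tokens. The model ID, prompt, and sampling settings are illustrative assumptions, not the exact harness used for these numbers.

```python
# Minimal vLLM sketch mirroring the 7B benchmark setup (GPTQ 4-bit).
# Model ID and sampling settings are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",  # assumed HF repo; swap in your own build
    quantization="gptq",   # vLLM can also auto-detect this from the model config
    max_model_len=8192,    # room for 512-token prompts plus generation headroom
)

params = SamplingParams(temperature=0.7, max_tokens=256)  # 256 output tokens, as benchmarked
outputs = llm.generate(["Explain GPTQ quantisation in two sentences."], params)
print(outputs[0].outputs[0].text)
```

The same script works for AWQ builds by pointing `model` at an AWQ checkpoint; vLLM picks up the quantisation method from the model config.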
Qwen 2.5 72B Speed by Format
The 72B model requires multi-GPU setups at all precision levels. Tensor parallelism via vLLM is used for all GPU configurations.
| GPU Config | Total VRAM | FP16 (tok/s) | GPTQ 4-bit (tok/s) | AWQ 4-bit (tok/s) |
|---|---|---|---|---|
| 2x RTX 3090 | 48 GB | N/A | 12 | 11 |
| 2x RTX 5090 | 64 GB | N/A | 22 | 21 |
| 4x RTX 3090 | 96 GB | N/A | 18 | 17 |
| 2x RTX 6000 Pro 96 GB | 192 GB | 20 | 32 | 30 |
| 4x RTX 6000 Pro 96 GB | 384 GB | 35 | 48 | 46 |
INT4 quantisation makes 72B accessible on consumer hardware — 2x RTX 5090 (64 GB) delivers 22 tok/s with GPTQ. For deployment strategies, see our model sharding guide.
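A minimal vLLM sketch for the multi-GPU case is below: the only change from the 7B example is `tensor_parallel_size`, which must match the number of GPUs in the configurations above. The model ID and memory setting are assumptions for illustration.

```python
# Sketch: Qwen 2.5 72B GPTQ 4-bit across 2 GPUs via vLLM tensor parallelism.
# Model ID and gpu_memory_utilization are assumptions; tune to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",  # assumed HF repo name
    quantization="gptq",
    tensor_parallel_size=2,       # e.g. 2x RTX 5090 from the table above
    gpu_memory_utilization=0.92,  # leave some VRAM for KV cache and activations
)

outputs = llm.generate(
    ["Summarise the trade-offs between GPTQ and AWQ."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```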
Quality Comparison
Quality is measured as a percentage of the FP16 baseline score across coding, reasoning, and general benchmarks.
| Model | Format | Coding | Reasoning | General |
|---|---|---|---|---|
| 7B | GPTQ 4-bit | 95% | 96% | 97% |
| 7B | AWQ 4-bit | 96% | 97% | 97% |
| 7B | GGUF Q4_K_M | 96% | 97% | 98% |
| 72B | GPTQ 4-bit | 96% | 97% | 98% |
| 72B | AWQ 4-bit | 97% | 97% | 98% |
The 72B model retains quality better than the 7B under quantisation, since larger models carry more redundancy in their weight matrices. AWQ edges out GPTQ on quality by roughly one percentage point at both sizes.
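The retention figures above are simply the quantised benchmark score divided by the FP16 score in each category. A toy sketch of that calculation (with placeholder scores, not our measured data) looks like this:

```python
# Toy sketch of the quality metric: quantised score as a percentage of the
# FP16 baseline, per category. Scores below are placeholders, not measured data.
fp16 = {"coding": 62.0, "reasoning": 71.0, "general": 74.0}
gptq = {"coding": 58.9, "reasoning": 68.2, "general": 71.8}

retention = {k: round(100 * gptq[k] / fp16[k], 1) for k in fp16}
print(retention)  # {'coding': 95.0, 'reasoning': 96.1, 'general': 97.0}
```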
Format Recommendations
- Qwen 2.5 7B on single GPU: use GPTQ 4-bit for speed or AWQ 4-bit for quality. Both fit on 8 GB GPUs with room for 8K context.
- Qwen 2.5 72B on multi-GPU: AWQ 4-bit recommended — quality preservation matters more at this scale, and speed difference vs GPTQ is minimal.
- High-concurrency serving: GPTQ 4-bit with continuous batching maximises throughput per GPU dollar.
- CPU/GPU hybrid: GGUF for the 7B model only; the 72B is impractical for partial offloading due to size. A minimal offload sketch follows this list.
- Quality-critical tasks: INT8 or FP16 if VRAM permits. See FP16 vs INT8 vs INT4 for trade-off details.
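For the CPU/GPU hybrid case, a minimal llama-cpp-python sketch with partial offload might look like the following. The GGUF filename and layer count are placeholders; set `n_gpu_layers` to whatever fits your VRAM.

```python
# Sketch: CPU/GPU hybrid inference of Qwen 2.5 7B Q4_K_M with llama-cpp-python.
# The GGUF path and n_gpu_layers value are assumptions; tune layers to your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=-1,  # -1 offloads every layer; lower it to spill layers to CPU
    n_ctx=8192,
)

out = llm.create_completion("Explain AWQ in one sentence.", max_tokens=256)
print(out["choices"][0]["text"])
```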
For throughput tuning, review our vLLM optimisation guide. Browse all benchmarks in the Benchmarks category.
Conclusion
Qwen 2.5 quantises well at both the 7B and 72B scale. GPTQ 4-bit delivers peak speed, AWQ 4-bit offers marginally better quality, and both cut VRAM by 70%+ compared to FP16. For production deployments, match the format to your serving stack — vLLM users should prefer GPTQ or AWQ, while llama.cpp users should stick with GGUF. Check our tokens per second benchmark hub for broader comparisons.
Run Qwen 2.5 at Any Scale
From single-GPU 7B deployments to multi-GPU 72B clusters. Pre-configured for vLLM with GPTQ and AWQ support.
Browse GPU Servers