Qwen 2.5 in April 2026
Qwen 2.5 72B from Alibaba Cloud has emerged as one of the most versatile open-source LLMs in April 2026. Released under the Apache 2.0 license, which permits unrestricted commercial use, it delivers strong performance across English, Chinese, and dozens of other languages. Its function-calling capabilities make it particularly well-suited for AI agent deployments. This report captures performance data from GigaGPU dedicated servers.
Throughput Benchmarks by GPU
Qwen 2.5 72B via vLLM at 10 concurrent users:
| GPU Configuration | Precision | Total tok/s | First Token | VRAM Used |
|---|---|---|---|---|
| 1x RTX 5090 | Q4 (AWQ) | 82 | 115 ms | 22 GB |
| 2x RTX 5090 | FP16 | 78 | 125 ms | 44 GB |
| 1x RTX 6000 Pro 96 GB | FP16 | 90 | 110 ms | 66 GB |
| 1x RTX 3090 | Q4 (AWQ) | 32 | 215 ms | 22 GB |
Qwen 2.5 72B performs comparably to LLaMA 3.1 70B on the same hardware, with slightly lower throughput due to architectural differences. The performance gap is small enough that model quality and feature capabilities should drive the selection. Check the tokens per second benchmark for additional configurations.
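The aggregate figures above can be translated into per-user latency with some simple arithmetic. A minimal sketch, assuming the total throughput is shared evenly across the 10 concurrent users (the `per_user_latency` helper and the 500-token response size are illustrative, not part of the benchmark):

```python
# Back-of-envelope sketch: per-user generation time from the aggregate
# throughput figures in the table above (10 concurrent users).

def per_user_latency(total_tok_s: float, users: int,
                     response_tokens: int, ttft_s: float) -> float:
    """Seconds one user waits for a full response, assuming the
    aggregate throughput is split evenly across concurrent users."""
    per_user_tok_s = total_tok_s / users
    return ttft_s + response_tokens / per_user_tok_s

# 1x RTX 5090, Q4 (AWQ): 82 tok/s total, 115 ms to first token
t = per_user_latency(82, 10, 500, 0.115)
print(f"{t:.1f} s for a 500-token response")  # prints "61.1 s for a 500-token response"
```

In practice continuous batching makes the split uneven, so treat this as a lower bound on responsiveness rather than a measured number.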
Quality Benchmark Scores
| Benchmark | Qwen 2.5 72B | LLaMA 3.1 70B | Mistral Large 2 |
|---|---|---|---|
| MMLU | 85.8 | 82.0 | 84.2 |
| HumanEval | 79.4 | 72.5 | 76.1 |
| GSM8K | 90.1 | 85.2 | 87.8 |
| MT-Bench | 8.7 | 8.4 | 8.6 |
| C-Eval (Chinese) | 89.2 | 62.5 | 65.8 |
Qwen 2.5 72B outperforms both LLaMA 3.1 70B and Mistral Large 2 on these academic benchmarks while being significantly smaller than Mistral Large 2 (72B vs 123B parameters). Its C-Eval score of 89.2, more than 23 points ahead of either rival, makes it the default choice for Chinese-language and broader CJK deployments.
Function Calling and Tool Use
Qwen 2.5 includes native function-calling support that ranks among the best in open-source models. In April 2026 testing, it achieves 92% function-call accuracy on standardised tool-use benchmarks, making it an excellent backbone for AI agent frameworks.
The model handles complex multi-step function calls, parameter extraction from natural language, and structured JSON output generation reliably. For teams building agentic applications on private infrastructure, Qwen 2.5’s tool-use capabilities reduce the need for extensive prompt engineering.
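Even with high function-call accuracy, it is good practice to validate a tool call before dispatching it. A minimal sketch, assuming an OpenAI-style tool-call payload (the `get_weather` tool, its schema, and the sample payload are made-up examples, not part of the benchmark):

```python
import json

# Illustrative guardrail: validate a model-emitted tool call against a
# registry of known tools before executing anything.

TOOLS = {"get_weather": {"required": {"city"}}}  # hypothetical tool schema

def parse_tool_call(raw: str):
    call = json.loads(raw)
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = TOOLS[name]["required"] - args.keys()
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return name, args

# Example payload shaped like a model's tool-call output
name, args = parse_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Hangzhou"}}'
)
```

Rejecting malformed calls at this boundary keeps the remaining error rate from propagating into agent loops.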
Deployment Configurations
| Use Case | Hardware | Precision | Engine |
|---|---|---|---|
| Budget production | 1x RTX 5090 | Q4 | vLLM |
| High-quality production | 2x RTX 5090 | FP16 | vLLM |
| Agent workloads | 1x RTX 5090 | Q4 | vLLM + LangGraph |
| Development | 1x RTX 3090 | Q4 | Ollama |
Deploy Qwen 2.5 on Dedicated Hardware
Apache 2.0 licensed, multilingual excellence, and strong function-calling on your own GPU server. No per-token fees.
Browse GPU Servers
Performance Verdict
Qwen 2.5 72B is the strongest all-round model at the 70B parameter class in April 2026. It outscores LLaMA 3.1 70B on every benchmark while maintaining comparable deployment requirements. Its Apache 2.0 license, multilingual capabilities, and function-calling support make it the recommended choice for new self-hosted deployments unless you specifically need LLaMA’s broader ecosystem compatibility or DeepSeek V3’s higher ceiling.
For cost modelling, use the cost per million tokens calculator. Compare with other models in the LLM benchmark rankings and the best open source LLMs guide.
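The core of that cost model is a one-line conversion from hourly server price and measured throughput. A minimal sketch, assuming a hypothetical flat rate of $1.50/hr (substitute your actual GigaGPU pricing and measured tok/s):

```python
# Sketch of the cost-per-million-tokens conversion: hourly server price
# divided by tokens generated per hour, scaled to one million tokens.

def cost_per_million_tokens(server_usd_per_hour: float,
                            total_tok_s: float) -> float:
    tokens_per_hour = total_tok_s * 3600
    return server_usd_per_hour / tokens_per_hour * 1_000_000

# e.g. 82 tok/s aggregate (1x RTX 5090, Q4) at an assumed $1.50/hr
print(round(cost_per_million_tokens(1.50, 82), 2))  # prints 5.08
```

Because the rate is flat, cost per token falls linearly as utilisation rises, which is the main economic argument for self-hosting over per-token APIs.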