GPU Selection for Qwen 2.5
Qwen 2.5 from Alibaba Cloud is a strong multilingual LLM available in sizes from 0.5B to 72B. Choosing the right GPU on a dedicated GPU server depends on which Qwen 2.5 variant you plan to run:
| Qwen 2.5 Variant | FP16 VRAM | INT4 VRAM | Recommended GPU |
|---|---|---|---|
| Qwen 2.5 0.5B | ~1.5 GB | ~0.8 GB | RTX 3050 |
| Qwen 2.5 1.5B | ~3.5 GB | ~1.8 GB | RTX 4060 |
| Qwen 2.5 7B | ~15 GB | ~5.5 GB | RTX 4060 Ti (16 GB) |
| Qwen 2.5 14B | ~28 GB | ~9 GB | RTX 3090 (INT4) |
| Qwen 2.5 32B | ~64 GB | ~20 GB | RTX 3090 (INT4) or multi-GPU |
| Qwen 2.5 72B | ~144 GB | ~42 GB | Multi-GPU required |
For most deployments, the 7B variant on an RTX 4060 Ti or the 14B at INT4 on an RTX 3090 offers the best balance of quality and cost. For a full comparison with LLaMA, see our Qwen vs LLaMA 3 multilingual comparison.
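The VRAM figures above follow a simple rule of thumb: weight memory is roughly parameter count × bytes per parameter, plus KV cache and runtime buffers. A minimal sketch of that estimate (the 20% overhead factor is an assumed round number, not a measured value):

```python
def estimate_vram_gb(params_billions: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, scaled by an assumed
    overhead factor for KV cache and runtime buffers."""
    weight_gb = params_billions * bits / 8  # 1B params at 8 bits ~ 1 GB
    return round(weight_gb * overhead, 1)

# 7B at FP16: 14 GB of weights, ~16.8 GB with assumed overhead
print(estimate_vram_gb(7, bits=16))
# 14B at INT4: ~8.4 GB, in line with the ~9 GB in the table
print(estimate_vram_gb(14, bits=4))
```

Longer contexts grow the KV cache, so treat the overhead factor as a floor rather than a guarantee.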
Install and Serve with vLLM
```bash
# Install vLLM
pip install vllm

# Serve Qwen 2.5 7B Instruct
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Explain transformer attention mechanisms."}],
    "max_tokens": 512
  }'
```
vLLM natively supports Qwen 2.5 with continuous batching and PagedAttention. For a comparison of serving frameworks, see our vLLM vs Ollama guide.
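Because vLLM exposes an OpenAI-compatible API, you can call it from any HTTP client. A minimal stdlib sketch that mirrors the curl request (assumes the server is running on localhost:8000; `chat` and `build_payload` are hypothetical helper names, not part of vLLM):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    # Same request shape as the curl test above
    return {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, url: str = "http://localhost:8000/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-style response: the first choice holds the assistant message
    return body["choices"][0]["message"]["content"]
```

The official `openai` Python client works the same way; point its `base_url` at the vLLM server.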
Quick Start with Ollama
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Qwen 2.5 7B
ollama run qwen2.5:7b-instruct

# Serve as API
ollama serve &
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b-instruct", "prompt": "Translate this to Chinese: Hello world"}'
```
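By default, `/api/generate` streams newline-delimited JSON, one chunk per line, each carrying a `response` fragment and a `done` flag. A small sketch that reassembles the streamed text (the sample input is illustrative, not real server output):

```python
import json

def join_stream(ndjson_text: str) -> str:
    """Concatenate the 'response' fields of an Ollama NDJSON stream."""
    parts = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk signals end of generation
            break
    return "".join(parts)

# Illustrative stream for the translation prompt above
sample = "\n".join([
    '{"response": "你好", "done": false}',
    '{"response": "，世界", "done": false}',
    '{"response": "", "done": true}',
])
print(join_stream(sample))  # 你好，世界
```

Pass `"stream": false` in the request body if you prefer a single JSON response instead.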
Performance Benchmarks
Benchmarked with vLLM, 512-token input, 256-token output on various GPUs. See the tokens-per-second benchmark tool for current data.
| Model | GPU | Precision | Gen tok/s | TTFT |
|---|---|---|---|---|
| Qwen 2.5 7B | RTX 4060 Ti | FP16 | 88 | 185 ms |
| Qwen 2.5 7B | RTX 3090 | FP16 | 96 | 168 ms |
| Qwen 2.5 7B | RTX 4060 | AWQ 4-bit | 112 | 145 ms |
| Qwen 2.5 14B | RTX 3090 | AWQ 4-bit | 72 | 225 ms |
| Qwen 2.5 32B | RTX 3090 | AWQ 4-bit | 38 | 410 ms |
Qwen 2.5 7B delivers throughput competitive with LLaMA 3 8B on the same hardware, with the added benefit of strong multilingual capability across Chinese, Japanese, and Korean.
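From the table, end-to-end latency can be approximated as TTFT plus decode time (output tokens ÷ generation speed). A quick sanity check against the 7B FP16 row on the RTX 4060 Ti:

```python
def e2e_latency_s(ttft_ms: float, out_tokens: int, gen_tok_s: float) -> float:
    # TTFT covers prefill; decode adds one step per generated token
    return ttft_ms / 1000 + out_tokens / gen_tok_s

# 256-token response at 88 tok/s with 185 ms TTFT
print(round(e2e_latency_s(185, 256, 88), 2))  # 3.09 (seconds)
```

This ignores queueing under concurrent load, so treat it as a single-request lower bound.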
Optimisation Tips
- Use the 7B variant for most tasks. It scores within 3-5% of the 14B on most English benchmarks and runs on budget GPUs.
- AWQ 4-bit quantisation is recommended for the 14B and 32B variants to fit on single consumer GPUs.
- Enable continuous batching in vLLM for production multi-user serving.
- For code-specific tasks, use Qwen 2.5 Coder, which outperforms the base model on HumanEval and MBPP.
- Set context to 8K tokens for balanced VRAM usage. Qwen supports 128K but requires proportionally more memory.
Estimate your costs with the cost-per-million-tokens calculator. Read the self-host LLM guide for production deployment details.
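The arithmetic behind such a cost estimate is simple: hourly server price divided by tokens generated per hour. A sketch with a hypothetical $0.50/hour server price (assumes sustained generation at full utilisation, which real workloads rarely achieve):

```python
def cost_per_million_tokens(gpu_hourly_usd: float, gen_tok_s: float) -> float:
    # Assumes the GPU generates tokens continuously at the benchmark rate
    tokens_per_hour = gen_tok_s * 3600
    return round(gpu_hourly_usd / tokens_per_hour * 1_000_000, 2)

# Hypothetical $0.50/hr server sustaining 88 tok/s
print(cost_per_million_tokens(0.50, 88))  # 1.58 (USD per million output tokens)
```

Batched serving raises aggregate tokens per second well above the single-stream figures, so per-token cost drops further under multi-user load.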
Next Steps
Qwen 2.5 is an excellent choice for multilingual and coding workloads. For English-focused deployments, compare with LLaMA 3 hosting. Browse GPU options with the GPU comparisons tool, or check all available models in the model guides section.
Deploy Qwen 2.5 Now
Run Qwen 2.5 on a dedicated GPU server with full root access. From 7B to 72B, choose the configuration that fits your workload.
Browse GPU Servers