Qwen 2.5 32B sits in an awkward but important band: too large for most consumer cards, but a lot cheaper to host than Llama 3.1 70B while matching or beating it on many benchmarks (particularly coding and maths). This guide gives exact VRAM numbers for FP16, FP8, and AWQ INT4; identifies which GPUs fit (it will not run on a 16 GB RTX 5060 Ti); and explains when 32B is worth the jump, all on our UK dedicated GPU hosting.
Contents
- Weight size at each precision
- KV cache maths
- Which GPUs fit
- When 32B beats 14B
- Expected throughput
- Choosing precision
Weight size at each precision
Qwen 2.5 32B has 32.5B parameters. The weight budget per precision is straightforward:
| Precision | Bytes/param | Weight size | Quality drop vs FP16 |
|---|---|---|---|
| FP16 / BF16 | 2.0 | 65.0 GB | baseline |
| FP8 (E4M3) | 1.0 | 32.5 GB | < 0.3 on MMLU |
| AWQ INT4, g=128 | 0.53 | 17.2 GB | ~1.0 on MMLU |
| GPTQ INT4, g=128 | 0.55 | 17.9 GB | ~1.2 on MMLU |
| GGUF Q4_K_M | 0.60 | 19.5 GB | ~0.9 on MMLU |
| GGUF Q5_K_M | 0.71 | 23.1 GB | ~0.4 on MMLU |
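The table above is just parameters × bytes-per-parameter; a minimal sketch of the arithmetic (the effective bytes-per-parameter figures for the group-size-128 quant formats include scale/zero-point overhead, as in the table):

```python
# Weights-only footprint for Qwen 2.5 32B at each precision.
PARAMS = 32.5e9  # Qwen 2.5 32B parameter count

BYTES_PER_PARAM = {
    "fp16": 2.0,            # baseline
    "fp8_e4m3": 1.0,
    "awq_int4_g128": 0.53,  # includes group-128 scale/zero overhead
    "gguf_q4_k_m": 0.60,
}

def weight_gb(precision: str) -> float:
    """Weights-only size in decimal GB; KV cache and activations are extra."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9

print(f"FP16: {weight_gb('fp16'):.1f} GB")              # 65.0 GB
print(f"AWQ INT4: {weight_gb('awq_int4_g128'):.1f} GB") # 17.2 GB
```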
KV cache maths
Qwen 2.5 32B has 64 layers, 8 KV heads (GQA), and a head dimension of 128. KV per token = 2 (K and V) * 2 bytes (FP16) * 64 layers * 8 heads * 128 dims = 262,144 bytes ≈ 262 KB. Per sequence:
| Context | KV/sequence | KV × 4 users | Activation overhead on top of weights |
|---|---|---|---|
| 4,096 | 1.1 GB | 4.3 GB | +1 GB |
| 8,192 | 2.1 GB | 8.6 GB | +1.5 GB |
| 32,768 | 8.6 GB | 34.4 GB | +2.5 GB |
| 131,072 | 34.4 GB | 137.4 GB | +3.5 GB |
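The figures above follow directly from the attention geometry; a minimal sketch of the calculation:

```python
# KV-cache size for Qwen 2.5 32B: 2 tensors (K and V) x dtype bytes
# x layers x KV heads x head dim, scaled by context length and batch.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

def kv_bytes_per_token(dtype_bytes: float = 2.0) -> int:
    """Bytes of KV cache per token (FP16 cache by default)."""
    return int(2 * dtype_bytes * LAYERS * KV_HEADS * HEAD_DIM)

def kv_gb(context: int, batch: int = 1, dtype_bytes: float = 2.0) -> float:
    """KV-cache footprint in decimal GB for a given context and batch."""
    return kv_bytes_per_token(dtype_bytes) * context * batch / 1e9

print(kv_bytes_per_token())               # 262144 bytes ≈ 262 KB/token
print(f"{kv_gb(8192):.1f} GB")            # 2.1 GB per 8k sequence
print(f"{kv_gb(32768, batch=4):.1f} GB")  # 34.4 GB for 4 users at 32k
```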
Which GPUs fit Qwen 2.5 32B
| GPU | VRAM | FP16 fit? | FP8 fit? | AWQ INT4 fit? | Verdict |
|---|---|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | No | No | No (17.2 GB weights exceed 16 GB) | Not usable |
| RTX 3090 24GB | 24 GB | No | No | Yes, 8k ctx | AWQ only, no FP8 |
| RTX 4090 24GB | 24 GB | No | No | Yes, 16k ctx | Works for AWQ |
| RTX 5090 32GB | 32 GB | No | No (32.5 GB weights alone exceed 32 GB) | Yes, 32k ctx, or bs=4 at 8k | AWQ INT4 only |
| RTX 6000 Pro 96GB | 96 GB | Yes, 16k ctx | Yes, 128k ctx | Yes, 128k ctx, or bs=4 at 32k | Comfortable |
| A100 80GB | 80 GB | Yes, 4k ctx | Yes (weight-only; no FP8 tensor cores), 64k ctx | Yes, 128k ctx | Production-grade |
| H100 80GB | 80 GB | Yes, 8k ctx | Yes, 128k ctx | Yes, 128k ctx, or bs=4 at 32k | Throughput leader |
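The fit test behind this table is simple: weights + KV cache + a fixed activation/runtime allowance must stay under usable VRAM. A sketch, where the 0.95 usable fraction and 1 GB overhead are our assumptions, not vLLM defaults; tune both for your serving stack:

```python
# Rough "does it fit?" check: weights + KV + fixed overhead vs usable VRAM.
KV_BYTES_PER_TOKEN = 262_144  # Qwen 2.5 32B, FP16 KV cache

def fits(vram_gb: float, weights_gb: float, context: int, batch: int = 1,
         usable_fraction: float = 0.95, overhead_gb: float = 1.0) -> bool:
    """True if the model + KV cache fit in usable VRAM (assumed figures)."""
    kv_gb = KV_BYTES_PER_TOKEN * context * batch / 1e9
    return weights_gb + kv_gb + overhead_gb <= vram_gb * usable_fraction

# RTX 4090 (24 GB) with AWQ INT4 weights (17.2 GB):
print(fits(24, 17.2, context=16_384))  # True — 16k ctx fits, just
print(fits(24, 65.0, context=4_096))   # False — FP16 never fits in 24 GB
```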
When 32B beats 14B
Qwen 2.5 14B is excellent value on a 16 GB card (see our Qwen 14B benchmark). 32B pulls ahead meaningfully in three areas:
| Benchmark | Qwen 2.5 14B | Qwen 2.5 32B | Llama 3.1 70B |
|---|---|---|---|
| MMLU | 79.7 | 83.3 | 83.6 |
| HumanEval | 83.5 | 88.4 | 80.5 |
| MATH | 55.6 | 65.9 | 68.0 |
| IFEval | 74.7 | 79.5 | 87.5 |
| GPQA | 38.4 | 49.5 | 48.0 |
If your workload is coding or maths-heavy, Qwen 2.5 32B is often the sweet spot: it beats Llama 70B on HumanEval while needing half the VRAM.
Expected throughput
Measured with vLLM 0.6, 2,048 output tokens, at batch sizes 1 and 8:
| GPU | Precision | Tokens/s (bs=1) | Tokens/s (bs=8) |
|---|---|---|---|
| RTX 4090 24GB | AWQ INT4 | 38 | ~105 |
| RTX 5090 32GB | AWQ INT4 | 70 | ~260 |
| RTX 6000 Pro 96GB | FP8 | 50 | ~320 |
| A100 80GB | AWQ INT4 | 48 | ~260 |
| H100 80GB | FP8 | 85 | ~520 |
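The single-stream figures translate directly into response latency. A sketch that ignores prefill time, so treat the results as lower bounds:

```python
# Decode-only latency estimate from a sustained tokens/s figure.
def decode_seconds(output_tokens: int, tokens_per_s: float) -> float:
    """Seconds to stream the full reply (prefill time not included)."""
    return output_tokens / tokens_per_s

# A 2,048-token reply on an RTX 5090 running AWQ INT4 (~70 t/s):
print(f"{decode_seconds(2048, 70):.0f} s")  # 29 s
# The same reply on an H100 at FP8 (~85 t/s):
print(f"{decode_seconds(2048, 85):.0f} s")  # 24 s
```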
Choosing precision
- AWQ INT4: lowest VRAM, ~70 t/s on 5090; use when budget trumps last-mile quality.
- FP8: Blackwell/Hopper native; use when you can afford the memory and want full quality.
- FP16: only worthwhile on H100/A100 80GB or 6000 Pro if you are fine-tuning or serving in research mode.
For 70B comparisons see Llama 3 70B INT4 VRAM; for smaller sizes see 8B LLM VRAM requirements.
Host Qwen 2.5 32B on the right card
RTX 5090 for AWQ, RTX 6000 Pro 96GB for FP8 at 128k context. UK dedicated hosting.
Browse dedicated GPU hosting

See also: Qwen 14B on 5060 Ti, upgrade to RTX 5090, upgrade to RTX 6000 Pro, 70B VRAM requirements, Qwen Coder 14B.