Qwen 2.5 32B sits in an awkward but important band: too large for most consumer cards, but a lot cheaper to host than Llama 3.1 70B while matching or beating it on many benchmarks (particularly coding and maths). This guide gives exact VRAM numbers for FP16, FP8, and AWQ INT4; identifies which GPUs fit (it will not run on a 16 GB RTX 5060 Ti); and explains when 32B is worth the jump, all on our UK dedicated GPU hosting.
Contents
- Weight size at each precision
- KV cache maths
- Which GPUs fit
- When 32B beats 14B
- Expected throughput
- Choosing precision
Weight size at each precision
Qwen 2.5 32B has 32.5B parameters. The weight budget per precision is straightforward:
| Precision | Bytes/param | Weight size | Quality drop vs FP16 |
|---|---|---|---|
| FP16 / BF16 | 2.0 | 65.0 GB | baseline |
| FP8 (E4M3) | 1.0 | 32.5 GB | < 0.3 on MMLU |
| AWQ INT4, g=128 | 0.53 | 17.2 GB | ~1.0 on MMLU |
| GPTQ INT4, g=128 | 0.55 | 17.9 GB | ~1.2 on MMLU |
| GGUF Q4_K_M | 0.60 | 19.5 GB | ~0.9 on MMLU |
| GGUF Q5_K_M | 0.71 | 23.1 GB | ~0.4 on MMLU |
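The table above is just parameters × bytes-per-parameter; a minimal sketch of the arithmetic (the effective bytes-per-parameter figures for the group-size-128 quant formats include scale/zero-point overhead, as in the table):

```python
# Weights-only footprint for Qwen 2.5 32B at each precision.
PARAMS = 32.5e9  # Qwen 2.5 32B parameter count

BYTES_PER_PARAM = {
    "fp16": 2.0,            # baseline
    "fp8_e4m3": 1.0,
    "awq_int4_g128": 0.53,  # includes group-128 scale/zero overhead
    "gguf_q4_k_m": 0.60,
}

def weight_gb(precision: str) -> float:
    """Weights-only size in decimal GB; KV cache and activations are extra."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9

print(f"FP16: {weight_gb('fp16'):.1f} GB")              # 65.0 GB
print(f"AWQ INT4: {weight_gb('awq_int4_g128'):.1f} GB") # 17.2 GB
```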
KV cache maths
Qwen 2.5 32B has 64 layers, 8 KV heads (GQA), and a head dimension of 128. KV per token = 2 (K and V) * 2 bytes (FP16) * 64 layers * 8 heads * 128 dims = 262,144 bytes ≈ 262 KB. Per sequence:
| Context | KV/sequence | KV × 4 users | Activation overhead on top of weights |
|---|---|---|---|
| 4,096 | 1.1 GB | 4.3 GB | +1 GB |
| 8,192 | 2.1 GB | 8.6 GB | +1.5 GB |
| 32,768 | 8.6 GB | 34.4 GB | +2.5 GB |
| 131,072 | 34.4 GB | 137.4 GB | +3.5 GB |
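The figures above follow directly from the attention geometry; a minimal sketch of the calculation:

```python
# KV-cache size for Qwen 2.5 32B: 2 tensors (K and V) x dtype bytes
# x layers x KV heads x head dim, scaled by context length and batch.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

def kv_bytes_per_token(dtype_bytes: float = 2.0) -> int:
    """Bytes of KV cache per token (FP16 cache by default)."""
    return int(2 * dtype_bytes * LAYERS * KV_HEADS * HEAD_DIM)

def kv_gb(context: int, batch: int = 1, dtype_bytes: float = 2.0) -> float:
    """KV-cache footprint in decimal GB for a given context and batch."""
    return kv_bytes_per_token(dtype_bytes) * context * batch / 1e9

print(kv_bytes_per_token())               # 262144 bytes ≈ 262 KB/token
print(f"{kv_gb(8192):.1f} GB")            # 2.1 GB per 8k sequence
print(f"{kv_gb(32768, batch=4):.1f} GB")  # 34.4 GB for 4 users at 32k
```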
Which GPUs fit Qwen 2.5 32B
| GPU | VRAM | FP16 fit? | FP8 fit? | AWQ INT4 fit? | Verdict |
|---|---|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | No | No | No (17.2 GB weights exceed 16 GB) | Not usable |
| RTX 3090 24GB | 24 GB | No | No | Yes, 8k ctx | AWQ only, no FP8 |
| RTX 4090 24GB | 24 GB | No | No | Yes, 16k ctx | Works for AWQ |
| RTX 5090 32GB | 32 GB | No | No (32.5 GB weights alone exceed 32 GB) | Yes, 32k ctx, or bs=4 at 8k | AWQ INT4 only |
| RTX 6000 Pro 96GB | 96 GB | Yes, 16k ctx | Yes, 128k ctx | Yes, 128k ctx, or bs=4 at 32k | Comfortable |
| A100 80GB | 80 GB | Yes, 4k ctx | Yes (weight-only; no FP8 tensor cores), 64k ctx | Yes, 128k ctx | Production-grade |
| H100 80GB | 80 GB | Yes, 8k ctx | Yes, 128k ctx | Yes, 128k ctx, or bs=4 at 32k | Throughput leader |
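The fit test behind this table is simple: weights + KV cache + a fixed activation/runtime allowance must stay under usable VRAM. A sketch, where the 0.95 usable fraction and 1 GB overhead are our assumptions, not vLLM defaults; tune both for your serving stack:

```python
# Rough "does it fit?" check: weights + KV + fixed overhead vs usable VRAM.
KV_BYTES_PER_TOKEN = 262_144  # Qwen 2.5 32B, FP16 KV cache

def fits(vram_gb: float, weights_gb: float, context: int, batch: int = 1,
         usable_fraction: float = 0.95, overhead_gb: float = 1.0) -> bool:
    """True if the model + KV cache fit in usable VRAM (assumed figures)."""
    kv_gb = KV_BYTES_PER_TOKEN * context * batch / 1e9
    return weights_gb + kv_gb + overhead_gb <= vram_gb * usable_fraction

# RTX 4090 (24 GB) with AWQ INT4 weights (17.2 GB):
print(fits(24, 17.2, context=16_384))  # True — 16k ctx fits, just
print(fits(24, 65.0, context=4_096))   # False — FP16 never fits in 24 GB
```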
When 32B beats 14B
Qwen 2.5 14B is excellent value on a 16 GB card (see our Qwen 14B benchmark). 32B pulls ahead meaningfully in three areas:
| Benchmark | Qwen 2.5 14B | Qwen 2.5 32B | Llama 3.1 70B |
|---|---|---|---|
| MMLU | 79.7 | 83.3 | 83.6 |
| HumanEval | 83.5 | 88.4 | 80.5 |
| MATH | 55.6 | 65.9 | 68.0 |
| IFEval | 74.7 | 79.5 | 87.5 |
| GPQA | 38.4 | 49.5 | 48.0 |
If your workload is coding or maths-heavy, Qwen 2.5 32B is often the sweet spot: it beats Llama 70B on HumanEval while needing half the VRAM.
Expected throughput
Measured with vLLM 0.6, 2,048 output tokens, at batch sizes 1 and 8:
| GPU | Precision | Tokens/s (bs=1) | Tokens/s (bs=8) |
|---|---|---|---|
| RTX 4090 24GB | AWQ INT4 | 38 | ~105 |
| RTX 5090 32GB | AWQ INT4 | 70 | ~260 |
| RTX 6000 Pro 96GB | FP8 | 50 | ~320 |
| A100 80GB | AWQ INT4 | 48 | ~260 |
| H100 80GB | FP8 | 85 | ~520 |
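The single-stream figures translate directly into response latency. A sketch that ignores prefill time, so treat the results as lower bounds:

```python
# Decode-only latency estimate from a sustained tokens/s figure.
def decode_seconds(output_tokens: int, tokens_per_s: float) -> float:
    """Seconds to stream the full reply (prefill time not included)."""
    return output_tokens / tokens_per_s

# A 2,048-token reply on an RTX 5090 running AWQ INT4 (~70 t/s):
print(f"{decode_seconds(2048, 70):.0f} s")  # 29 s
# The same reply on an H100 at FP8 (~85 t/s):
print(f"{decode_seconds(2048, 85):.0f} s")  # 24 s
```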
Choosing precision
- AWQ INT4: lowest VRAM, ~70 t/s on 5090; use when budget trumps last-mile quality.
- FP8: Blackwell/Hopper native; use when you can afford the memory and want full quality.
- FP16: only worthwhile on H100/A100 80GB or 6000 Pro if you are fine-tuning or serving in research mode.
For 70B comparisons see Llama 3 70B INT4 VRAM; for smaller sizes see 8B LLM VRAM requirements.
Host Qwen 2.5 32B on the right card
RTX 5090 for AWQ, RTX 6000 Pro 96GB for FP8 at 128k context. UK dedicated hosting.
Browse dedicated GPU hosting

See also: Qwen 14B on 5060 Ti, upgrade to RTX 5090, upgrade to RTX 6000 Pro, 70B VRAM requirements, Qwen Coder 14B.