
Qwen 2.5 32B VRAM Requirements: FP16, FP8 and AWQ INT4 Numbers

Exact VRAM footprint for Qwen 2.5 32B at FP16, FP8 and AWQ INT4, plus which GPUs fit and when a 32B is worth the jump from 14B.

Qwen 2.5 32B sits in an awkward but important band: too large for most consumer cards, yet far cheaper to host than Llama 3.1 70B while matching or beating it on many benchmarks (particularly coding and maths). This guide gives exact VRAM numbers for FP16, FP8, and AWQ INT4; identifies which GPUs fit (it will not run on a 16 GB RTX 5060 Ti); and explains when 32B is worth the jump from 14B, with every configuration available on our UK dedicated GPU hosting.

Contents

  • Weight size at each precision
  • KV cache maths
  • Which GPUs fit Qwen 2.5 32B
  • When 32B beats 14B
  • Expected throughput
  • Choosing precision

Weight size at each precision

Qwen 2.5 32B has 32.5B parameters. The weight budget per precision is straightforward:

Precision         | Bytes/param | Weight size | Quality drop vs FP16
FP16 / BF16       | 2.00        | 65.0 GB     | baseline
FP8 (E4M3)        | 1.00        | 32.5 GB     | < 0.3 on MMLU
AWQ INT4, g=128   | 0.53        | 17.2 GB     | ~1.0 on MMLU
GPTQ INT4, g=128  | 0.55        | 17.9 GB     | ~1.2 on MMLU
GGUF Q4_K_M       | 0.60        | 19.5 GB     | ~0.9 on MMLU
GGUF Q5_K_M       | 0.71        | 23.1 GB     | ~0.4 on MMLU
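
The table is nothing more than parameter count times bytes per parameter. A minimal sketch reproducing the weight column, using the per-precision byte counts above (the INT4 and GGUF figures fold in scale and zero-point overhead):

```python
# Weight footprint = parameter count x bytes per parameter.
PARAMS = 32.5e9  # Qwen 2.5 32B

BYTES_PER_PARAM = {
    "FP16 / BF16":      2.00,
    "FP8 (E4M3)":       1.00,
    "AWQ INT4, g=128":  0.53,  # 4-bit weights plus group scales/zeros
    "GPTQ INT4, g=128": 0.55,
    "GGUF Q4_K_M":      0.60,
    "GGUF Q5_K_M":      0.71,
}

for name, bpp in BYTES_PER_PARAM.items():
    print(f"{name:18s} {PARAMS * bpp / 1e9:5.1f} GB")
```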

KV cache maths

Qwen 2.5 32B has 64 layers, 8 KV heads (GQA), and a head dimension of 128. KV per token = 2 (K and V) * 2 bytes (FP16) * 64 layers * 8 heads * 128 head-dim = 262,144 bytes ≈ 262 KB. Per sequence (the sketch after the table recomputes these):

Context | KV/sequence | KV * 4 users | Practical add on top of weights
4,096   | 1.1 GB      | 4.3 GB       | +1 GB activations
8,192   | 2.1 GB      | 8.6 GB       | +1.5 GB
32,768  | 8.6 GB      | 34.4 GB      | +2.5 GB
131,072 | 34.4 GB     | 137.4 GB     | +3.5 GB
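
To recompute these for a different context length or user count, here is a minimal sketch of the same arithmetic:

```python
# KV cache bytes/token = 2 tensors (K, V) x 2 bytes (FP16) x layers x KV heads x head-dim.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
KV_PER_TOKEN = 2 * 2 * LAYERS * KV_HEADS * HEAD_DIM  # 262,144 bytes

def kv_gb(context: int, users: int = 1) -> float:
    """FP16 KV cache in GB for `users` full-length sequences."""
    return context * users * KV_PER_TOKEN / 1e9

for ctx in (4_096, 8_192, 32_768, 131_072):
    print(f"{ctx:>7,} ctx: {kv_gb(ctx):5.1f} GB/seq, {kv_gb(ctx, 4):6.1f} GB for 4 users")
```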

Which GPUs fit Qwen 2.5 32B

GPU               | VRAM  | FP16 fit?    | FP8 fit?                       | AWQ INT4 fit?        | Verdict
RTX 5060 Ti 16GB  | 16 GB | No           | No                             | No (17.2 GB weights) | Not usable
RTX 3090 24GB     | 24 GB | No           | No                             | Yes, 8k ctx          | AWQ only, no FP8
RTX 4090 24GB     | 24 GB | No           | No                             | Yes, 16k ctx         | Works for AWQ
RTX 5090 32GB     | 32 GB | No           | No (weights alone are 32.5 GB) | Yes, 64k ctx, bs=4   | AWQ only, best value
RTX 6000 Pro 96GB | 96 GB | Yes, 16k ctx | Yes, 128k ctx, bs=8            | Yes, 128k ctx, bs=32 | Comfortable
A100 80GB         | 80 GB | Yes, 4k ctx  | Yes, 64k ctx, bs=4             | Yes, 128k ctx, bs=16 | Production-grade
H100 80GB         | 80 GB | Yes, 8k ctx  | Yes, 128k ctx, bs=8            | Yes, 128k ctx, bs=32 | Throughput leader

Context and batch figures assume a shared paged-KV pool (vLLM-style): the quoted context is the maximum model length at that concurrency, not every sequence at full length simultaneously.
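
Putting weights and KV together gives a quick fit check. A rough sketch; the fixed 2 GB term for activations and CUDA context is an assumption, and real headroom varies by serving stack:

```python
KV_PER_TOKEN = 262_144  # bytes/token at FP16, from the KV cache section

def max_kv_tokens(vram_gb: float, weights_gb: float, overhead_gb: float = 2.0) -> int:
    """Total in-flight tokens whose KV fits after weights plus fixed overhead."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    return max(int(free_bytes / KV_PER_TOKEN), 0)

print(max_kv_tokens(24, 17.2))  # RTX 4090 + AWQ: ~18k tokens (the table's 16k, with margin)
print(max_kv_tokens(32, 17.2))  # RTX 5090 + AWQ: ~49k tokens shared across the batch
print(max_kv_tokens(32, 32.5))  # RTX 5090 + FP8: 0 -- the weights alone do not fit
```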

When 32B beats 14B

Qwen 2.5 14B is excellent value on a 16 GB card (see our Qwen 14B benchmark). 32B pulls ahead meaningfully on coding, maths, and graduate-level reasoning:

Benchmark | Qwen 2.5 14B | Qwen 2.5 32B | Llama 3.1 70B
MMLU      | 79.7         | 83.3         | 83.6
HumanEval | 83.5         | 88.4         | 80.5
MATH      | 55.6         | 65.9         | 68.0
IFEval    | 74.7         | 79.5         | 87.5
GPQA      | 38.4         | 49.5         | 48.0

If your workload is coding or maths-heavy, Qwen 2.5 32B is often the sweet spot: it beats Llama 70B on HumanEval while needing half the VRAM.

Expected throughput

Measured with vLLM 0.6, 2k output tokens, at batch sizes 1 and 8:

GPU               | Precision | Tokens/s (bs=1) | Tokens/s (bs=8)
RTX 4090 24GB     | AWQ INT4  | 38              | ~105
RTX 5090 32GB     | AWQ INT4  | 70              | ~260
RTX 6000 Pro 96GB | FP8       | 50              | ~320
A100 80GB         | AWQ INT4  | 48              | ~260
H100 80GB         | FP8       | 85              | ~520
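
To reproduce numbers like these yourself, a minimal offline probe using vLLM's Python API; the model ID and settings are illustrative, assuming the public AWQ build of the checkpoint:

```python
import time
from vllm import LLM, SamplingParams

# Illustrative settings -- size max_model_len and the memory fraction to your card.
llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-AWQ", quantization="awq",
          max_model_len=16384, gpu_memory_utilization=0.92)

prompts = ["Write a binary search in Python."] * 8          # bs=8 run
params = SamplingParams(max_tokens=2048, ignore_eos=True)   # force 2k output tokens

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{tokens / elapsed:.0f} tokens/s aggregate at bs=8")
```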

Choosing precision

  • AWQ INT4: lowest VRAM, ~70 t/s on a 5090; use when budget trumps last-mile quality.
  • FP8: Blackwell/Hopper native and near-lossless (< 0.3 MMLU drop), but the weights alone are 32.5 GB, so plan for a 40 GB+ card.
  • FP16: only worthwhile on an 80 GB A100/H100 or the RTX 6000 Pro, and mainly if you are fine-tuning or serving in research mode.
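
If you serve with vLLM, the three options map to load arguments roughly as below. A minimal sketch assuming the public Qwen/Qwen2.5-32B-Instruct checkpoints; pick one per process:

```python
from vllm import LLM

# AWQ INT4: pre-quantized checkpoint, fits 24 GB cards.
llm = LLM("Qwen/Qwen2.5-32B-Instruct-AWQ", quantization="awq")

# FP8: vLLM quantizes the FP16 weights on the fly (Hopper/Blackwell native).
# llm = LLM("Qwen/Qwen2.5-32B-Instruct", quantization="fp8")

# FP16/BF16: full weights; needs an 80 GB+ card.
# llm = LLM("Qwen/Qwen2.5-32B-Instruct", dtype="bfloat16")
```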

For 70B comparisons see Llama 3 70B INT4 VRAM; for smaller sizes see 8B LLM VRAM requirements.

Host Qwen 2.5 32B on the right card

RTX 5090 for AWQ, RTX 6000 Pro 96GB for FP8 at 128k context. UK dedicated hosting.

Browse dedicated GPU hosting

See also: Qwen 14B on 5060 Ti, upgrade to RTX 5090, upgrade to RTX 6000 Pro, 70B VRAM requirements, Qwen Coder 14B.
