Qwen 2.5 32B-Instruct is the largest open-weight dense model that runs on a single RTX 4090 24GB without offloading – and it ranks alongside Llama 3.1 70B INT4 on most benchmarks despite using a single card instead of two. The trick is AWQ INT4 quantisation: weights compress to roughly 18 GB, leaving 6 GB for KV cache, paged-attention blocks and CUDA workspace. The RTX 4090 24GB is the cheapest path to a 32B-class assistant, and on our UK GPU hosting it delivers around 65 t/s decode with stable production behaviour at modest concurrency. This guide details architecture, the precise VRAM math that makes the fit work, throughput, the seven gotchas that will trip you up, and where Qwen 2.5 32B beats both larger models and the obvious alternatives.
Contents
- Why this is the tightest fit on a 4090
- Architecture and KV math
- VRAM budget breakdown
- Throughput, latency and concurrency limits
- Reasoning quality vs alternatives
- vLLM deployment and code
- Production gotchas (seven items)
- When to pick this over alternatives
- Verdict
Why this is the tightest fit on a 4090
FP16 weights are 64 GB – completely out of reach. FP8 weights are 32 GB – 8 GB over the 4090’s envelope, no good even with aggressive offload because every offloaded layer kills decode throughput. AWQ INT4 brings the model to 18 GB, leaving 6 GB. That 6 GB has to absorb KV cache, vLLM scheduler state, CUDA graphs, paged-attention blocks and any activation peaks. Get the configuration right and it works; push too hard on context length or concurrency and you OOM in the first hour of production traffic.
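The arithmetic behind those three numbers is easy to sanity-check. A minimal sketch, assuming a parameter count of roughly 32.5B and folding quantisation scales and higher-precision embeddings into a flat overhead term – both figures are approximations for illustration, not measured values:

```python
# Rough weight-footprint estimate. The 32.5e9 parameter count and the 1.7 GB
# INT4 overhead (scales, zero-points, embeddings kept at higher precision)
# are assumptions for illustration, not measured values.
PARAMS = 32.5e9

def weight_gb(bits_per_weight: float, overhead_gb: float = 0.0) -> float:
    """Approximate weight memory in decimal GB at a given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9 + overhead_gb

print(f"FP16 : {weight_gb(16):.0f} GB")       # ~64-65 GB - multiple cards
print(f"FP8  : {weight_gb(8):.0f} GB")        # ~32-33 GB - over the 24 GB envelope
print(f"INT4 : {weight_gb(4, 1.7):.0f} GB")   # ~18 GB - fits with ~6 GB left
```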
This is the only 32B-class dense model where consumer-card deployment is genuinely viable in 2026. The next step up is renting an RTX 5090 32GB (which fits FP8 with 0 GB headroom – so still requires AWQ for production) or paying considerably more for an H100 80GB.
Architecture and KV math
Qwen 2.5 32B-Instruct is a 64-layer transformer with 40 query heads, 8 key/value heads (5:1 GQA), 128-dim heads, a SwiGLU MLP with 27,648 intermediate dim and a 152,064-token vocabulary. Native context is 32,768; YaRN extends it to 131,072, but quality at the upper end is weak relative to the 7B/14B – we strongly recommend 16k or 32k for production. The license is Apache 2.0 (the 100M-MAU Qwen license clause applies to the 3B and 72B variants, not the 32B), so commercial use carries no fee.
KV cache per token at FP16: 8 KV heads x 128 dim x 64 layers x 2 (K+V) x 2 bytes = 262 KB. At FP8 it halves to 131 KB. At FP8 and 16k context that is 2.1 GB, which is the bulk of your headroom budget. This is why FP8 KV is mandatory and why 16k is the realistic ceiling for production concurrency.
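As a worked check, the per-token cost and the 16k FP8 figure fall straight out of the architecture numbers above:

```python
# KV-cache arithmetic from the architecture description above.
KV_HEADS, HEAD_DIM, LAYERS = 8, 128, 64

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # K and V each store kv_heads * head_dim values in every layer.
    return KV_HEADS * HEAD_DIM * LAYERS * 2 * dtype_bytes

print(kv_bytes_per_token(2))                 # 262,144 bytes/token at FP16
print(kv_bytes_per_token(1))                 # 131,072 bytes/token at FP8
print(kv_bytes_per_token(1) * 16_384 / 1e9)  # ~2.1 GB for 16k of FP8 KV
```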
VRAM budget breakdown
| Component | Size |
|---|---|
| AWQ INT4 weights | 18.0 GB |
| KV cache @ 16k FP8, 1 seq | 2.1 GB |
| vLLM scheduler + paged blocks | 1.2 GB |
| CUDA workspace (eager mode) | 0.6 GB |
| Activations peak | 0.8 GB |
| Total at batch 1, 16k | 22.7 GB |
| Free under 24 GB cap | 1.3 GB |

| Concurrency | Avg ctx per seq | KV total | Verdict |
|---|---|---|---|
| 1 seq | 16k | 2.1 GB | Comfortable |
| 4 seqs | 4k | 2.1 GB | Comfortable |
| 8 seqs | 4k | 4.2 GB | Tight – cap max-num-seqs at 8 |
| 16 seqs | 2k | 4.2 GB | Tight – works in eager mode only |
| 1 seq | 32k | 4.2 GB | Tight – works without batching |
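The KV-total column follows from the same per-token figure: total FP8 KV grows linearly with concurrent sequences times their average context. A quick helper to reproduce the numbers (the verdict thresholds are this guide's judgement, not something the maths decides):

```python
# Reproduces the KV-total column above.
KV_PER_TOKEN_FP8 = 131_072  # bytes, from the KV math section

def kv_total_gb(num_seqs: int, avg_ctx: int) -> float:
    return num_seqs * avg_ctx * KV_PER_TOKEN_FP8 / 1e9

for seqs, ctx in [(1, 16_384), (4, 4_096), (8, 4_096), (16, 2_048), (1, 32_768)]:
    print(f"{seqs:>2} seqs x {ctx:>6} ctx -> {kv_total_gb(seqs, ctx):.1f} GB KV")
```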
See the Qwen 2.5 32B VRAM breakdown for full per-precision tables.
Throughput, latency and concurrency limits
vLLM 0.6.3, AWQ Marlin, FP8 KV, eager mode, 512-token prompt, 256 generated tokens:
| Metric | Batch 1 | Batch 4 | Batch 8 |
|---|---|---|---|
| Decode tokens/sec (per stream) | 65 | 52 | 40 |
| Aggregate tokens/sec | 65 | 208 | 320 |
| Time to first token (ms) | 240 | 280 | 340 |
| Memory used (GB) | 20.2 | 21.5 | 22.8 |
Prefill at 8k prompt batch 1: ~2,100 t/s. The aggregate plateau at batch 8 is the hard limit on this card – past that, KV pressure forces preemption and effective throughput falls. For higher concurrency you need a 5090 or two cards. Compare against the dedicated Qwen 32B benchmark.
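To check the single-stream decode and TTFT figures against your own deployment, a simple probe of the OpenAI-compatible endpoint is enough. A minimal sketch – the base URL, API key and per-chunk token counting are assumptions for illustration, and counting streamed chunks only approximates true token throughput:

```python
import time
from openai import OpenAI

# Points at the vLLM server from the deployment section below.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
tokens = 0

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Explain paged attention in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1  # roughly one token per streamed chunk

decode_time = time.perf_counter() - first_token_at
print(f"TTFT:   {(first_token_at - start) * 1000:.0f} ms")
print(f"Decode: {tokens / decode_time:.0f} tokens/s (single stream)")
```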
Reasoning quality vs alternatives
| Model | MMLU | GSM8K | MATH | HumanEval | VRAM @ AWQ |
|---|---|---|---|---|---|
| Qwen 2.5 32B (AWQ) | 83.3 | 95.9 | 57.7 | 88.4 | 18 GB |
| Qwen 2.5 14B (FP8) | 79.7 | 90.2 | 55.6 | 86.6 | 9 GB |
| Llama 3.1 70B INT4 | 83.6 | 95.1 | 68.0 | 80.5 | 40 GB (2x 4090) |
| Mixtral 8x7B | 70.6 | 74.4 | 28.4 | 40.2 | 14 GB AWQ |
| Yi-34B | 73.5 | 72.0 | 21.0 | 60.4 | 19 GB |
Qwen 2.5 32B AWQ ties Llama 3.1 70B INT4 on MMLU and GSM8K and beats it on HumanEval, while running on a single 4090 instead of two. The MATH gap (57.7 vs 68.0) is real – if maths-heavy reasoning is your headline use case, that may push you to Llama 3.1 70B INT4 across two cards.
vLLM deployment and code
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --enable-prefix-caching
```
Two flags are critical. `--enforce-eager` avoids the CUDA graph memory premium (1-2 GB) – mandatory when you are already at 22 GB. `--gpu-memory-utilization 0.95` pushes vLLM to use the full envelope; the default 0.90 leaves 2.4 GB unused, which you cannot afford.
For longer contexts at the cost of concurrency:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 --enforce-eager
```
Chat template (ChatML, identical to other Qwen 2.5 variants):
```text
<|im_start|>system
You are a helpful, harmless and honest assistant.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
```
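You rarely need to build this string by hand – the chat endpoint applies it server-side – but for raw completions calls or offline checks the tokenizer can produce it. A minimal sketch, assuming the AWQ checkpoint ships the same chat template as the base Instruct model:

```python
# Produce the ChatML prompt above from the tokenizer's built-in template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct-AWQ")
messages = [
    {"role": "system", "content": "You are a helpful, harmless and honest assistant."},
    {"role": "user", "content": "{question}"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # should match the <|im_start|>/<|im_end|> layout above
```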
Our AWQ quantisation guide documents the calibration trade-offs that affect quality at this size.
Production gotchas (seven items)
- Eager mode is mandatory at 16k+: CUDA graphs add ~1.5 GB overhead. With weights at 18 GB you cannot afford them. Throughput penalty is ~6%.
- Cap `--max-num-seqs` at 8: above 8 concurrent sequences, KV pressure forces preemption and effective per-stream throughput collapses. Aggregate stays flat.
- Prefix caching is risky at this size: prefix blocks compete with user KV. If your system prompt is large, weigh the cache hit against the eviction risk. Disable it if you see `NoFreeBlocks` errors.
- YaRN to 131k is not production-grade: long-context retrieval drops from ~95% at 32k to ~70% at 128k. Use 16k native; chunk-summarise larger documents.
- Cold start is 45+ seconds: AWQ weight load is 25 seconds, eager-mode warmup another 20. Pre-warm before traffic – see the sketch after this list.
- Power and thermal headroom: at sustained batch 8 the 4090 will draw 440-450 W. In a chassis without good airflow it will throttle. Cap to 410 W with `nvidia-smi -pl 410` for stability – throughput loss is under 4%.
- Tokenizer stop tokens: pass `stop_token_ids=[151645, 151643]` to your client to cleanly catch `<|im_end|>` and `<|endoftext|>`.
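A pre-warm routine covers the cold-start and stop-token items together. A minimal sketch – `/health` as the readiness endpoint and `stop_token_ids` passed via `extra_body` are vLLM conventions worth verifying against the version you deploy:

```python
# Wait for the server to come up, then send one tiny request so the first
# real user does not pay the ~45 s cold start.
import time
import requests
from openai import OpenAI

BASE = "http://localhost:8000"

while True:
    try:
        if requests.get(f"{BASE}/health", timeout=2).status_code == 200:
            break
    except requests.RequestException:
        pass  # server still loading weights / warming up
    time.sleep(2)

client = OpenAI(base_url=f"{BASE}/v1", api_key="unused")
client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
    extra_body={"stop_token_ids": [151645, 151643]},  # <|im_end|>, <|endoftext|>
)
```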
When to pick this over alternatives
Pick Qwen 2.5 32B AWQ over Qwen 2.5 14B when the 14B's ~80 MMLU is not enough, or when GSM8K (95.9 vs 90.2) is the headline metric for your workload – typically agentic reasoning, long-form planning, or research-grade Q&A. Pick it over Llama 3.1 70B INT4 when you only have one GPU slot or budget for a single card; pick the 70B if you need its 68.0 on MATH or stronger multilingual coverage.
Pick it over Yi-34B on every English-centric workload (MMLU 83.3 vs 73.5, HumanEval 88.4 vs 60.4) – Yi only competes on Chinese. Pick it over Mixtral 8x7B on every benchmark – Mixtral’s only remaining edge is decode speed (85 vs 65 t/s) thanks to its MoE architecture. Step laterally to Qwen 2.5 Coder 32B for coding workloads (HumanEval 92.7 vs 88.4, MBPP 87.0).
Verdict
Qwen 2.5 32B AWQ on the RTX 4090 24GB is the most capable single-card open-weight deployment in the 2026 landscape. 65 t/s decode with up to 8 concurrent sequences, frontier-class reasoning quality, and a hardware bill that is a fraction of any datacentre alternative. The constraints are real – 16k context ceiling for concurrency, eager mode mandatory, 8-stream concurrency cap – but for a 12-engineer team running a self-hosted reasoning assistant, a research lab needing GPT-4-class output without API costs, or a 50-tenant document-Q&A SaaS, this is the most cost-effective deployment you can build in 2026. Compare with the 5090 upgrade decision if you need higher concurrency or longer contexts.
Run a 32B reasoning model on a single GPU
Qwen 2.5 32B AWQ on the RTX 4090 24GB. UK dedicated hosting from Manchester.
Order the RTX 4090 24GB
See also: Qwen 2.5 Coder 32B, Llama 70B INT4 deployment, 4090 vs 5090 32GB, Yi-34B on the 4090, monthly cost, vs H100, vs OpenAI API.