Qwen 2.5 32B-Instruct is the largest open-weight dense model that runs on a single RTX 4090 24GB without offloading – and it ranks alongside Llama 3.1 70B INT4 on most benchmarks despite using a single card instead of two. The trick is AWQ INT4 quantisation: weights compress to roughly 18 GB, leaving 6 GB for KV cache, paged-attention blocks and CUDA workspace. The RTX 4090 24GB is the cheapest path to a 32B-class assistant, and on our UK GPU hosting it delivers around 65 t/s decode with stable production behaviour at modest concurrency. This guide details architecture, the precise VRAM math that makes the fit work, throughput, the seven gotchas that will trip you up, and where Qwen 2.5 32B beats both larger models and the obvious alternatives.
Contents
- Why this is the tightest fit on a 4090
- Architecture and KV math
- VRAM budget breakdown
- Throughput, latency and concurrency limits
- Reasoning quality vs alternatives
- vLLM deployment and code
- Production gotchas (seven items)
- When to pick this over alternatives
- Verdict
Why this is the tightest fit on a 4090
FP16 weights are 64 GB – completely out of reach. FP8 weights are 32 GB – 8 GB over the 4090’s envelope, no good even with aggressive offload because every offloaded layer kills decode throughput. AWQ INT4 brings the model to 18 GB, leaving 6 GB. That 6 GB has to absorb KV cache, vLLM scheduler state, CUDA graphs, paged-attention blocks and any activation peaks. Get the configuration right and it works; push too hard on context length or concurrency and you OOM in the first hour of production traffic.
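The arithmetic behind those three numbers is easy to sanity-check. A minimal sketch, assuming a parameter count of roughly 32.5B and folding quantisation scales and higher-precision embeddings into a flat overhead term – both figures are approximations for illustration, not measured values:

```python
# Rough weight-footprint estimate. The 32.5e9 parameter count and the 1.7 GB
# INT4 overhead (scales, zero-points, embeddings kept at higher precision)
# are assumptions for illustration, not measured values.
PARAMS = 32.5e9

def weight_gb(bits_per_weight: float, overhead_gb: float = 0.0) -> float:
    """Approximate weight memory in decimal GB at a given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9 + overhead_gb

print(f"FP16 : {weight_gb(16):.0f} GB")       # ~64-65 GB - multiple cards
print(f"FP8  : {weight_gb(8):.0f} GB")        # ~32-33 GB - over the 24 GB envelope
print(f"INT4 : {weight_gb(4, 1.7):.0f} GB")   # ~18 GB - fits with ~6 GB left
```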
This is the only 32B-class dense model where consumer-card deployment is genuinely viable in 2026. The next step up is renting an RTX 5090 32GB (which fits FP8 with 0 GB headroom – so still requires AWQ for production) or paying considerably more for an H100 80GB.
Architecture and KV math
Qwen 2.5 32B-Instruct is a 64-layer transformer with 40 query heads, 8 key/value heads (5:1 GQA), 128-dim heads, a SwiGLU MLP with 27,648 intermediate dim and a 152,064-token vocabulary. Native context is 32,768; YaRN extends it to 131,072, but quality at the upper end is weak relative to the 7B/14B – we strongly recommend 16k or 32k for production. The license is Apache 2.0 (the 100M-MAU Qwen license clause applies to the 3B and 72B variants, not the 32B), so commercial use carries no fee.
KV cache per token at FP16: 8 KV heads x 128 dim x 64 layers x 2 (K+V) x 2 bytes = 262 KB. At FP8 it halves to 131 KB. At FP8 and 16k context that is 2.1 GB, which is the bulk of your headroom budget. This is why FP8 KV is mandatory and why 16k is the realistic ceiling for production concurrency.
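As a worked check, the per-token cost and the 16k FP8 figure fall straight out of the architecture numbers above:

```python
# KV-cache arithmetic from the architecture description above.
KV_HEADS, HEAD_DIM, LAYERS = 8, 128, 64

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # K and V each store kv_heads * head_dim values in every layer.
    return KV_HEADS * HEAD_DIM * LAYERS * 2 * dtype_bytes

print(kv_bytes_per_token(2))                 # 262,144 bytes/token at FP16
print(kv_bytes_per_token(1))                 # 131,072 bytes/token at FP8
print(kv_bytes_per_token(1) * 16_384 / 1e9)  # ~2.1 GB for 16k of FP8 KV
```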
VRAM budget breakdown
| Component | Size |
|---|---|
| AWQ INT4 weights | 18.0 GB |
| KV cache @ 16k FP8, 1 seq | 2.1 GB |
| vLLM scheduler + paged blocks | 1.2 GB |
| CUDA workspace (eager mode) | 0.6 GB |
| Activations peak | 0.8 GB |
| Total at batch 1, 16k | 22.7 GB |
| Free under 24 GB cap | 1.3 GB |

| Concurrency | Avg ctx per seq | KV total | Verdict |
|---|---|---|---|
| 1 seq | 16k | 2.1 GB | Comfortable |
| 4 seqs | 4k | 2.1 GB | Comfortable |
| 8 seqs | 4k | 4.2 GB | Tight – cap max-num-seqs at 8 |
| 16 seqs | 2k | 4.2 GB | Tight – works in eager mode only |
| 1 seq | 32k | 4.2 GB | Tight – works without batching |
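The KV-total column follows from the same per-token figure: total FP8 KV grows linearly with concurrent sequences times their average context. A quick helper to reproduce the numbers (the verdict thresholds are this guide's judgement, not something the maths decides):

```python
# Reproduces the KV-total column above.
KV_PER_TOKEN_FP8 = 131_072  # bytes, from the KV math section

def kv_total_gb(num_seqs: int, avg_ctx: int) -> float:
    return num_seqs * avg_ctx * KV_PER_TOKEN_FP8 / 1e9

for seqs, ctx in [(1, 16_384), (4, 4_096), (8, 4_096), (16, 2_048), (1, 32_768)]:
    print(f"{seqs:>2} seqs x {ctx:>6} ctx -> {kv_total_gb(seqs, ctx):.1f} GB KV")
```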
See the Qwen 2.5 32B VRAM breakdown for full per-precision tables.
Throughput, latency and concurrency limits
vLLM 0.6.3, AWQ Marlin, FP8 KV, eager mode, 512-token prompt, 256 generated tokens:
| Metric | Batch 1 | Batch 4 | Batch 8 |
|---|---|---|---|
| Decode tokens/sec (per stream) | 65 | 52 | 40 |
| Aggregate tokens/sec | 65 | 208 | 320 |
| Time to first token (ms) | 240 | 280 | 340 |
| Memory used (GB) | 20.2 | 21.5 | 22.8 |
Prefill at 8k prompt batch 1: ~2,100 t/s. The aggregate plateau at batch 8 is the hard limit on this card – past that, KV pressure forces preemption and effective throughput falls. For higher concurrency you need a 5090 or two cards. Compare against the dedicated Qwen 32B benchmark.
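To check the single-stream decode and TTFT figures against your own deployment, a simple probe of the OpenAI-compatible endpoint is enough. A minimal sketch – the base URL, API key and per-chunk token counting are assumptions for illustration, and counting streamed chunks only approximates true token throughput:

```python
import time
from openai import OpenAI

# Points at the vLLM server from the deployment section below.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
tokens = 0

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Explain paged attention in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1  # roughly one token per streamed chunk

decode_time = time.perf_counter() - first_token_at
print(f"TTFT:   {(first_token_at - start) * 1000:.0f} ms")
print(f"Decode: {tokens / decode_time:.0f} tokens/s (single stream)")
```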
Reasoning quality vs alternatives
| Model | MMLU | GSM8K | MATH | HumanEval | VRAM @ AWQ |
|---|---|---|---|---|---|
| Qwen 2.5 32B (AWQ) | 83.3 | 95.9 | 57.7 | 88.4 | 18 GB |
| Qwen 2.5 14B (FP8) | 79.7 | 90.2 | 55.6 | 86.6 | 9 GB |
| Llama 3.1 70B INT4 | 83.6 | 95.1 | 68.0 | 80.5 | 40 GB (2x 4090) |
| Mixtral 8x7B | 70.6 | 74.4 | 28.4 | 40.2 | 14 GB AWQ |
| Yi-34B | 73.5 | 72.0 | 21.0 | 60.4 | 19 GB |
Qwen 2.5 32B AWQ ties Llama 3.1 70B INT4 on MMLU and GSM8K and beats it on HumanEval, while running on a single 4090 instead of two. The MATH gap (57.7 vs 68.0) is real – if maths-heavy reasoning is your headline use case, that may push you to Llama 3.1 70B INT4 across two cards.
vLLM deployment and code
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --enable-prefix-caching
```
Two flags are critical. `--enforce-eager` avoids the CUDA graph memory premium (1-2 GB) – mandatory when you are already at 22 GB. `--gpu-memory-utilization 0.95` pushes vLLM to use the full envelope; the default 0.90 leaves 2.4 GB unused, which you cannot afford.
For longer contexts at the cost of concurrency:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 --enforce-eager
```
Chat template (ChatML, identical to other Qwen 2.5 variants):
```text
<|im_start|>system
You are a helpful, harmless and honest assistant.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
```
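You rarely need to build this string by hand – the chat endpoint applies it server-side – but for raw completions calls or offline checks the tokenizer can produce it. A minimal sketch, assuming the AWQ checkpoint ships the same chat template as the base Instruct model:

```python
# Produce the ChatML prompt above from the tokenizer's built-in template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct-AWQ")
messages = [
    {"role": "system", "content": "You are a helpful, harmless and honest assistant."},
    {"role": "user", "content": "{question}"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # should match the <|im_start|>/<|im_end|> layout above
```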
Our AWQ quantisation guide documents the calibration trade-offs that affect quality at this size.
Production gotchas (seven items)
- Eager mode is mandatory at 16k+: CUDA graphs add ~1.5 GB overhead. With weights at 18 GB you cannot afford them. Throughput penalty is ~6%.
- Cap `--max-num-seqs` at 8: above 8 concurrent sequences, KV pressure forces preemption and effective per-stream throughput collapses. Aggregate stays flat.
- Prefix caching is risky at this size: prefix blocks compete with user KV. If your system prompt is large, weigh the cache hit against the eviction risk. Disable it if you see `NoFreeBlocks` errors.
- YaRN to 131k is not production-grade: long-context retrieval drops from ~95% at 32k to ~70% at 128k. Use 16k native; chunk-summarise larger documents.
- Cold start is 45+ seconds: AWQ weight load is 25 seconds, eager-mode warmup another 20. Pre-warm before traffic – see the sketch after this list.
- Power and thermal headroom: at sustained batch 8 the 4090 will draw 440-450 W. In a chassis without good airflow it will throttle. Cap to 410 W with `nvidia-smi -pl 410` for stability – throughput loss is under 4%.
- Tokenizer stop tokens: pass `stop_token_ids=[151645, 151643]` to your client to cleanly catch `<|im_end|>` and `<|endoftext|>`.
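A pre-warm routine covers the cold-start and stop-token items together. A minimal sketch – `/health` as the readiness endpoint and `stop_token_ids` passed via `extra_body` are vLLM conventions worth verifying against the version you deploy:

```python
# Wait for the server to come up, then send one tiny request so the first
# real user does not pay the ~45 s cold start.
import time
import requests
from openai import OpenAI

BASE = "http://localhost:8000"

while True:
    try:
        if requests.get(f"{BASE}/health", timeout=2).status_code == 200:
            break
    except requests.RequestException:
        pass  # server still loading weights / warming up
    time.sleep(2)

client = OpenAI(base_url=f"{BASE}/v1", api_key="unused")
client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
    extra_body={"stop_token_ids": [151645, 151643]},  # <|im_end|>, <|endoftext|>
)
```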
When to pick this over alternatives
Pick Qwen 2.5 32B AWQ over Qwen 2.5 14B when the 14B's ~80 MMLU is not enough, or when GSM8K (95.9 vs 90.2) is the headline metric for your workload – typically agentic reasoning, long-form planning, or research-grade Q&A. Pick it over Llama 3.1 70B INT4 when you only have one GPU slot or budget for a single card; pick the 70B if you need its 68.0 on MATH or stronger multilingual coverage.
Pick it over Yi-34B on every English-centric workload (MMLU 83.3 vs 73.5, HumanEval 88.4 vs 60.4) – Yi only competes on Chinese. Pick it over Mixtral 8x7B on every benchmark – Mixtral’s only remaining edge is decode speed (85 vs 65 t/s) thanks to its MoE architecture. Step laterally to Qwen 2.5 Coder 32B for coding workloads (HumanEval 92.7 vs 88.4, MBPP 87.0).
Verdict
Qwen 2.5 32B AWQ on the RTX 4090 24GB is the most capable single-card open-weight deployment in the 2026 landscape. 65 t/s decode with up to 8 concurrent sequences, frontier-class reasoning quality, and a hardware bill that is a fraction of any datacentre alternative. The constraints are real – 16k context ceiling for concurrency, eager mode mandatory, 8-stream concurrency cap – but for a 12-engineer team running a self-hosted reasoning assistant, a research lab needing GPT-4-class output without API costs, or a 50-tenant document-Q&A SaaS, this is the most cost-effective deployment you can build in 2026. Compare with the 5090 upgrade decision if you need higher concurrency or longer contexts.
Run a 32B reasoning model on a single GPU
Qwen 2.5 32B AWQ on the RTX 4090 24GB. UK dedicated hosting from Manchester.
Order the RTX 4090 24GB
See also: Qwen 2.5 Coder 32B, Llama 70B INT4 deployment, 4090 vs 5090 32GB, Yi-34B on the 4090, monthly cost, vs H100, vs OpenAI API.