Qwen 2.5 32B is the strongest open-weight model in the 25-35B class, beating Llama 3.1 70B on MATH and HumanEval and matching it on most knowledge benchmarks at less than half the parameter count. AWQ INT4 quantisation lets it run comfortably on a single RTX 4090 24GB dedicated server with room for serious batching, hosted from our UK datacentre. This post works through the full cost economics: monthly capacity at every realistic utilisation tier, volume tables from 10M to 10B tokens, MAU and concurrency sizing, $/M-token comparisons, break-even calculations against both API providers and managed Qwen endpoints, hidden costs you should plan for, and a 12-month TCO model.
Contents
- Why Qwen 2.5 32B
- VRAM and concurrency math
- Monthly cost basis and hidden costs
- Throughput and capacity tiers
- Volume tables (10M to 10B tokens)
- MAU and concurrency tiers
- $/M tokens and break-even
- 12-month TCO and verdict
Why Qwen 2.5 32B
| Benchmark | Qwen 2.5 32B | Llama 3.1 70B | GPT-4o-mini | Claude 3 Haiku |
|---|---|---|---|---|
| MMLU | 83.3 | 86.0 | 82.0 | 75.2 |
| HumanEval | 88.4 | 80.5 | 87.2 | 75.9 |
| MATH | 83.1 | 68.0 | 70.2 | 40.9 |
| IFEval | 79.5 | 87.5 | 80.5 | 76.0 |
| MT-Bench | 8.62 | 8.61 | 8.36 | 8.10 |
Qwen wins on code and maths, sits very close to 70B on knowledge, and runs roughly 3.5x faster on the same hardware because of its smaller parameter count (32B vs 70B) and a shallower stack (64 layers vs 80, with the same 8-head GQA). Architecturally: 64 layers, 8 KV heads, head_dim 128, so the per-token KV cache is compact. See the Qwen 32B benchmark for raw throughput data.
VRAM and concurrency math
KV cost per token is 2 (K and V) * 64 layers * 8 KV heads * 128 head_dim * 1 byte (FP8) = 131,072 bytes = 128 KB/token. Per layer that is lighter than Phi-3 Medium, though the deeper 64-layer stack makes the per-token total denser than Mistral Nemo's.
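As a sanity check, the same arithmetic in Python, straight from the layer and head counts above:

```python
# KV-cache cost per token for Qwen 2.5 32B: K and V planes across
# 64 layers x 8 KV heads x head_dim 128, at 1 byte/element for FP8.
layers, kv_heads, head_dim, fp8_bytes = 64, 8, 128, 1

kv_per_token = 2 * layers * kv_heads * head_dim * fp8_bytes
print(kv_per_token)                    # 131072 bytes = 128 KB/token
print(8192 * kv_per_token / 2**30)     # 1.0 -> a full 8k context costs ~1 GiB
```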
| Quant | Weights | KV @ 8k FP8 | Total per stream | Realistic batch on 24 GB |
|---|---|---|---|---|
| BF16 | 61 GB | 2.0 GB | 63 GB | 0 (no fit) |
| FP8 W8A8 | 32 GB | 1.0 GB | 33 GB | 0 (no fit) |
| AWQ INT4 | 18.5 GB | 1.0 GB | 19.5 GB | 4-8 streams @ 8k ctx |
| GPTQ INT4 | 18.0 GB | 1.0 GB | 19.0 GB | 4-8 streams @ 8k ctx |
| GGUF Q4_K_M (llama.cpp) | ~19 GB | 1.0 GB | 20 GB | 1-2 streams (no PagedAttention) |
AWQ INT4 with an FP8 KV cache is the production sweet spot. FP8 W8A8 weights alone are 32 GB, which cannot fit on a single 4090 at any realistic context; that is a 32 GB+ workload (5090 territory).
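A minimal sketch of the fit check behind the batch column. The 1.5 GB runtime overhead figure (CUDA context, activations, scheduler buffers) is an assumption, not a measurement:

```python
# How many full 8k-context streams fit alongside AWQ INT4 weights on 24 GB.
vram_gb          = 24.0
weights_gb       = 18.5   # AWQ INT4, from the table above
overhead_gb      = 1.5    # assumption: CUDA context, activations, buffers
kv_per_stream_gb = 1.0    # 8k tokens x 128 KB/token at FP8

free_for_kv = vram_gb - weights_gb - overhead_gb
print(int(free_for_kv // kv_per_stream_gb))   # ~4 streams at a full 8k each
# vLLM's PagedAttention allocates KV blocks on demand, so real traffic
# (where few requests hold a full 8k window) stretches this toward 8.
```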
Monthly cost basis and hidden costs
| Component | Cost / month | Notes |
|---|---|---|
| 4090 dedicated UK | £500-650 (~$700) | Includes server, power, cooling, IPMI |
| Bandwidth | included | 1 Gbps unmetered typical |
| Storage 2 TB NVMe | included | Enough for several model variants |
| Backup / object storage | £10-30 | For model artifacts, logs |
| Monitoring (Grafana Cloud or self-host) | £0-30 | Free tier sufficient at this scale |
| Engineer time, ongoing ops | ~2 hrs/week (time, not cash) | Updates, monitoring, incidents |
| Initial setup engineer time | ~10-15 hrs one-off (time, not cash) | vLLM, auth, Grafana, runbook |
The rest of this post models $700/month all-in. Compare with cloud GPU rentals: RunPod community 4090 at $0.34/hr is ~$248/month but spot-priced with no SLA; RunPod secure at $0.69/hr is ~$497/month; Lambda 4090 at $0.50/hr is ~$365/month. Dedicated UK hosting costs more per hour than spot, but provides a static IP, predictable networking, and no scheduler eviction risk.
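For reference, the rental figures above are just the hourly rate times hours in a month; small differences come from whether you assume 720 or 730 hours:

```python
# Hourly GPU rental converted to a monthly figure (~730 hours/month).
rates = {"RunPod community": 0.34, "RunPod secure": 0.69, "Lambda": 0.50}
for name, hourly in rates.items():
    print(f"{name}: ${hourly * 730:.0f}/month")
# RunPod community: $248, RunPod secure: $504, Lambda: $365
```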
Throughput and capacity tiers
| Concurrent streams | Per-stream t/s | Aggregate t/s | Tokens/day @ 100% |
|---|---|---|---|
| 1 | 65 | 65 | 5.6 M |
| 2 | 58 | 116 | 10.0 M |
| 4 | 45 | 180 | 15.5 M |
| 6 | 34 | 204 | 17.6 M |
| 8 | 27.5 | 220 | 19.0 M |
| 12 | 21 | 252 | 21.8 M |
| 16 (KV cap risk) | 17 | 272 | 23.5 M |
The sweet spot is batch 8-12 at 220-250 aggregate t/s. Above batch 8, KV pressure starts to dominate; above batch 16, you risk preemption under traffic spikes. A realistic sustained target is 220 t/s, which works out to 19M tokens/day and 570M tokens/month.
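The capacity numbers are plain multiplication from aggregate decode throughput:

```python
# Sustained capacity from aggregate decode throughput.
aggregate_tps = 220                        # sweet-spot sustained target
tokens_per_day   = aggregate_tps * 86_400
tokens_per_month = tokens_per_day * 30
print(f"{tokens_per_day/1e6:.1f}M tokens/day")      # 19.0M
print(f"{tokens_per_month/1e6:.0f}M tokens/month")  # 570M
```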
Volume tables (10M to 10B tokens)
| Volume / month | Average util on 4090 | Cost on 4090 | Cost on Together Qwen 32B ($0.40/M) | Cost on GPT-4o ($5/M) | Cost on Claude Sonnet ($7/M) |
|---|---|---|---|---|---|
| 10 M | 1.8% | $700 | $4 | $50 | $70 |
| 50 M | 8.8% | $700 | $20 | $250 | $350 |
| 100 M | 17.5% | $700 | $40 | $500 | $700 |
| 500 M | 88% | $700 | $200 | $2,500 | $3,500 |
| 1 B | need 2x cards | $1,400 | $400 | $5,000 | $7,000 |
| 5 B | need ~9x cards | $6,300 | $2,000 | $25,000 | $35,000 |
| 10 B | need ~18x cards | $12,600 | $4,000 | $50,000 | $70,000 |
Three clear regimes. Below ~150M tokens/month, hosted APIs are cheaper because you’re not utilising the GPU. Between 150M and 570M, a dedicated 4090 wins decisively against frontier-priced APIs. Above 1B, you’re either fanning out to multiple 4090s (fine) or considering a 6000 Pro or H100 instead. See 4090 vs H100 for that decision.
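A sketch of how the volume table is built, assuming $700/card/month, the 570M tokens/month per-card cap derived above, and the blended API prices from the table:

```python
# Monthly cost by volume: dedicated 4090(s) vs per-token APIs.
import math

CARD_COST_USD, CARD_CAP_TOKENS = 700, 570e6
apis = {"Together Qwen 32B": 0.40, "GPT-4o": 5.00, "Claude Sonnet": 7.00}

for tokens in (10e6, 100e6, 500e6, 1e9, 5e9, 10e9):
    cards = max(1, math.ceil(tokens / CARD_CAP_TOKENS))
    api_costs = ", ".join(f"{n}: ${tokens / 1e6 * p:,.0f}" for n, p in apis.items())
    print(f"{tokens/1e6:>6.0f}M -> {cards}x 4090 = ${cards * CARD_COST_USD:,} | {api_costs}")
```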
MAU and concurrency tiers
Token consumption per active user varies wildly by product. Below are realistic averages for five common scenarios.
| Product type | Tokens / active user / month | Users on 1x 4090 (570M cap) | Peak concurrent |
|---|---|---|---|
| Customer-support chat (5-turn avg) | ~12,000 | ~47,000 MAU | ~30 active |
| RAG knowledge assistant (long context) | ~30,000 | ~19,000 MAU | ~12 active |
| Coding assistant (heavy session) | ~150,000 | ~3,800 MAU | ~5 active |
| Background classification (no UX) | n/a (batch) | ~570M tokens classified | 8-12 batched |
| Email summarisation (1k in / 200 out) | ~36,000 | ~16,000 MAU | ~10 active |
Sizing heuristic: a single 4090 with Qwen 32B handles a SaaS with 15-50k MAU comfortably depending on session intensity. Past that, scale to 2-3 cards or move flagship traffic to H100. Cross-reference with the concurrent users guide.
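The MAU column is just the monthly cap divided by per-user consumption; a minimal version of the sizing heuristic:

```python
# MAU supportable on one 4090, given tokens per active user per month.
MONTHLY_CAP = 570e6   # sustained per-card capacity from earlier

profiles = {
    "support chat":     12_000,
    "RAG assistant":    30_000,
    "coding assistant": 150_000,
    "email summaries":  36_000,
}
for name, tokens_per_user in profiles.items():
    print(f"{name}: ~{MONTHLY_CAP / tokens_per_user:,.0f} MAU")
# support chat ~47,500 | RAG ~19,000 | coding ~3,800 | email ~15,833
```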
$/M tokens and break-even
| Provider / model | Blended $/M | 4090 break-even tokens/mo | 4090 capacity | Headroom |
|---|---|---|---|---|
| 4090 + Qwen 32B @ 90% util | $1.07 | baseline | 654 M | n/a |
| 4090 + Qwen 32B @ 70% util | $1.38 | baseline | 508 M | n/a |
| OpenAI GPT-4o ($5) | $5.00 | 140 M | 654 M | 4.7x past break-even |
| OpenAI GPT-4o-mini ($0.30) | $0.30 | 2.33 B | 654 M | API wins (GPU saturates) |
| Claude Sonnet ($7) | $7.00 | 100 M | 654 M | 6.5x past break-even |
| Claude Haiku ($0.58) | $0.58 | 1.21 B | 654 M | API wins |
| Together Qwen 32B ($0.40) | $0.40 | 1.75 B | 654 M | API wins (use Together below 500M) |
The takeaway: at production utilisation, self-hosted Qwen 32B lands at roughly $1.07/M ($700 over 654M tokens), around a fifth of GPT-4o-class API pricing and well below Claude Sonnet. Below ~150M tokens/month, hosted APIs are cheaper. Above that, the dedicated 4090 wins decisively against frontier-priced APIs. Against managed Qwen 32B endpoints (Together, Fireworks), the pure-price cross-over sits around 1.75B tokens/month, past a single 4090’s capacity, so on $/M alone Together wins at moderate volumes; the case for dedicated in the 150M-570M band rests on predictable latency, a static IP, and UK data residency rather than raw price.
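The break-even column is the fixed monthly cost divided by each API's blended price, checked against the card's capacity:

```python
# Break-even: monthly tokens at which a $700/month 4090 matches an API.
MONTHLY_COST_USD = 700
CAPACITY_M       = 654    # table's 90%-utilisation capacity, in M tokens

for name, usd_per_m in [("GPT-4o", 5.00), ("Claude Sonnet", 7.00),
                        ("Claude Haiku", 0.58), ("Together Qwen 32B", 0.40)]:
    break_even_m = MONTHLY_COST_USD / usd_per_m
    verdict = "4090 wins in-capacity" if break_even_m < CAPACITY_M else "API wins"
    print(f"{name}: break-even at {break_even_m:,.0f}M tokens/month ({verdict})")
```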
12-month TCO and verdict
| Volume tier | Best provider | 12-month cost | vs alternative |
|---|---|---|---|
| 10-100 M tokens/mo | Together / Anyscale Qwen | $50-500/yr | 4090 wastes capacity |
| 100-500 M tokens/mo | Dedicated 4090 | $8,400 | vs $8,400-42,000 on Sonnet |
| 500 M-1 B tokens/mo | Dedicated 4090 (2nd card past ~570M) | $8,400-16,800 | vs $42,000-84,000 on Sonnet |
| 1-3 B tokens/mo | 2-3x 4090 or 1x H100 | $16,800-25,200 | 70-90% under frontier APIs |
| 3-10 B tokens/mo | H100 fleet or RTX 6000 Pro | varies | see 4090 vs H100 |
Verdict. A single 4090 running Qwen 2.5 32B AWQ is the most cost-effective Sonnet-class self-hosted deployment on the market for the 100M-570M tokens/month band. Below that, use a managed Qwen endpoint. Above 1B tokens/month, fan out to multiple 4090s or upgrade to H100. The hidden costs (ops, monitoring, initial setup) are real but small relative to the API savings: you recover them in the first month at any volume past 200M tokens.
Run Qwen 2.5 32B in the UK
AWQ INT4 on a single 4090, ~220 aggregate t/s, $1.07/M at production util. UK dedicated hosting.
Order the RTX 4090 24GB

See also: 4090 for Qwen 32B, Qwen 32B benchmark, AWQ guide, vLLM setup, vs OpenAI, vs Anthropic, break-even calculator, ROI analysis, monthly hosting cost.