Serving a 70B model on a single GPU used to mean an A100 80 GB at $2-3 per hour. Thanks to AWQ INT4 quantisation and vLLM’s Marlin kernels, an RTX 4090 24GB dedicated server can host Llama 3.1 70B comfortably and serve it at a price per million tokens that undercuts frontier APIs such as GPT-4o and Claude Sonnet once volume passes roughly 100 M tokens per month. This article works the numbers from monthly fee through capacity to effective $/M token, with MAU break-even tables and a 12-month TCO comparison on GigaGPU dedicated hosting.
Contents
- Llama 3.1 70B AWQ on a 4090
- vLLM launch and tuning
- Monthly fixed cost
- Throughput in production
- Tokens per month and volume tables
- Effective $/M token and MAU break-even
- 12-month TCO vs hosted APIs
- Production gotchas
Llama 3.1 70B AWQ on a 4090
AWQ packs the 70B weights at 4 bits per weight, around 18.9 GB resident. Add KV cache, vLLM scheduler overhead and CUDA context: ~22-23 GB on a 24 GB card. With --gpu-memory-utilization 0.95 and a 16k-token context cap, vLLM serves 4 concurrent streams comfortably with KV-cache headroom for prefill bursts.
| Quantisation | Weights VRAM | Decode t/s | Notes |
|---|---|---|---|
| AWQ INT4 (marlin) | 18.9 GB | 22-24 | Production sweet spot |
| GPTQ INT4 | 18.9 GB | 20-22 | Marginally slower; widely supported |
| GGUF Q4_K_M | ~21 GB | 14-16 | llama.cpp; no vLLM batching |
| FP8 (dual GPU) | n/a on a single card | n/a | Needs 2x cards with NVLink-style fabric |
| BF16 | 140 GB | n/a on a single card | Multi-card territory |
For the deployment recipe see the Llama 70B INT4 deployment guide; for the underlying benchmark numbers see 70B INT4 benchmark.
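As a quick sanity check on the VRAM budget above, the arithmetic is short; a minimal sketch using the figures quoted in this article (the ~0.8 GB CUDA context overhead is an assumed round number, not a measured value):

```python
# Rough VRAM budget for Llama 3.1 70B AWQ INT4 on a 24 GB RTX 4090.
# Weights size and utilisation are the article's figures; the CUDA
# context overhead is an assumption.
TOTAL_VRAM_GB = 24.0
GPU_MEM_UTILISATION = 0.95        # --gpu-memory-utilization
WEIGHTS_GB = 18.9                 # AWQ INT4 packed weights
CUDA_CONTEXT_GB = 0.8             # assumed context + allocator overhead

budget_gb = TOTAL_VRAM_GB * GPU_MEM_UTILISATION
kv_headroom_gb = budget_gb - WEIGHTS_GB - CUDA_CONTEXT_GB

print(f"vLLM budget:          {budget_gb:.1f} GB")
print(f"Weights:              {WEIGHTS_GB:.1f} GB")
print(f"Left for KV + sched:  {kv_headroom_gb:.1f} GB")
# The few GB left over for KV blocks is why --max-model-len and
# --max-num-seqs are capped the way the launch line below caps them.
```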
vLLM launch and tuning
The canonical launch line, with FP8 KV cache to claw back VRAM:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 16384 --max-num-seqs 4 \
  --gpu-memory-utilization 0.95
```
Key choices: awq_marlin picks the optimised Ada kernel; kv-cache-dtype fp8 halves KV memory at no measurable accuracy cost on Ada; max-num-seqs 4 matches the achievable concurrency given KV constraints; gpu-memory-utilization 0.95 leaves just enough headroom for CUDA context. For larger batches drop max-model-len to 8192.
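Because vLLM exposes an OpenAI-compatible API, any standard client can talk to the box. A minimal smoke test with the openai Python package, assuming the default localhost:8000 port and a placeholder API key:

```python
# Minimal smoke test against vLLM's OpenAI-compatible endpoint.
# Assumes the server from the launch line above is listening on
# localhost:8000 (vLLM's default); the API key is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Summarise AWQ quantisation in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)

print(response.choices[0].message.content)
print(response.usage)   # prompt/completion token counts, useful for cost tracking
```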
Monthly fixed cost
| Line | Cost |
|---|---|
| 4090 dedicated server | £500-650 |
| Egress (1 Gbps unmetered) | £0 |
| Storage 2 TB NVMe | included |
| IPv4 + remote hands | included |
| Monitoring (Prometheus + Grafana on host) | £0 (self-hosted) |
| Engineer time (1 hr/week) | ~£350/month |
| Total cash outlay | ~£500-650 (~$640-820) |
| Total loaded cost | ~£850-1000 (~$1,090-1,280) |
For modelling we will use $700/month cash outlay and $1,150/month loaded cost as midpoints.
Throughput in production
vLLM 0.6 with continuous batching produces these numbers on the 4090:
| Concurrent streams | Per-stream t/s | Aggregate t/s | TTFT median |
|---|---|---|---|
| 1 | 24 | 24 | 110 ms |
| 2 | 22 | 44 | 140 ms |
| 4 | 18 | 72 | 220 ms |
| 6 | 15 | 90 | 320 ms |
| 8 | 12 | 96 | 500 ms |
Batch 4-6 is the sweet spot, around 72-90 aggregate t/s. We model on 80 t/s sustained for the rest of the article. For raw concurrency see concurrent users.
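To reproduce these figures on your own card, a rough concurrency probe along the following lines is enough. It is a sketch rather than a rigorous benchmark; the prompt, token budget and endpoint URL are assumptions:

```python
# Rough aggregate-throughput probe: fire N concurrent requests at the
# local vLLM endpoint and divide completion tokens by wall-clock time.
import asyncio
import time

from openai import AsyncOpenAI

MODEL = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

async def one_stream(client: AsyncOpenAI) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write a 300-word product description."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
    for n in (1, 2, 4, 6, 8):
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_stream(client) for _ in range(n)))
        elapsed = time.perf_counter() - start
        total = sum(tokens)
        print(f"{n} streams: {total} tokens in {elapsed:.1f}s "
              f"= {total / elapsed:.1f} aggregate t/s")

asyncio.run(main())
```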
Tokens per month and volume tables
| Utilisation | Tokens/day | Tokens/month | Realistic workload |
|---|---|---|---|
| 20% (4-5 h/day busy) | 1.4 M | 42 M | Internal tool, small team |
| 50% (12 h/day busy) | 3.5 M | 105 M | SMB chat, business hours |
| 70% | 4.8 M | 145 M | Production B2B SaaS |
| 90% (sustained) | 6.2 M | 187 M | Continuous batch jobs |
| 100% theoretical | 6.9 M | 207 M | Capacity ceiling |
How long does it take a 4090 to serve different volume targets?
| Volume target | Wall clock at 80 t/s | Monthly utilisation needed |
|---|---|---|
| 10 M tokens | 1.4 days | 5% |
| 100 M tokens | 14.5 days | 48% |
| 1 B tokens | 145 days | Above capacity (need 5x 4090) |
| 10 B tokens | 1,450 days | Above capacity (need 50x 4090) |
The 70B model on a single 4090 hits its capacity ceiling around 200 M tokens/month. For higher volumes either run smaller models (Qwen 32B does 654 M/month, see Qwen 32B cost) or scale horizontally across multiple cards via multi-card pairing.
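The capacity arithmetic behind both tables is a one-liner; here it is as a small sketch at the 80 t/s sustained figure modelled above (small differences from the tables are rounding):

```python
# Capacity arithmetic behind the two tables above, at 80 t/s sustained.
AGGREGATE_TPS = 80
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30

def tokens_per_month(utilisation: float) -> float:
    return AGGREGATE_TPS * SECONDS_PER_DAY * DAYS_PER_MONTH * utilisation

for util in (0.2, 0.5, 0.7, 0.9, 1.0):
    print(f"{util:>4.0%} utilisation -> {tokens_per_month(util) / 1e6:.0f} M tokens/month")

def days_for_volume(tokens: float) -> float:
    # Wall-clock days to serve a volume target at 80 t/s non-stop.
    return tokens / (AGGREGATE_TPS * SECONDS_PER_DAY)

for volume in (10e6, 100e6, 1e9, 10e9):
    print(f"{volume / 1e6:>6.0f} M tokens -> {days_for_volume(volume):.1f} days at full tilt")
```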
Effective $/M token and MAU break-even
| Utilisation | Tokens/month | Cost/M tokens (cash) | Cost/M tokens (loaded) |
|---|---|---|---|
| 20% | 42 M | $16.67 | $27.40 |
| 50% | 105 M | $6.67 | $10.95 |
| 70% | 145 M | $4.83 | $7.93 |
| 90% | 187 M | $3.74 | $6.15 |
| 100% | 207 M | $3.38 | $5.55 |
MAU break-even, assuming 25k tokens per active user per month (typical chat/RAG):
| Utilisation | Tokens/month | MAU served | Cost per MAU |
|---|---|---|---|
| 50% | 105 M | 4,200 | $0.17 |
| 70% | 145 M | 5,800 | $0.12 |
| 90% | 187 M | 7,500 | $0.09 |
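Both tables follow from the $700 cash and $1,150 loaded monthly midpoints chosen earlier plus the 25k tokens-per-MAU assumption; a minimal sketch of the arithmetic:

```python
# Effective $/M token and MAU break-even from this article's midpoints:
# $700 cash, $1,150 loaded per month, 25k tokens per monthly active user.
CASH_MONTHLY_USD = 700
LOADED_MONTHLY_USD = 1_150
TOKENS_PER_MAU = 25_000

def cost_per_million(monthly_cost: float, tokens_per_month: float) -> float:
    return monthly_cost / (tokens_per_month / 1e6)

for label, tokens in [("50%", 105e6), ("70%", 145e6), ("90%", 187e6)]:
    cash = cost_per_million(CASH_MONTHLY_USD, tokens)
    loaded = cost_per_million(LOADED_MONTHLY_USD, tokens)
    mau = tokens / TOKENS_PER_MAU
    print(f"{label}: ${cash:.2f}/M cash, ${loaded:.2f}/M loaded, "
          f"{mau:,.0f} MAU at ${CASH_MONTHLY_USD / mau:.2f} per MAU")
```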
12-month TCO vs hosted APIs
Hosted Llama 3.1 70B (Together, Fireworks, DeepInfra) averages $0.85-0.90 per million tokens blended. Compare 12-month TCO at three volume points:
| Volume / month | Hosted API 12-mo | 4090 dedicated 12-mo (cash) | Winner |
|---|---|---|---|
| 10 M tokens | $106 | $8,400 | Together by 80x |
| 50 M tokens | $528 | $8,400 | Together by 16x |
| 100 M tokens | $1,056 | $8,400 | Together by 8x |
| 200 M tokens (4090 cap) | $2,112 | $8,400 | Together by 4x |
| 500 M tokens | $5,280 | 3×4090 = $25,200 | Together by 5x |
| vs GPT-4o $5/M, 100 M | $6,000 (GPT-4o) | $8,400 | API by 1.4x |
| vs GPT-4o $5/M, 200 M | $12,000 (GPT-4o) | $8,400 | 4090 by 1.4x |
| vs Claude Sonnet $7/M, 100 M | $8,400 | $8,400 | Tie |
| vs Claude Sonnet $7/M, 200 M | $16,800 | $8,400 | 4090 by 2x |
The 70B-on-4090 economics work best when you compare against premium frontier APIs (GPT-4o, Sonnet, Opus), not against hosted Llama. For absolute cheapest cents-per-token Llama, hosted Together is hard to beat. For everything else, dedicated wins.
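The crossover against any per-token API is simply the fixed monthly cost divided by the API's $/M rate, capped by the single-card ceiling; a sketch using the prices quoted above:

```python
# Monthly volume above which the fixed-cost 4090 beats a per-token API.
# API prices are the blended figures quoted in this article.
MONTHLY_COST_USD = 700            # 4090 dedicated, cash midpoint
CAPACITY_TOKENS = 207e6           # single-card ceiling at 80 t/s

API_PRICES_PER_M = {
    "Hosted Llama 3.1 70B": 0.88,
    "GPT-4o": 5.00,
    "Claude Sonnet": 7.00,
    "Claude Opus": 35.00,
}

for name, price in API_PRICES_PER_M.items():
    breakeven_tokens = MONTHLY_COST_USD / price * 1e6
    within = breakeven_tokens <= CAPACITY_TOKENS
    print(f"{name:<22} break-even at {breakeven_tokens / 1e6:>6.0f} M tokens/month "
          f"({'within' if within else 'above'} single-4090 capacity)")
```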
| Comparison | API | 4090 dedicated |
|---|---|---|
| $/M Llama 3.1 70B | $0.88 | $3.74-16.67 by utilisation |
| $/M GPT-4o | $5.00 | $3.74 (90% util) |
| $/M Claude Sonnet | $7.00 | $3.74 (90% util) |
| $/M Claude Opus | $35.00 | $3.74 (90% util) |
| Data residency | US/multi | UK |
| Privacy | provider sees prompts | your box |
| Custom fine-tunes | limited or none | any LoRA via QLoRA |
| Rate limits | per-tier caps | your hardware only |
| Latency from UK | ~250 ms TTFT | ~110 ms TTFT |
Production gotchas
- KV cache OOM on long prompts: a 70B with 16k context can consume 8+ GB of KV per stream. Cap max-model-len aggressively or you OOM under burst.
- FP8 KV cache rounding: at very long contexts (>12k) FP8 KV starts to drift on attention scores. For long-context legal/medical workloads use BF16 KV at the cost of half the concurrency.
- Together rate limits: hosted 70B has per-org QPS caps; if your spike exceeds them, you get 429s. Self-hosted has no such cap until VRAM saturates.
- AWQ accuracy regression: INT4 loses 1-2 points on MMLU vs BF16. For most chat that is invisible; for hard reasoning evals at scale, consider FP8 deployment on a 5090 or H100 instead.
- vLLM updates breaking AWQ: vLLM is on a fast release cadence and AWQ support has had churn. Pin vllm==0.6.3 and test before upgrading.
- Cold-start cost: 70B AWQ takes ~90 seconds to load. Persistence mode and a warm pool process matter; do not cycle the server casually (a minimal readiness probe is sketched after this list).
- Power and thermals at 4 streams sustained: 430-440 W draw for hours; verify thermals ahead of multi-day runs.
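For the cold-start point, a readiness probe along these lines keeps traffic off a box that is still loading weights; the health route, retry budget and warm-up prompt are assumptions for a default local deployment:

```python
# Minimal warm-up probe for the ~90 s cold start: poll the health
# endpoint, then send one tiny completion before marking the node ready.
# URL, retry budget and prompt are illustrative assumptions.
import time

import requests
from openai import OpenAI

BASE = "http://localhost:8000"

def wait_until_ready(timeout_s: int = 300) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE}/health", timeout=2).status_code == 200:
                break
        except requests.RequestException:
            pass
        time.sleep(5)
    else:
        raise RuntimeError("vLLM did not become healthy in time")

    # One short generation to warm caches before joining the pool.
    client = OpenAI(base_url=f"{BASE}/v1", api_key="not-used")
    client.chat.completions.create(
        model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=8,
    )

wait_until_ready()
print("node warm, safe to add to the pool")
```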
Verdict
Self-hosted Llama 3.1 70B AWQ on a single 4090 lands at $3.74-4.83 per million tokens (cash) at 70-90% utilisation. That beats GPT-4o ($5) and Claude Sonnet ($7) on cash $/M, and it is significantly cheaper at frontier-quality output than per-token APIs once you exceed roughly 100-150 M tokens per month. For raw cheapest-Llama, hosted Together is unbeatable; for privacy, UK residency, custom adapters and predictable bills, the dedicated 4090 wins. Cross-check the case with the vs OpenAI and vs Anthropic comparisons and the break-even calculator.
Self-host Llama 3.1 70B in the UK
AWQ INT4 on a single 4090, dedicated. UK dedicated hosting.
Order the RTX 4090 24GB
See also: 70B INT4 deployment, 70B INT4 benchmark, AWQ guide, vs OpenAI, vs Anthropic, vs Together AI, break-even, monthly hosting cost, 4090 for Llama 70B.