
Cost of Serving Llama 3 70B AWQ on RTX 4090 24GB: Capacity, $/M and Break-Even

Monthly cost, throughput, MAU break-even and 12-month TCO of running Llama 3.1 70B AWQ INT4 on a single RTX 4090 24GB dedicated server.

Serving a 70B model on a single GPU used to mean an A100 80 GB at $2-3 per hour. Thanks to AWQ INT4 quantisation and vLLM’s Marlin kernels, an RTX 4090 24GB dedicated server can host Llama 3.1 70B comfortably and serve it at a price per million tokens that undercuts premium hosted APIs above modest volumes. This article works the numbers from monthly fee through capacity to effective $/M tokens, with MAU break-even tables and a 12-month TCO comparison on GigaGPU dedicated hosting.

Llama 3.1 70B AWQ on a 4090

AWQ packs the 70B weights at 4 bits per weight, around 18.9 GB resident. Add KV cache, vLLM scheduler overhead and CUDA context: ~22-23 GB on a 24 GB card. With --gpu-memory-utilization 0.95 and a 16k-token context cap, vLLM serves 4 concurrent streams comfortably with KV-cache headroom for prefill bursts.
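
A quick back-of-envelope check of that budget, using only the figures above (a sketch; the 1 GB CUDA-context allowance is an assumed round number, and vLLM's own accounting differs slightly):

# Rough VRAM budget for Llama 3.1 70B AWQ on a 24 GB card.
# Figures from the paragraph above; the CUDA overhead constant is assumed.
TOTAL_VRAM_GB = 24.0
GPU_MEM_UTIL = 0.95        # matches --gpu-memory-utilization 0.95
WEIGHTS_GB = 18.9          # AWQ INT4 packed weights
CUDA_OVERHEAD_GB = 1.0     # CUDA context + allocator slack (assumption)

budget_gb = TOTAL_VRAM_GB * GPU_MEM_UTIL             # what vLLM may allocate
kv_pool_gb = budget_gb - WEIGHTS_GB - CUDA_OVERHEAD_GB
print(f"vLLM budget:   {budget_gb:.1f} GB")          # 22.8 GB
print(f"KV-cache pool: {kv_pool_gb:.1f} GB shared across 4 streams")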

| Quantisation | Weights VRAM | Decode t/s | Notes |
|---|---|---|---|
| AWQ INT4 (Marlin) | 18.9 GB | 22-24 | Production sweet spot |
| GPTQ INT4 | 18.9 GB | 20-22 | Marginally slower; widely supported |
| GGUF Q4_K_M | ~21 GB | 14-16 | llama.cpp; no vLLM batching |
| FP8 (dual GPU) | n/a (single card) | n/a | Needs 2x cards with NVLink-style fabric |
| BF16 | 140 GB | n/a (single card) | Multi-card territory |

For the deployment recipe see the Llama 70B INT4 deployment guide; for the underlying benchmark numbers see 70B INT4 benchmark.

vLLM launch and tuning

The canonical launch line, with FP8 KV cache to claw back VRAM:

python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 16384 --max-num-seqs 4 \
  --gpu-memory-utilization 0.95

Key choices: awq_marlin picks the optimised Ada kernel; kv-cache-dtype fp8 halves KV memory at no measurable accuracy cost on Ada; max-num-seqs 4 matches the achievable concurrency given KV constraints; gpu-memory-utilization 0.95 leaves just enough headroom for CUDA context. For larger batches drop max-model-len to 8192.
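
Once the server is up, a quick smoke test through the OpenAI-compatible endpoint confirms the config took effect (a minimal sketch using the openai Python client; the localhost URL is vLLM's default bind and the API key is a placeholder):

# Smoke test against vLLM's OpenAI-compatible server (default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Reply with the word 'ready'."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)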

Monthly fixed cost

| Line item | Cost |
|---|---|
| 4090 dedicated server | £500-650 |
| Egress (1 Gbps unmetered) | £0 |
| Storage (2 TB NVMe) | included |
| IPv4 + remote hands | included |
| Monitoring (Prometheus + Grafana on host) | £0 (self-hosted) |
| Engineer time (1 hr/week) | ~£350/month |
| Total cash outlay | ~£500-650 (~$640-820) |
| Total loaded cost | ~£850-1,000 (~$1,090-1,280) |

For modelling we will use $700/month cash outlay and $1,150/month loaded cost as midpoints.

Throughput in production

vLLM 0.6 with continuous batching produces these numbers on the 4090:

| Concurrent streams | Per-stream t/s | Aggregate t/s | TTFT (median) |
|---|---|---|---|
| 1 | 24 | 24 | 110 ms |
| 2 | 22 | 44 | 140 ms |
| 4 | 18 | 72 | 220 ms |
| 6 | 15 | 90 | 320 ms |
| 8 | 12 | 96 | 500 ms |

Batch 4-6 is the sweet spot, around 72-90 aggregate t/s. We model on 80 t/s sustained for the rest of the article. For raw concurrency see concurrent users.
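
Every volume and $/M figure below falls out of one small model (a sketch; 80 t/s sustained and the $700/$1,150 midpoints are the assumptions chosen above, so expect small rounding differences against the tables):

# Capacity and $/M model for a single 4090 at 80 t/s sustained.
SUSTAINED_TPS = 80
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30
CASH_COST, LOADED_COST = 700, 1_150    # $/month midpoints from above

def tokens_per_month(utilisation: float) -> float:
    return SUSTAINED_TPS * SECONDS_PER_DAY * DAYS_PER_MONTH * utilisation

def cost_per_million(utilisation: float, monthly_cost: float) -> float:
    return monthly_cost / (tokens_per_month(utilisation) / 1e6)

for util in (0.2, 0.5, 0.7, 0.9, 1.0):
    print(f"{util:>4.0%}  {tokens_per_month(util) / 1e6:5.0f} M tok/mo  "
          f"${cost_per_million(util, CASH_COST):5.2f}/M cash  "
          f"${cost_per_million(util, LOADED_COST):5.2f}/M loaded")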

Tokens per month and volume tables

| Utilisation | Tokens/day | Tokens/month | Realistic workload |
|---|---|---|---|
| 20% (4-5 h/day busy) | 1.4 M | 42 M | Internal tool, small team |
| 50% (12 h/day busy) | 3.5 M | 105 M | SMB chat, business hours |
| 70% | 4.8 M | 145 M | Production B2B SaaS |
| 90% (sustained) | 6.2 M | 187 M | Continuous batch jobs |
| 100% (theoretical) | 6.9 M | 207 M | Capacity ceiling |

How long does it take a 4090 to serve different volume targets?

| Volume target | Wall clock at 80 t/s | Monthly utilisation needed |
|---|---|---|
| 10 M tokens | 1.4 days | 5% |
| 100 M tokens | 14.5 days | 48% |
| 1 B tokens | 145 days | Above capacity (need 5x 4090) |
| 10 B tokens | 1,450 days | Above capacity (need 50x 4090) |

The 70B model on a single 4090 hits its capacity ceiling around 200 M tokens/month. For higher volumes either run smaller models (Qwen 32B does 654 M/month, see Qwen 32B cost) or scale horizontally across multiple cards via multi-card pairing.
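
The wall-clock and card-count arithmetic above is the same model rearranged (a sketch; assumes linear scaling across cards, ignoring load-balancer overhead):

import math

SUSTAINED_TPS = 80
MONTHLY_CAP = SUSTAINED_TPS * 86_400 * 30   # ~207 M tokens/card/month

def wall_clock_days(tokens: float) -> float:
    return tokens / (SUSTAINED_TPS * 86_400)

def cards_needed(tokens_per_month: float) -> int:
    return math.ceil(tokens_per_month / MONTHLY_CAP)

print(f"100 M tokens: {wall_clock_days(100e6):.1f} days on one card")
print(f"1 B tokens/month: {cards_needed(1e9)} cards")   # 5, matching the table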

Effective $/M token and MAU break-even

| Utilisation | Tokens/month | Cost/M tokens (cash) | Cost/M tokens (loaded) |
|---|---|---|---|
| 20% | 42 M | $16.67 | $27.40 |
| 50% | 105 M | $6.67 | $10.95 |
| 70% | 145 M | $4.83 | $7.93 |
| 90% | 187 M | $3.74 | $6.15 |
| 100% | 207 M | $3.38 | $5.55 |

MAU break-even, assuming 25k tokens per active user per month (typical chat/RAG):

| Utilisation | Tokens/month | MAU served | Cost per MAU |
|---|---|---|---|
| 50% | 105 M | 4,200 | $0.17 |
| 70% | 145 M | 5,800 | $0.12 |
| 90% | 187 M | 7,500 | $0.09 |
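
In code the MAU arithmetic is one line each way (a sketch; the 25k tokens per user per month is the assumption stated above and varies widely by product):

TOKENS_PER_MAU = 25_000    # assumed monthly tokens per active user (chat/RAG)
CASH_COST = 700            # $/month midpoint

def mau_served(tokens_per_month: float) -> int:
    return int(tokens_per_month // TOKENS_PER_MAU)

def cost_per_mau(tokens_per_month: float) -> float:
    return CASH_COST / mau_served(tokens_per_month)

print(mau_served(105e6), f"${cost_per_mau(105e6):.2f}")   # 4200 $0.17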

12-month TCO vs hosted APIs

Hosted Llama 3.1 70B (Together, Fireworks, DeepInfra) averages $0.85-0.90 per million tokens blended. Compare 12-month TCO at three volume points:

| Volume / month | Hosted API 12-mo | 4090 dedicated 12-mo (cash) | Winner |
|---|---|---|---|
| 10 M tokens | $106 (Together) | $8,400 | Together by 80x |
| 50 M tokens | $528 (Together) | $8,400 | Together by 16x |
| 100 M tokens | $1,056 (Together) | $8,400 | Together by 8x |
| 200 M tokens (4090 cap) | $2,112 (Together) | $8,400 | Together by 4x |
| 500 M tokens | $5,280 (Together) | 3×4090 = $25,200 | Together by 5x |
| 100 M tokens vs GPT-4o $5/M | $6,000 (GPT-4o) | $8,400 | API by 1.4x |
| 200 M tokens vs GPT-4o $5/M | $12,000 (GPT-4o) | $8,400 | 4090 by 1.4x |
| 100 M tokens vs Claude Sonnet $7/M | $8,400 (Sonnet) | $8,400 | Tie |
| 200 M tokens vs Claude Sonnet $7/M | $16,800 (Sonnet) | $8,400 | 4090 by 2x |

The 70B-on-4090 economics work best when you compare against premium frontier APIs (GPT-4o, Sonnet, Opus) rather than against hosted Llama. For the cheapest possible per-token Llama, hosted Together is hard to beat; for everything else, dedicated wins.
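
The crossover rows in the TCO table reproduce directly, and the same constants give the break-even volume against any API price (a sketch; the API prices are the blended per-million figures quoted in this section):

import math

MONTHLY_CARD_COST = 700    # $ cash per 4090 per month
CARD_CAP_M = 207           # M tokens/month per card

def tco_12mo_api(price_per_m: float, volume_m: float) -> float:
    return price_per_m * volume_m * 12

def tco_12mo_dedicated(volume_m: float) -> float:
    return math.ceil(volume_m / CARD_CAP_M) * MONTHLY_CARD_COST * 12

def break_even_volume_m(api_price_per_m: float) -> float:
    # Volume at which one card's monthly cost equals the API bill.
    return MONTHLY_CARD_COST / api_price_per_m

print(tco_12mo_api(5.00, 200), tco_12mo_dedicated(200))        # 12000.0 vs 8400
print(f"{break_even_volume_m(5.00):.0f} M tok/mo vs GPT-4o")   # 140 M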

| Comparison | Hosted API | 4090 dedicated |
|---|---|---|
| $/M Llama 3 70B | $0.88 | $3.74-16.67 |
| $/M GPT-4o | $5.00 | $3.74 |
| $/M Claude Sonnet | $7.00 | $3.74 |
| $/M Claude Opus | $35.00 | $3.74 |
| Data residency | US/multi | UK |
| Privacy | provider sees prompts | your box |
| Custom fine-tunes | limited or none | any LoRA via QLoRA |
| Rate limits | per-tier caps | your hardware only |
| Latency from UK | ~250 ms TTFT | ~110 ms TTFT |

Production gotchas

  1. KV cache OOM on long prompts: a 70B with 16k context can consume 8+ GB of KV per stream. Cap max-model-len aggressively or you will OOM under burst.
  2. FP8 KV cache rounding: at very long contexts (>12k tokens) FP8 KV starts to drift on attention scores. For long-context legal/medical workloads use BF16 KV at the cost of half the concurrency.
  3. Together rate limits: hosted 70B has per-org QPS caps; if a traffic spike exceeds them, you get 429s. Self-hosted has no such cap until VRAM saturates.
  4. AWQ accuracy regression: INT4 loses 1-2 points on MMLU vs BF16. For most chat that is invisible; for hard reasoning evals at scale, consider FP8 deployment on a 5090 or H100 instead.
  5. vLLM updates breaking AWQ: vLLM is on a fast release cadence and AWQ support has had churn. Pin vllm==0.6.3 and test before upgrading.
  6. Cold-start cost: 70B AWQ takes ~90 seconds to load. Persistence mode and a warm-pool process matter; do not cycle the server casually (see the readiness sketch after this list).
  7. Power and thermals at 4 streams sustained: 430-440 W draw for hours; verify cooling before committing to multi-day runs.
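
For the cold-start point in item 6, gate traffic behind a readiness check after any restart (a sketch; vLLM's OpenAI server exposes a /health endpoint, and the URL and timeouts here are assumptions for a default local bind):

# Poll vLLM's /health endpoint before admitting traffic after a restart.
import time
import urllib.request

HEALTH_URL = "http://localhost:8000/health"

def wait_until_ready(timeout_s: int = 300, poll_s: int = 5) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass          # server still loading the 70B weights
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out")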

Verdict

Self-hosted Llama 3.1 70B AWQ on a single 4090 lands at $3.74-6.67 per million tokens (cash) across 50-90% utilisation. At 70% and above that beats GPT-4o ($5/M) and Claude Sonnet ($7/M) on cash $/M, and it stays significantly cheaper for frontier-quality output than per-token APIs once you exceed roughly 100-150 M tokens per month. For the cheapest possible Llama, hosted Together is unbeatable; for privacy, UK residency, custom adapters and predictable bills, the dedicated 4090 wins. Cross-check the case with the vs OpenAI and vs Anthropic comparisons and the break-even calculator.

Self-host Llama 3.1 70B in the UK

AWQ INT4 on a single dedicated 4090, on UK dedicated hosting.

Order the RTX 4090 24GB

See also: 70B INT4 deployment, 70B INT4 benchmark, AWQ guide, vs OpenAI, vs Anthropic, vs Together AI, break-even, monthly hosting cost, 4090 for Llama 70B.
