Headline GPU prices are the easy part. The harder, more honest comparison is twelve-month total cost of ownership for a real workload, including bandwidth, storage, the engineering hours each option actually consumes, and the capacity ceiling each path implies. This article walks through that comparison for an RTX 4090 24GB dedicated server against cloud GPU rentals and hosted APIs at three reference workload sizes (200 M, 1 B and 5 B tokens/month). A wider hardware menu is available on dedicated GPU hosting.
Contents
- Three reference workloads
- Three deployment options
- 12-month compute and infrastructure cost
- Bandwidth, storage and the cloud surprise
- Engineer-time costs nobody tracks
- Hidden costs and contingencies
- Capacity ceilings and scaling triggers
- Verdict by workload size
Three reference workloads
Real TCO depends on volume. We compare across three concrete shapes, each modelled on production deployments we have seen.
| Workload | Volume/month | Self-host model | 4090 utilisation | Typical product |
|---|---|---|---|---|
| A. Busy SMB chat or RAG | 200 M tok | Llama 3 8B FP8 | ~7% | Support assistant, internal tool |
| B. Established SaaS | 1 B tok | Qwen 14B AWQ | ~60% | Vertical assistant, doc workflow |
| C. Heavy traffic SaaS | 5 B tok | requires 2-3 cards | 175-200% | Coding assistant, agent platform |
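The utilisation column divides monthly volume by a single card's monthly capacity at an assumed sustained throughput. A minimal sketch of that arithmetic, with the tok/s figures as illustrative assumptions rather than benchmarks:

```python
# Utilisation = monthly token volume / single-card monthly capacity.
# Throughput figures are illustrative assumptions, not measured benchmarks.
SECONDS_PER_MONTH = 730 * 3600  # 2,628,000 s

def utilisation(tokens_per_month: float, tok_per_s: float) -> float:
    capacity = tok_per_s * SECONDS_PER_MONTH  # tokens/month at 100% duty cycle
    return tokens_per_month / capacity

print(f"A: {utilisation(200e6, 1080):.0%}")  # Llama 3 8B FP8 at ~1,080 tok/s -> ~7%
print(f"B: {utilisation(1e9, 630):.0%}")     # Qwen 14B AWQ at ~630 tok/s -> ~60%
```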
Three deployment options
| Option | Hardware/Service | Pricing model | Notes |
|---|---|---|---|
| A. Dedicated | GigaGPU 4090 24 GB | ~£550 ($700)/mo flat | Bandwidth, storage, IPv4 included |
| B. Cloud GPU | AWS g6.4xlarge (L4 24 GB) | $1.32/h on-demand | L4 is slower than 4090; everything metered |
| C. Hosted API | OpenAI GPT-4o blended | $5/M tokens | Linear with volume; no infra to operate |
12-month compute and infrastructure cost
Workload A: 200 M tokens/month
| Line item | Dedicated 4090 | AWS g6.4xlarge (always-on) | OpenAI GPT-4o |
|---|---|---|---|
| Compute | $8,400 | $11,563 (730 h × 12 × $1.32/h) | $12,000 |
| Storage 2 TB NVMe | included | $2,400 | n/a |
| Egress 5 TB/mo | included (1 Gbps unmetered) | $5,400 | n/a |
| Static IPv4 | included | $144 | n/a |
| Subtotal infrastructure | $8,400 | $19,507 | $12,000 |
Workload B: 1 B tokens/month
| Line item | Dedicated 4090 | AWS g6.4xlarge x2 | OpenAI GPT-4o |
|---|---|---|---|
| Compute | $8,400 | $23,126 | $60,000 |
| Storage / egress / IP | included | $15,888 | n/a |
| Subtotal | $8,400 | $39,014 | $60,000 |
Workload C: 5 B tokens/month
| Line item | 2x Dedicated 4090 | AWS g6.4xlarge x6 | OpenAI GPT-4o |
|---|---|---|---|
| Compute | $16,800 | $69,378 | $300,000 |
| Storage / egress / IP | included | $47,664 | n/a |
| Subtotal | $16,800 | $117,042 | $300,000 |
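The subtotals fall out of a simple per-option model: a flat monthly price for dedicated, metered hourly compute plus storage, egress and IPv4 for AWS, and a flat per-token rate for the API. A minimal sketch that reproduces the three tables (rates as quoted above; the function names are ours):

```python
# 12-month infrastructure cost per option; rates as quoted in the tables above.
def dedicated(cards: int) -> float:
    return cards * 700 * 12                    # ~$700/mo flat, all-inclusive

def aws_g6(instances: int) -> float:
    compute = instances * 1.32 * 730 * 12      # on-demand, always-on
    storage = instances * 2400                 # 2 TB EBS, $/yr
    egress  = instances * 5400                 # 5 TB/mo at $0.09/GB, $/yr
    ipv4    = instances * 144                  # static IPv4, $/yr
    return compute + storage + egress + ipv4

def openai(tokens_per_month: float) -> float:
    return tokens_per_month / 1e6 * 5.00 * 12  # $5/M blended

print(dedicated(1), aws_g6(1), openai(200e6))  # A: 8400, 19507.2, 12000
print(dedicated(1), aws_g6(2), openai(1e9))    # B: 8400, 39014.4, 60000
print(dedicated(2), aws_g6(6), openai(5e9))    # C: 16800, 117043.2, 300000
```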
Bandwidth, storage and the cloud surprise
The dedicated 4090 includes 1 Gbps unmetered bandwidth (around 320 TB/month theoretical), 2 TB NVMe and a static IPv4. Cloud equivalents charge per gigabyte for everything: AWS egress alone runs $0.09/GB after the free tier, so a media-heavy workload streaming back FLUX or SDXL outputs will see $400-700/month in egress. APIs ship JSON, so bandwidth is small, but you pay per token regardless of cache locality. For an LLM-only workload, dedicated bandwidth is a “no thinking required” line item; for media-heavy workloads it can dominate.
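The media-egress estimate is easy to reproduce. A rough calculation, with the output count and per-image size as illustrative assumptions:

```python
# Rough monthly egress bill for a media workload on AWS, $0.09/GB after free tier.
EGRESS_PER_GB = 0.09

def monthly_egress_cost(images_per_month: int, mb_per_image: float) -> float:
    gb = images_per_month * mb_per_image / 1024
    return gb * EGRESS_PER_GB

# e.g. 1.5M SDXL/FLUX outputs/mo at ~3.5 MB each -> ~5 TB -> ~$460/mo
print(f"${monthly_egress_cost(1_500_000, 3.5):,.0f}/mo")
```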
Hidden infrastructure surcharges
- AWS NAT gateway: ~$45/month plus $0.045/GB processed if your GPU sits in a private subnet.
- EBS snapshots and backups: $0.05/GB/month for any reasonable backup policy.
- CloudWatch logs and metrics: easy to add $50-200/month per active service.
- Reserved instance commitment: a 1-year RI cuts compute ~30%, but you commit to the spend. A rough tally of the recurring surcharges is sketched below.
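Stacked together, these surcharges quietly add a few hundred dollars a month before any RI discount. A rough tally, with the traffic and volume figures as illustrative assumptions:

```python
# Monthly AWS surcharges that rarely make the initial spreadsheet.
# Traffic and volume figures are illustrative assumptions.
nat_gateway   = 45 + 0.045 * 1000  # base + $0.045/GB over ~1 TB through NAT
ebs_snapshots = 0.05 * 500         # ~500 GB of snapshot storage
cloudwatch    = 125                # midpoint of the $50-200 range above

print(f"~${nat_gateway + ebs_snapshots + cloudwatch:,.0f}/mo")  # ~$240/mo
```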
Engineer-time costs nobody tracks
Engineer time is the line every TCO analysis forgets. We use £400/day blended (a £100k/year senior with overhead, at a typical UK loaded cost). The activity estimates are conservative, based on what a competent infra engineer actually spends, not the optimistic LinkedIn version.
| Activity | Dedicated 4090 | Cloud GPU (AWS) | Hosted API |
|---|---|---|---|
| Initial setup (one-off) | 1 day (image, vLLM, monitor) | 3 days (AMI, IaC, autoscaling, IAM) | 0.5 days (key, SDK, gateway) |
| Ongoing ops/year | 6 days (upgrades, model swaps) | 15 days (cost firefighting, spot evictions, AMI rebuilds) | 3 days (rate-limit handling, model updates) |
| Cost firefighting / finops | 0 | +£3,000 (alerts, RI optimisation, cost reviews) | +£800 (rate-limit handling, billing surprise reviews) |
| Total engineer-time year 1 | ~£2,800 (7 days) | ~£10,200 (18 days + firefight) | ~£2,200 (3.5 days + firefight) |
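The year-one totals are simply days times the day rate plus the firefighting line. A quick check against the table, using the £400/day figure:

```python
# Year-one engineer-time cost = (setup days + ongoing days) * day rate + firefighting.
DAY_RATE_GBP = 400  # blended senior rate, as above

def engineer_cost(setup_days: float, ops_days: float, firefight_gbp: float) -> float:
    return (setup_days + ops_days) * DAY_RATE_GBP + firefight_gbp

print(engineer_cost(1, 6, 0))      # dedicated: 2800
print(engineer_cost(3, 15, 3000))  # cloud GPU: 10200
print(engineer_cost(0.5, 3, 800))  # hosted API: 2200.0
```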
The cloud-GPU number includes the costs nobody puts on a spreadsheet: re-baking AMIs after CUDA updates, debugging spot evictions at 02:00, finops calls about why the bill spiked. The dedicated number is honest because there is genuinely less to operate: one box, one OS, one inference server. The hosted-API number is low until you hit a rate limit at scale, at which point negotiating capacity with sales and re-architecting around quotas eats real time.
Hidden costs and contingencies
- Spot eviction risk on cloud GPU: 30-60% cheaper than on-demand but interrupts inference; not viable for production traffic, only for batch fine-tunes.
- API rate limits at scale: above 1 B tokens/month on OpenAI you negotiate quota with sales; takes 2-6 weeks and may require committed spend.
- Model deprecation on hosted APIs: GPT-3.5 to GPT-4 to GPT-4o migrations have each cost teams 2-5 days of prompt re-tuning. Self-hosted has no forced migrations.
- Data residency penalties: hosted APIs in non-EU regions can void GDPR compliance; consultancy and legal cost can dwarf compute.
- Quality regression on hosted-API silent updates: hosted models change behind your back; self-hosted is pinned to a SHA you control.
- Capacity ceiling on dedicated: there is one; once you hit it you add another card. The marginal token is free until the cap.
Capacity ceilings and scaling triggers
| Option | Tokens/month at this cost | Cost per extra M tokens | Scaling friction |
|---|---|---|---|
| Dedicated 4090 (8B FP8) | up to ~2.85 B (cap) | $0 until cap, then add another £550 box | Low: order, provision, mirror config |
| Dedicated 4090 (70B AWQ) | up to ~187 M (cap) | $0 until cap, then add another £550 box | Low |
| AWS L4 g6.4xlarge | scales with hours | ~$2.16 per M (slower than 4090) | Medium: autoscale config, spot risk |
| OpenAI GPT-4o | linear, capped by quota | $5.00 per M | High at scale: quota negotiation |
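Both the metered per-M figure and the dedicated cap fall out of assumed sustained throughput; the tok/s numbers below are illustrative assumptions, not benchmarks:

```python
# Metered cloud: cost per extra million tokens = hours needed * hourly rate.
def cost_per_m_tokens(hourly_rate: float, tok_per_s: float) -> float:
    return hourly_rate * 1e6 / (tok_per_s * 3600)

print(f"${cost_per_m_tokens(1.32, 170):.2f}/M")  # L4 at ~170 tok/s -> ~$2.16/M

# Dedicated cap: sustained throughput * seconds in a 730 h month.
print(f"{1080 * 730 * 3600 / 1e9:.2f} B tok/mo")  # 8B FP8 at ~1,080 tok/s -> ~2.84 B
```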
Dedicated hosting has a capacity cap, but until you hit it the marginal token is genuinely free. Cloud GPU and API both scale linearly: every extra million tokens costs the same as the first. For a growing workload the economics compound in dedicated's favour: the first 4090 amortises faster the more you use it, and adding a second card doubles capacity for +£550/month, far cheaper than a doubled API bill. See when to upgrade and the 5090 decision.
Verdict by workload size
Total 12-month TCO including infrastructure plus engineer time:
| Workload | Dedicated 4090 | AWS L4 cloud | OpenAI GPT-4o | Best option |
|---|---|---|---|---|
| A. 200 M tok/mo | $8,400 + £2,800 = ~$11,950 | $19,507 + £10,200 = ~$32,800 | $12,000 + £2,200 = ~$14,800 | Dedicated narrow win; API close second |
| B. 1 B tok/mo | $8,400 + £2,800 = ~$11,950 | $39,014 + £10,200 = ~$52,300 | $60,000 + £2,200 = ~$62,800 | Dedicated, by 5x |
| C. 5 B tok/mo | $16,800 + £4,200 = ~$22,200 | $117,042 + £15,000 = ~$135,000 | $300,000 + £3,000 = ~$303,800 | Dedicated, by 14x |
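The blended totals mix a dollar-denominated infrastructure bill with sterling engineer time. A sketch of the conversion, with the fx rate as an illustrative assumption:

```python
# Blended 12-month TCO: USD infrastructure + GBP engineer time at an assumed rate.
GBP_USD = 1.27  # illustrative fx assumption; the table rounds to the nearest ~$50-100

def total_tco(infra_usd: float, engineer_gbp: float) -> float:
    return infra_usd + engineer_gbp * GBP_USD

print(f"~${total_tco(8_400, 2_800):,.0f}")   # A, dedicated: ~$11,956 -> ~$11,950
print(f"~${total_tco(60_000, 2_200):,.0f}")  # B, OpenAI: ~$62,794 -> ~$62,800
```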
| Monthly volume | Best option | Why |
|---|---|---|
| 0-50 M tokens | OpenAI or Anthropic API | Below break-even; infra overhead unjustified |
| 50-150 M tokens | API or dedicated, close call | Choose by privacy, latency, model quality, not pure cost |
| 150-500 M tokens | Dedicated 4090 | Clear cost win; one box; predictable monthly |
| 500 M-1.5 B tokens | Dedicated 4090 with Qwen 14B/32B | Single 4090 still inside cap |
| 1.5-3 B tokens | 2x dedicated 4090 | Linear scale at half the cost of API |
| 3 B+ tokens | Multiple 4090s or 5090 | Move to a denser deployment; see 4090 vs 5090 |
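The volume bands collapse into a simple threshold function. A sketch encoding the table above (thresholds in millions of tokens/month):

```python
# Recommended deployment by monthly volume, encoding the bands above.
def recommend(m_tokens_per_month: float) -> str:
    if m_tokens_per_month < 50:
        return "hosted API: below break-even"
    if m_tokens_per_month < 150:
        return "API or dedicated: decide on privacy, latency, model quality"
    if m_tokens_per_month < 1500:
        return "single dedicated 4090"
    if m_tokens_per_month < 3000:
        return "2x dedicated 4090"
    return "multiple 4090s, or evaluate a 5090"

print(recommend(200))   # single dedicated 4090
print(recommend(2000))  # 2x dedicated 4090
```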
Verdict
For Workload A (busy SMB) the dedicated 4090 narrowly beats the API on cost; the deciding factor is usually privacy or latency, not the line-item delta. For Workload B (established SaaS at 1 B tokens/month) dedicated wins by 5x at GPT-4o-equivalent quality. For Workload C (heavy SaaS at 5 B tokens/month) dedicated wins by 14x against the API and by roughly 6x against cloud GPU. Cloud GPU loses everywhere unless it is running on free credits: the L4 is slower than the 4090, and AWS metering compounds against you. For the formula behind the line items see the break-even calculator; for monthly cost detail see monthly hosting cost.
Predictable 12-month TCO, one flat invoice
No egress meter, no spot eviction, no quota negotiation. UK dedicated hosting.
Order the RTX 4090 24GB
See also: monthly cost, break-even calculator, vs RunPod, vs Lambda Labs, vs OpenAI, vs Anthropic, 70B monthly cost.