"Should I use serverless GPUs or rent a dedicated server?" is the most common architectural question we get from teams launching their first AI product. The honest answer is: it depends on your traffic shape, latency budget, and how much engineering you want to spend on cold-start mitigation. Here’s the math.
Serverless wins when traffic is bursty (below roughly 20% utilisation), workloads tolerate 5–60 second cold starts, and you don’t want to manage infrastructure. Dedicated wins when traffic is steady (above roughly 20% utilisation), latency must be sub-second on every request, you need data residency, or you’re hitting per-token API costs above £1,500/mo. At current flagship prices, break-even is roughly 150 hours of GPU-time per month, about 5 hours a day.
Definitions: what each one actually is
Serverless GPU
You define a function (or container) that runs a model. The platform spins up a GPU container when a request arrives, runs the inference, and tears it down after a few seconds of idle. Pricing is per-second-of-execution. Examples: RunPod Serverless, Modal, Replicate, AWS Inferentia (sort of), Banana, Cerebrium.
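To make that concrete, here is a rough sketch of a serverless worker following RunPod’s Python handler pattern; the model, input schema, and generation settings are illustrative, not a recommendation. The key point is that the weights load at import time, which is exactly what you pay for on a cold start.

```python
# Minimal serverless worker sketch (RunPod-style handler; model and schema illustrative).
import runpod
from transformers import pipeline

# Loaded once per container, at import time -- this is the bulk of the cold-start penalty.
generate = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

def handler(job):
    # The platform invokes this per request and bills only the execution seconds.
    prompt = job["input"]["prompt"]
    out = generate(prompt, max_new_tokens=256, return_full_text=False)
    return {"text": out[0]["generated_text"]}

runpod.serverless.start({"handler": handler})
```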
Dedicated GPU
You rent a physical server with one or more GPUs by the month. The server is yours 24/7. Pricing is fixed-monthly. Examples: GigaGPU, RunPod Dedicated Pods, Lambda Reserved, Hetzner.
The hidden third option: hosted APIs
OpenAI, Anthropic, Together, Fireworks. Per-token billing. Not the same shape as either of the above but worth keeping in mind for the cost math.
Cost models compared
For a Mistral 7B FP16 deployment serving roughly 100K requests/day, ~256 tokens out per request:
| Option | Pricing model | Effective monthly cost | Notes |
|---|---|---|---|
| Serverless (RunPod RTX 5090) | $0.00097/s | ~£700/mo | Assumes 95% cold-start avoidance |
| Serverless (Modal H100) | $0.0005/s | ~£900/mo | Higher throughput per request |
| Dedicated RTX 5090 | Flat | £399/mo | Same price at any utilisation, including 100% |
| Hosted API (OpenAI gpt-4o-mini) | $0.15/1M in + $0.60/1M out | ~£600/mo | Per-token at moderate volume |
| Hosted API (Together Llama 3 70B) | $0.88/1M tokens | ~£420/mo | For 70B-class queries |
The serverless and hosted-API numbers vary wildly with utilisation. The dedicated number doesn’t.
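If you want to sanity-check the hosted-API rows yourself, the per-token arithmetic is short. The sketch below assumes roughly 700 input tokens per request (an illustrative guess, since only the output length is fixed above) alongside the 256 output tokens and 100K requests/day from the scenario.

```python
# Back-of-envelope check for the gpt-4o-mini row above.
# Assumption (ours, for illustration): ~700 input tokens per request.
REQUESTS_PER_DAY = 100_000
TOKENS_IN, TOKENS_OUT = 700, 256           # tokens per request
PRICE_IN, PRICE_OUT = 0.15, 0.60           # USD per 1M tokens
USD_PER_GBP = 1.29                         # assumed exchange rate

monthly_requests = REQUESTS_PER_DAY * 30
usd = (monthly_requests * TOKENS_IN / 1e6) * PRICE_IN \
    + (monthly_requests * TOKENS_OUT / 1e6) * PRICE_OUT
print(f"~${usd:,.0f}/mo (~£{usd / USD_PER_GBP:,.0f}/mo)")
# -> ~$776/mo (~£601/mo), in line with the ~£600 row in the table
```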
Latency: cold start vs warm start
Dedicated GPU latency is whatever your model and serving framework give you: usually 50–200 ms time to first token (TTFT) for a 7B-class model on Blackwell. No cold start, ever.
Serverless GPU latency is two numbers:
- Warm requests: similar to dedicated. Maybe +20 ms for routing.
- Cold requests: container spin-up + model load + first token. Typical: 5–60 seconds.
You can mitigate cold starts with:
- Provisioned concurrency (keep N containers warm) — but that’s just dedicated GPU with extra steps
- Faster container images (drop down to slim base, pre-load weights)
- Smaller models (Phi-3 Mini cold-loads faster than Llama 70B)
- Predictive warming (spin up containers based on traffic forecast)
For latency-sensitive interactive workloads (chatbots that need sub-second response), cold-start mitigation costs add up fast. We’ve seen teams spend more on idle warm containers than they would on a dedicated server.
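A quick worked example of that last point, using the serverless rate from the cost table above (the warm-pool size and exchange rate are assumptions for illustration):

```python
# Cost of keeping N serverless containers permanently warm vs one dedicated box.
PER_SECOND_USD = 0.00097        # RunPod Serverless RTX 5090, from the table above
USD_PER_GBP = 1.29              # assumed exchange rate
SECONDS_PER_MONTH = 30 * 86_400
DEDICATED_GBP = 399             # dedicated RTX 5090, flat monthly

warm_pool = 2                   # hypothetical: two containers kept warm around the clock
warm_cost_gbp = warm_pool * SECONDS_PER_MONTH * PER_SECOND_USD / USD_PER_GBP
print(f"Always-warm pool: ~£{warm_cost_gbp:,.0f}/mo vs dedicated: £{DEDICATED_GBP}/mo")
# -> ~£3,898/mo for two always-warm containers vs £399/mo for the dedicated server
```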
The break-even calculation
Rough rule: if you’d run the GPU at more than about 20% steady-state utilisation, dedicated is cheaper. The exact break-even for a dedicated RTX 5090 (£399/mo) vs RunPod Serverless RTX 5090 ($0.00097/s):
- Monthly seconds in dedicated: 30 × 86,400 = 2,592,000 s
- Cost on serverless at 100% util: 2,592,000 × $0.00097 = $2,514 ≈ £1,950
- Dedicated wins from roughly 20% utilisation upward (£399 / £1,950 ≈ 20%)
That’s roughly 5 hours of GPU-time per day, or about 150 hours a month. If your workload uses the GPU more than that, dedicated is dramatically cheaper.
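The same arithmetic as a reusable snippet, so you can drop in your own rates (the exchange rate is an assumption):

```python
# Break-even utilisation for dedicated vs serverless, using the rates above.
PER_SECOND_USD = 0.00097           # RunPod Serverless RTX 5090
DEDICATED_GBP = 399                # dedicated RTX 5090, flat monthly
USD_PER_GBP = 1.29                 # assumed exchange rate
SECONDS_PER_MONTH = 30 * 86_400    # 2,592,000 s

serverless_full_gbp = SECONDS_PER_MONTH * PER_SECOND_USD / USD_PER_GBP
breakeven_util = DEDICATED_GBP / serverless_full_gbp
print(f"Serverless at 100% utilisation: ~£{serverless_full_gbp:,.0f}/mo")
print(f"Break-even: {breakeven_util:.0%} utilisation "
      f"(~{breakeven_util * 24:.1f} GPU-hours/day)")
# -> ~£1,949/mo at 100% utilisation; break-even ~20%, about 4.9 GPU-hours/day
```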
Which workloads fit each model
Where dedicated wins
- Steady inference traffic (chatbots, RAG, embeddings) — dedicated wins on cost.
- Latency-sensitive interactive (voice agents, real-time copilots) — dedicated wins on latency.
- Data residency / compliance (healthcare, finance, government) — dedicated wins on control. See private AI hosting.
- Long-running batch (eval suites, dataset embedding) — dedicated wins on cost-per-job.
- Fine-tuning — dedicated almost always. Serverless billing for hours-long jobs is brutal.
- High-throughput single-server — once you saturate the card, dedicated is more cost-efficient.
Where serverless wins
- Bursty / spiky traffic (image generation API with unpredictable peaks) — serverless wins.
- Multi-model fan-out (50+ different models, low traffic each) — serverless wins.
- Edge cases / experimentation (try a model for 2 hours, throw it away) — serverless wins.
- Geographic scale (need GPUs in 10 regions) — serverless makes more sense than 10 dedicated rentals.
- You don’t want ops — serverless trades cost for operational simplicity, and that’s a legitimate trade.
The hybrid pattern that works
The pattern we see most often in customers who graduate from serverless to dedicated:
- Dedicated for the steady core — your main chatbot, your embedding pipeline, your fine-tuned models. Predictable cost, low latency.
- Serverless for the spiky long tail — image generation triggered by user upload, occasional batch jobs, experimental models.
- Hosted APIs for genuinely intermittent — "summarise this once a week" doesn’t justify either dedicated or serverless infrastructure.
LiteLLM (or your own router) lets you fan out by route — /api/chat goes to the dedicated server, /api/image goes to RunPod Serverless, /api/summarise-weekly-report goes to Anthropic.
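A minimal hand-rolled sketch of that fan-out, using OpenAI-compatible clients pointed at different backends (hostnames, model names, and keys here are placeholders; LiteLLM’s proxy gets you the same result from a config file):

```python
# Route-by-workload sketch: steady chat traffic goes to the dedicated box,
# intermittent work goes to a hosted API. All endpoints and names are placeholders.
from openai import OpenAI

BACKENDS = {
    # Dedicated server running vLLM/TGI, which expose an OpenAI-compatible API
    "chat": OpenAI(base_url="http://dedicated-gpu.internal:8000/v1", api_key="local"),
    # Hosted API for the genuinely intermittent work
    "summary": OpenAI(api_key="sk-placeholder"),
}

def complete(route: str, model: str, prompt: str) -> str:
    client = BACKENDS[route]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# /api/chat -> dedicated server; /api/summarise-weekly-report -> hosted API
print(complete("chat", "mistral-7b-instruct", "Hello!"))
```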
Bottom line
If your workload uses a GPU more than about 5 hours a day on average, dedicated is cheaper. If it’s spikier than that or you’re still pre-product-market-fit, serverless or hosted APIs are the right starting point. Most teams start serverless, hit their first £1,500 monthly bill, and move the steady traffic to dedicated. There’s no shame in that path — just don’t wait too long.
If you’re evaluating: see our RunPod alternatives guide and cost per 1M tokens calculator for current pricing.