Table of Contents
- What “Cloud” and “Dedicated” Actually Mean
- Pricing Per Hour: Real 2026 Numbers
- The Hidden Costs Cloud Bills Don’t Show
- Break-Even Maths: When Dedicated Wins
- Use Case Decision Matrix
- Performance: Noisy Neighbours vs Predictable Iron
- Compliance, Data Residency, and the GDPR Question
- Migration Path: Prototype on Cloud, Scale on Dedicated
- Verdict
The “cloud versus dedicated” debate has been going on for fifteen years in CPU land, but GPUs make the maths sharper. A general-purpose VM idling at 5% CPU is a rounding error on your bill. A GPU instance idling at 5% utilisation is haemorrhaging money — somewhere between £0.30 and £30 per hour, depending on which provider you picked. That single fact is what makes 2026’s procurement decision different from any other infrastructure call you’ll make this year.
This post compares GPU cloud (hourly, on-demand, multi-tenant) against dedicated GPU servers (monthly, single-tenant, real silicon you reserve outright). We’ll show real numbers from AWS, GCP, RunPod, Vast.ai, Lambda Labs, and our own UK-based dedicated GPU hosting. We’ll be honest about where cloud wins. And we’ll give you a break-even calculator so you can settle the argument with a spreadsheet, not a sales pitch.
What “Cloud” and “Dedicated” Actually Mean
Both labels get abused, so let’s pin them down before the comparison starts.
GPU cloud is the hyperscaler and neo-cloud model. You request a GPU instance through an API or console, you get a VM (or container) running on hardware you don’t see, you’re billed per second or per hour, and when you stop the instance, the GPU returns to a shared pool that other tenants will draw from. Examples: AWS EC2 P4/P5 instances, GCP A2/A3, Azure ND-series, RunPod, Vast.ai, Lambda Labs on-demand, Coreweave on-demand, Paperspace.
Dedicated GPU server is the colocation and managed-hosting model. You commit to a specific physical machine — typically for a month minimum — with a real GPU bolted into it. The hardware is yours for the duration. Nobody else schedules workloads on it. You pay a flat monthly fee whether the GPU sits at 0% or 99% utilisation. Examples: gigagpu, Hetzner GPU, OVH BareMetal GPU, Scaleway Elastic Metal GPU, LeaseWeb.
There’s a hybrid: reserved cloud capacity (AWS Savings Plans, GCP CUDs, Lambda 1-year reservations). You commit to 1 or 3 years of cloud usage in exchange for a 30-60% discount. The pricing economics start to resemble dedicated, but you still inherit the cloud’s noisy-neighbour and egress problems. We’ll treat reserved cloud as a separate row in the tables.
Pricing Per Hour: Real 2026 Numbers
Here are the rates as of early 2026, normalised to per-hour costs. Cloud prices are taken from public list pages (UK or EU regions where available); dedicated prices are converted from monthly rates assuming 730 hours per month.
| Provider / instance | GPU | Pricing model | Per-hour cost |
|---|---|---|---|
| AWS p4d.24xlarge | 8× A100 40GB | On-demand | ~$32.77/hr (~$4.10/GPU) |
| AWS p4d.24xlarge | 8× A100 40GB | 1-yr reserved | ~$19.20/hr (~$2.40/GPU) |
| AWS p5.48xlarge | 8× H100 80GB | On-demand | ~$98.32/hr (~$12.29/GPU) |
| GCP a2-highgpu-1g | 1× A100 40GB | On-demand | ~$3.67/hr |
| GCP a3-highgpu-1g | 1× H100 80GB | On-demand | ~$11.06/hr |
| Lambda Labs | 1× A100 80GB | On-demand | ~$1.29/hr |
| Lambda Labs | 1× H100 80GB | On-demand | ~$2.49/hr |
| RunPod (community) | 1× RTX 4090 24GB | Spot-style | ~$0.34/hr |
| RunPod (secure) | 1× RTX 4090 24GB | On-demand | ~$0.69/hr |
| Vast.ai | 1× RTX 4090 24GB | Marketplace | ~$0.30-0.50/hr |
| gigagpu | 1× RTX 4090 24GB | Dedicated monthly | ~£0.19/hr equiv |
| gigagpu | 1× RTX 5060 Ti 16GB | Dedicated monthly | ~£0.10/hr equiv |
| gigagpu | 1× RTX 3090 24GB | Dedicated monthly | ~£0.14/hr equiv |
Two observations jump out. First, the consumer-class spot marketplaces (RunPod community, Vast.ai) are genuinely cheap per hour — cheaper than dedicated on a raw rate, if you can tolerate interruptions. Second, the hyperscalers (AWS, GCP) charge a premium of roughly 3-5× over the neo-clouds for equivalent silicon, mostly because you’re paying for the surrounding ecosystem (IAM, VPC, managed services), not the GPU itself.
For a deeper look at how these rates translate into per-token economics, see cost per 1M tokens: GPU vs OpenAI API and our TCO comparison of dedicated GPU vs cloud rental.
The Hidden Costs Cloud Bills Don’t Show
The per-hour GPU rate is only one line on a cloud invoice. The bill includes a stack of supporting services that look small individually and ruinous in aggregate. This is the section every cloud calculator quietly omits.
| Hidden cost | Typical rate (AWS) | What it actually means |
|---|---|---|
| Egress bandwidth | $0.09/GB outbound | Pulling a 100GB Llama checkpoint to your laptop = $9. Streaming responses to 10k users/day at ~4MB each ≈ 1.2TB/month ≈ $108/month. |
| EBS persistent storage | $0.08/GB-month (gp3) | 500GB workspace stays billable when the GPU instance is stopped. ~$40/month for an idle disk. |
| Snapshot retention | $0.05/GB-month | Weekly checkpoints of a 200GB workspace add ~$50/month silently. |
| Cross-AZ data transfer | $0.01-0.02/GB | Multi-AZ training with sharded datasets can quietly add hundreds per month. |
| NAT Gateway | $0.045/hr + $0.045/GB | Required for private-subnet GPU instances pulling pip packages or model weights. |
| Spot interruption recovery | Engineering time | You lose state mid-training. Add checkpointing complexity, retry logic, and lost compute time. |
| Idle GPU billing | Full hourly rate | Forgot to stop a $32/hr p4d over the weekend? That’s $1,536. Real story for many teams. |
Dedicated hosting collapses most of these to zero. At gigagpu, monthly pricing includes generous bandwidth allowances on our UK colocation network, persistent local NVMe storage at no extra charge, and no egress fees for normal workloads. There are no spot interruptions because nobody is competing with you for the GPU. You can leave it idle for a week if you want — the bill is the same.
This is also why AWS bills are notoriously hard to predict. A team that estimates $2,000/month based on the GPU sticker price routinely sees $3,500 once egress, EBS, snapshots, and NAT are added. We’ve spoken to engineering leads who discovered their actual cloud spend was 70% non-GPU services.
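If you want to sanity-check your own estimate before the invoice arrives, here is a minimal sketch of that arithmetic in Python, using the illustrative AWS list rates from the table above. Every rate and the example workload are assumptions to replace with your own figures, not a quote.

```python
# Rough monthly cloud bill sketch using the illustrative AWS list rates above.
# All rates and the example workload are assumptions -- substitute your own.

GPU_HOURLY = 4.10             # $/hr, single A100 share of a p4d (from the table)
EGRESS_PER_GB = 0.09          # $/GB outbound
EBS_PER_GB_MONTH = 0.08       # $/GB-month, gp3
SNAPSHOT_PER_GB_MONTH = 0.05  # $/GB-month retained
NAT_HOURLY = 0.045            # $/hr
NAT_PER_GB = 0.045            # $/GB processed

def monthly_cloud_bill(gpu_hours, egress_gb, ebs_gb, snapshot_gb, nat_gb,
                       nat_hours=730):
    """Return (gpu_cost, hidden_cost, total) for one month."""
    gpu = gpu_hours * GPU_HOURLY
    hidden = (egress_gb * EGRESS_PER_GB
              + ebs_gb * EBS_PER_GB_MONTH
              + snapshot_gb * SNAPSHOT_PER_GB_MONTH
              + nat_hours * NAT_HOURLY
              + nat_gb * NAT_PER_GB)
    return gpu, hidden, gpu + hidden

if __name__ == "__main__":
    # Example: ~490 GPU-hours, 500GB egress, 500GB EBS, 1TB of retained
    # snapshots, and 300GB through the NAT gateway in one month.
    gpu, hidden, total = monthly_cloud_bill(490, 500, 500, 1000, 300)
    print(f"GPU: ${gpu:,.0f}  hidden: ${hidden:,.0f}  total: ${total:,.0f}")
```

Even with modest inputs, the hidden line items add a three-figure surcharge to the GPU sticker price; heavier egress or snapshot retention pushes the gap far wider.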
Break-Even Maths: When Dedicated Wins
Here is the question that settles the argument: at what monthly utilisation does dedicated beat cloud?
The maths is simple. Dedicated has a fixed monthly cost. Cloud has a per-hour cost. Find the hour count where they cross.
Worked example: RTX 4090.
- RunPod community RTX 4090: $0.34/hr × 730 hours = $248/month at 100% utilisation
- RunPod secure RTX 4090: $0.69/hr × 730 hours = $504/month at 100%
- gigagpu dedicated RTX 4090: ~£140/month flat (~$175/month at current FX)
- Break-even vs RunPod community: $175 / $0.34 = ~514 hours/month (70% utilisation)
- Break-even vs RunPod secure: $175 / $0.69 = ~253 hours/month (35% utilisation)
So if your workload runs more than about 17 hours a day on average, dedicated already wins against even the cheapest community spot pricing. Against the secure on-demand rate, dedicated wins after roughly 8.5 hours per day, and against Lambda’s A100 after about 4.5. A full AWS p4d at ~$32.77/hr burns through the dedicated 4090’s entire monthly fee in a little over 5 hours.
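If you’d rather settle it in code than in a spreadsheet, the same calculation is a few lines of Python. This is a minimal sketch using the worked-example figures above; the optional extra_cloud_monthly term is where the hidden costs from the previous section plug in.

```python
# Break-even hours: at how many GPU-hours per month does a flat dedicated
# fee beat a per-hour cloud rate? Figures below are from the worked example.

def breakeven_hours(dedicated_monthly, cloud_hourly, extra_cloud_monthly=0.0):
    """Cloud hours per month at which cloud cost equals the dedicated fee.

    extra_cloud_monthly covers cloud-only costs that don't scale with hours
    (egress, EBS, snapshots, NAT) and therefore lower the break-even point.
    """
    return max(dedicated_monthly - extra_cloud_monthly, 0.0) / cloud_hourly

DEDICATED_4090 = 175.0  # ~£140/month converted to USD

for name, rate in [("RunPod community", 0.34),
                   ("RunPod secure", 0.69),
                   ("Lambda A100", 1.29),
                   ("AWS p4d (per A100)", 4.10)]:
    hours = breakeven_hours(DEDICATED_4090, rate)
    print(f"{name:22s} {hours:6.0f} hr/month (~{hours / 30:.1f} hr/day)")

# With $45/month of cloud-side egress, the community break-even drops to ~382 hr:
print(breakeven_hours(DEDICATED_4090, 0.34, extra_cloud_monthly=45.0))
```

The loop reproduces the hours column in the table below.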
| Cloud option | Hourly rate | Hours/month to match £140 dedicated RTX 4090 | Equivalent daily usage |
|---|---|---|---|
| Vast.ai (cheapest) | $0.30 | ~583 hr | ~19 hr/day |
| RunPod community | $0.34 | ~514 hr | ~17 hr/day |
| RunPod secure | $0.69 | ~253 hr | ~8.5 hr/day |
| Lambda A100 (closest equivalent) | $1.29 | ~136 hr | ~4.5 hr/day |
| AWS p4d (single A100 share) | ~$4.10 | ~43 hr | ~1.4 hr/day |
The break-even shifts when you add the hidden costs from the previous section. A workload that pulls 500GB of data per month from cloud egress at $0.09/GB adds $45 to the cloud side, lowering the break-even threshold by another 130 hours on community RunPod. We expanded this analysis at GPU vs API pricing: the self-hosting break-even if you want the API-vs-self-host angle as well.
For exact pricing on the 4090 specifically, see RTX 4090 24GB monthly hosting cost and the full spec breakdown.
Use Case Decision Matrix
Break-even is one input. The other is workload shape. Some workloads are genuinely better on cloud regardless of cost. Here’s how we’d route them in 2026:
| Workload | Recommended | Why |
|---|---|---|
| One-off training run, <100 hours total | Cloud (RunPod, Vast.ai) | You’ll never amortise a monthly server. Hourly billing wins. |
| Burst inference for prototyping, <50 hr/wk | Cloud | Variable, low-utilisation. Pay only when in use. |
| Production inference 24/7 | Dedicated | Steady utilisation, predictable bill, no cold starts, no noisy neighbours. |
| Periodic batch jobs (weekly retraining) | Cloud spot or hybrid | Tolerate interruptions, only pay for the few hours per week you need. |
| Long fine-tune (weeks of continuous training) | Dedicated or reserved cloud | Avoids interruption risk; dedicated also avoids egress charges on checkpoints. |
| Multi-tenant SaaS, steady traffic | Dedicated cluster | Predictable cost per user, no surprise bills when traffic spikes. |
| Latency-critical (sub-100ms p99) | Dedicated | Cloud serverless GPU has 1-30s cold starts. Always-warm dedicated wins. |
| Research lab with bursty experiments | Hybrid: small dedicated baseline + cloud for spikes | Best of both — known-cost baseline plus elastic capacity. |
| Training H100-class jobs (no consumer GPU works) | Cloud (Lambda, Coreweave) | H100s are scarce on monthly contracts; cloud is the realistic option for short bursts. |
| Compliance-bound workloads (UK data residency, GDPR) | Dedicated UK | Single-tenant hardware, known location, no cross-border data flows. |
Notice the pattern. Cloud wins on the edges — very short workloads, very bursty workloads, very rare hardware. Dedicated wins in the middle — anything sustained, anything latency-sensitive, anything compliance-bound.
For inference specifically, see best GPU for LLM inference and cheapest GPU for AI inference to pick the right card for your workload before you commit.
Performance: Noisy Neighbours vs Predictable Iron
Cost is not the only axis. Performance characteristics differ in ways that don’t show up on a price page.
Shared infrastructure on cloud GPUs. Even when you “have” a GPU instance, the surrounding NVMe and network are usually shared with other tenants on the host. We’ve measured 2-4× variance in NVMe throughput on RunPod community instances depending on time of day and which neighbours were active. AWS p4d uses local NVMe and is more consistent, but the EBS-backed instances (cheaper) suffer noticeable IOPS contention.
Cold starts. Serverless GPU offerings (Modal, Replicate, RunPod serverless) have cold-start penalties between 1 second (warm pool) and 30+ seconds (true cold). For a chatbot, a 30-second cold start is a UX disaster. Dedicated GPU is always warm — your model loaded, your CUDA context initialised, your first token in milliseconds.
Network egress speeds. AWS limits cross-region transfer to ~5 Gbps per stream. Pulling a 100GB checkpoint from S3 to a p4d in another region takes a real chunk of time. Dedicated boxes typically have 1-10 Gbps unmetered uplinks and let you saturate them without per-GB charges.
Predictability. When you benchmark your inference engine (vLLM, TGI, llama.cpp) on a dedicated server, the numbers you get on Tuesday are the numbers you’ll get on Saturday at 3am. On a shared cloud GPU, the same workload can vary 10-20% between runs because of host-level contention. For SLA-bound services, that variance is genuinely expensive — you have to over-provision to hit p99 targets.
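If you want to put a number on that variance for your own stack, time the same request repeatedly and compare the spread. This is a minimal sketch against an OpenAI-compatible completions endpoint (vLLM and TGI both expose one); the URL, model name, and prompt are placeholders for your own deployment.

```python
# Time repeated identical requests to quantify run-to-run variance.
# URL, model name, and prompt are placeholders for your own deployment.
import statistics
import time

import requests

URL = "http://localhost:8000/v1/completions"  # hypothetical vLLM endpoint
PAYLOAD = {"model": "my-model", "prompt": "Hello", "max_tokens": 64}

def measure(runs: int = 20) -> None:
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(URL, json=PAYLOAD, timeout=60)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    mean = statistics.mean(latencies)
    spread = statistics.stdev(latencies)
    print(f"mean {mean * 1000:.0f} ms, "
          f"stdev {spread * 1000:.0f} ms ({spread / mean:.1%}), "
          f"worst {max(latencies) * 1000:.0f} ms")

if __name__ == "__main__":
    measure()
```

Run it at a few different times of day on both platforms; the percentage spread tells you how much capacity headroom you need to hold a p99 target.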
| Performance dimension | Cloud (typical) | Dedicated |
|---|---|---|
| NVMe IOPS variance | ±30-200% | ±5% |
| Cold start to first token | 1-30 seconds | 0 (always warm) |
| Network egress to internet | Throttled, billed per GB | Line-rate, included |
| GPU clock predictability | Subject to thermal throttling on shared hosts | Stable; you know the chassis |
| SSH/console latency | Cloud-managed (typically fine) | Direct (typically faster from UK) |
If you want to measure these characteristics on your own server, our guide on monitoring GPU usage on a dedicated server shows the tooling.
Compliance, Data Residency, and the GDPR Question
For UK and EU teams handling personal data, dedicated single-tenant hardware in a known jurisdiction is structurally easier to defend than multi-tenant cloud. Three reasons.
Data residency is unambiguous. A dedicated server in a UK data centre never moves. A cloud workload technically lives in a region you select, but management traffic, control planes, and support access can cross borders depending on the provider’s architecture. For data subject access requests and Article 30 records of processing activities, dedicated is simpler to document.
Single-tenancy reduces shared-fate risk. Multi-tenant GPU hosts have, on rare occasions, leaked state between tenants through shared L2 cache, NVLink topology, or VBIOS. The risk is small but real, and it shows up in security questionnaires from regulated buyers (healthcare, finance, government). Dedicated removes the question entirely.
HIPAA and similar frameworks. AWS and GCP both offer HIPAA-eligible services, but the BAA scope is narrow and changes per service. Running a HIPAA-bound workload on a dedicated server you control end-to-end is simpler — you write your own controls, you audit your own machine.
This doesn’t mean cloud is non-compliant. It means the compliance burden is heavier, and the documentation is fiddlier. For workloads where the data is the regulated asset, dedicated is the path of least resistance.
Migration Path: Prototype on Cloud, Scale on Dedicated
The most common pattern we see, and the one we recommend, is hybrid by lifecycle stage.
Stage 1: prototype. Spin up RunPod or Vast.ai for a week. Validate that your model, your code, and your assumptions all work. Pay $20-100. If the prototype dies, you’ve lost nothing. If it survives, you have a baseline to compare against.
Stage 2: pilot. Move to a single dedicated GPU (an RTX 3090 or RTX 4090 is usually enough). Run real traffic for a month. You now have actual utilisation numbers, actual p99 latency, actual egress volume. If utilisation is >30% and growing, you’ve made the right call.
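Utilisation is the number that makes or breaks the pilot, and it is cheap to capture. Here is a minimal sketch that samples GPU utilisation once a minute via NVML, assuming the pynvml bindings are installed (pip install nvidia-ml-py); the CSV path is just a placeholder.

```python
# Sample GPU utilisation once a minute so the pilot month produces hard numbers.
# Requires the NVML bindings: pip install nvidia-ml-py
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first (or only) GPU

try:
    with open("gpu_util.csv", "a") as log:  # placeholder path
        while True:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            log.write(f"{int(time.time())},{util.gpu},{mem.used // 2**20}\n")
            log.flush()
            time.sleep(60)
finally:
    pynvml.nvmlShutdown()
```

Average the utilisation column over the month and compare it against the break-even thresholds from the earlier table before you commit to Stage 3.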
Stage 3: production. Provision the dedicated GPU class your benchmarks justify. Add a second box for redundancy. Configure your inference stack — see setting up vLLM in production and installing PyTorch on a GPU server. Keep a small cloud-burst budget for unexpected spikes or one-off training runs.
Stage 4: scale. Multiple dedicated boxes. Load balancing across them. Cloud is now the exception, not the rule — used only for hardware you don’t have on dedicated (an H100 burst, for example) or for emergency capacity.
Our self-host LLM guide walks through the technical migration end-to-end, including weight transfer, inference engine setup, monitoring, and TLS termination.
Verdict
For experiments, prototypes, rare hardware, and genuinely bursty workloads, GPU cloud wins on flexibility. RunPod and Vast.ai in particular offer better per-hour rates than the hyperscalers and are the right starting point for almost any new project.
For sustained workloads above roughly 40% utilisation — and that includes any serious production inference, any continuous fine-tune, and any latency-sensitive service — dedicated wins on cost, performance, and compliance. The break-even is genuinely low: 4-8 hours per day of average usage is enough to make the maths work, and the predictability is a separate benefit you only appreciate once you’ve had your first surprise cloud bill.
Our pitch is straightforward: prototype anywhere, scale on dedicated. We host UK-based dedicated GPUs from £0.10/hr equivalent (RTX 5060 Ti) up through 4090s and beyond, with bandwidth included and no surprise charges. Browse dedicated GPU plans on the gigagpu portal and pick the card that matches your workload — or talk to us first if you want help with the break-even spreadsheet for your specific use case.