Cloud H100 80GB instances are the gold standard for transformer inference. They have the bandwidth (3.35 TB/s HBM3), the VRAM (80GB), the NVLink fabric (900 GB/s peer), and native FP8 with TMA – and they rent for $2-4/hour per GPU from the major providers, or roughly $98/hour for a full 8x SXM5 baseboard on AWS. A dedicated RTX 4090 24GB at flat UK pricing covers the same use cases at a fraction of the cost – if your model fits in 24GB. This guide draws the line between the two: when the H100 actually wins, when the 4090 is enough, the per-token economics across providers, and the workload scenarios that justify a roughly 2.5-3.5x premium. Both options sit in the broader UK GPU range.
Contents
- Spec sheet
- H100 cloud rates today
- Throughput comparison
- When H100 wins decisively
- When 4090 wins on TCO
- Cost-per-token math across providers
- Three concrete scenarios
- Production gotchas
- Verdict and decision criteria
Spec sheet
| Spec | RTX 4090 24GB | H100 80GB SXM | H100 80GB PCIe |
|---|---|---|---|
| Architecture | Ada AD102 | Hopper GH100 | Hopper GH100 |
| VRAM | 24GB GDDR6X | 80GB HBM3 | 80GB HBM3 |
| Bandwidth | 1,008 GB/s | 3,350 GB/s | 2,000 GB/s |
| FP8 TFLOPS (sparse) | ~660 | ~3,958 | ~3,026 |
| NVLink | No | Yes (900 GB/s peer) | Optional bridge (600 GB/s) |
| MIG | No | Yes (7 partitions) | Yes (7 partitions) |
| TDP | 450W | 700W | 350W |
| Form factor | 3-slot consumer | SXM5 module | 2-slot PCIe |
| Native FP8 | Yes (Ada 4th gen tensor cores) | Yes (Hopper TMA-aware) | Yes |
| TMA (Tensor Memory Accelerator) | No | Yes | Yes |
The H100’s bandwidth advantage is the headline number (3.3x the 4090), but the more important differences are the NVLink fabric for multi-card scaling, MIG for tenant isolation, and TMA, which accelerates asynchronous memory copies in Hopper-aware kernels. The full spec deep-dive is in the vs H100 post.
H100 cloud rates today
| Provider | SKU | $/hour | $/month always-on | £/month equivalent |
|---|---|---|---|---|
| RunPod Community | H100 80GB | $2.29 | $1,649 | ~£1,300 |
| RunPod Secure Cloud | H100 80GB | $2.99 | $2,153 | ~£1,700 |
| Lambda Labs 1-Click | H100 PCIe | $2.49 | $1,793 | ~£1,420 |
| Vast.ai | H100 80GB | $1.50-3.00 | $1,080-2,160 | ~£855-1,710 |
| AWS p5.48xlarge | 8x H100 SXM | $98.32 | $70,791 | ~£56,100 |
| Azure ND H100 v5 | 1x H100 SXM | $3.65 | $2,628 | ~£2,080 |
| GCP A3 High | 1x H100 SXM | $3.40 | $2,448 | ~£1,940 |
| UK dedicated 4090 | RTX 4090 24GB | n/a | ~$700 | £550-575 |
| UK dedicated 5090 | RTX 5090 32GB | n/a | ~$1,150 | £900 |
| UK dedicated 6000 Pro | RTX 6000 Pro 96GB | n/a | ~$2,800 | £2,200 |
The headline number: RunPod Community H100 at $2.29/hr always-on is roughly $1,649/month – 2.4x the cost of a dedicated 4090. Lambda Labs H100 PCIe is 2.6x. AWS p5.48xlarge for 24/7 is $70k/month, which is only justified at petabyte-scale or sustained training. The RunPod pricing and Lambda Labs posts cover provider-specific gotchas.
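The monthly and sterling columns are simple to reproduce: always-on is taken as 720 hours/month and the £ column assumes roughly $1 ≈ £0.79. A quick sketch with a few of the rates above (the FX figure and hourly rates are the article's assumptions, not live pricing):

```python
# Reproduce the "$/month always-on" and "£/month equivalent" columns above.
GBP_PER_USD = 0.79      # approximate rate used in the table, not a live quote
HOURS_PER_MONTH = 720   # always-on: 24 h x 30 days

hourly_usd = {
    "RunPod Community H100": 2.29,
    "Lambda H100 PCIe": 2.49,
    "Azure ND H100 v5": 3.65,
}

for name, rate in hourly_usd.items():
    monthly_usd = rate * HOURS_PER_MONTH
    print(f"{name}: ${monthly_usd:,.0f}/month (~£{monthly_usd * GBP_PER_USD:,.0f})")
```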
Throughput comparison
| Workload | 4090 t/s | H100 SXM t/s | H100 PCIe t/s | H100 advantage |
|---|---|---|---|---|
| Llama 3.1 8B FP8 batch 1 | 198 | 330 | 285 | 1.67-1.44x |
| Llama 3.1 8B FP8 aggregate batch 32 | 1,100 | 2,200 | 1,900 | 2.0-1.7x |
| Llama 3.1 70B AWQ INT4 batch 1 | 22 | 55 | 48 | 2.5-2.2x |
| Llama 3.1 70B FP8 batch 1 | OOM | ~70 | ~58 | n/a |
| Llama 3.1 70B FP8 concurrency 8 | OOM | ~340 aggr | ~280 aggr | n/a |
| 2x H100 NVLink Llama 70B FP8 | n/a | ~110 t/s decode | n/a | n/a |
| Mixtral 8x22B AWQ batch 1 | OOM | ~28 | ~24 | n/a |
| SDXL 1024×1024 | 3.4s | 2.0s | 2.3s | 1.7-1.5x |
| Llama 8B FP8 t/J (efficiency) | 3.4 | 5.0 | 5.5 | 1.5x (SXM), 1.6x (PCIe) |
The H100 SXM is roughly 1.7x the 4090 on small-model batch 1, 2x on aggregate batch, 2.5x on 70B AWQ INT4, and uniquely capable of running 70B FP8 (or larger with NVLink). The H100 PCIe is meaningfully slower than SXM – 14% on small batch, more on multi-card workloads where NVLink absence matters. For tokens-per-joule the H100 is about 50% more efficient than the 4090 because HBM3 is more energy-efficient per byte transferred than GDDR6X.
When H100 wins decisively
- Llama 70B FP8 native (no INT4 quantisation needed). 80GB HBM3 fits the full model with FP16 KV; 4090 cannot at all. Quality matters when evals are tight.
- Mixtral 8x22B and 100B+ models. 4090 OOMs entirely. H100 fits Mixtral 8x22B AWQ and runs at ~28 t/s; NVLink-bridged H100 pair fits FP8 versions.
- NVLink-bridged 70B FP8 production. 2x H100 NVLink runs Llama 70B FP8 at ~110 t/s decode with linear scaling on aggregate – dual 4090 caps at ~40 t/s with PCIe coordination tax.
- High-concurrency aggregate throughput >5,000 t/s. 3.35 TB/s bandwidth lets H100 scale far beyond the 4090’s ~1,800 t/s ceiling.
- FP8 training with bf16 master weights. 80GB headroom and Hopper TMA make training viable; 4090 lacks the VRAM and the TMA.
- MIG partitioning for tenant isolation. Need to slice into 7 isolated tenants on one card with hard memory and SM boundaries – only Hopper/Ampere offer this.
- Sub-100ms TTFT at very high concurrency. No other card matches HBM3 bandwidth for prefill; 4090 TTFT climbs past 200ms at concurrency >8.
- Burst workloads <6 hours/day. Six hours at $2.29/hour is $13.74/day (~$412/month), cheaper than dedicated. The break-even sits around 250-320 hours/month depending on the hourly rate – see the worked calculation after this list.
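The break-even arithmetic referenced in the burst bullet, using the article's own figures (a dedicated 4090 at roughly $730/month and the hourly rates from the pricing table):

```python
# Break-even hours: where a metered cloud H100 bill matches the flat monthly
# cost of a dedicated 4090 (~£575, roughly $730). Rates from the table above.
DEDICATED_4090_USD_MONTH = 730

for label, usd_per_hour in [
    ("RunPod Community H100", 2.29),
    ("Lambda H100 PCIe", 2.49),
    ("RunPod Secure H100", 2.99),
]:
    hours = DEDICATED_4090_USD_MONTH / usd_per_hour
    print(f"{label}: break-even ~{hours:.0f} h/month (~{hours / 30:.1f} h/day)")
```

Below roughly 8-10 hours of use per day, the hourly meter stays cheaper than flat-rate dedicated; above it, dedicated wins.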
When 4090 wins on TCO
- Llama 8B / 14B FP8 chat APIs always-on. Per-token cost is lower on a dedicated 4090 – £0.039/M tokens vs £0.045-0.052/M tokens on a cloud H100 – and the bill is flat.
- Llama 70B AWQ INT4. Fits in 24GB. The H100 is 2.5x faster but ~3x the price, so cost-per-token is roughly tied (£0.34/M vs £0.28-0.32/M) – the flat, predictable bill tips steady traffic to the 4090.
- SDXL and image generation. 4090 is plenty fast at 3.4s/image; H100 wasted on this workload.
- Always-on workloads >500 hours/month. The H100 hourly meter dominates; dedicated flat pricing wins.
- UK data residency requirements. Most cloud H100 capacity is US-region (RunPod, Lambda, Vast.ai). UK-dedicated 4090 satisfies GDPR data residency without the cross-border transfer questions.
- Predictable monthly billing. One invoice vs hourly meter that surprises Finance every month.
- Single-tenant workloads. No noisy-neighbour problems, no shared SM time, no rate limits.
- Cost-sensitive MVPs and side projects. £575/month is real money but it is committed and predictable.
Cost-per-token math across providers
Assume £575/month for a dedicated 4090 (~$730), and a cloud H100 PCIe at $2.49/hr always-on = $1,793/month. The H100 costs ~2.5x more for roughly 1.7x the aggregate small-model throughput – so cost-per-token lands in the same ballpark on Llama 8B FP8.
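The per-million-token figures in the table below come down to one division: monthly card cost over millions of tokens actually served. A minimal sketch of that arithmetic – the throughput and utilisation inputs here are placeholders you should swap for your own measured numbers, and they drive the result at least as much as the hardware choice does:

```python
# Cost per million tokens = monthly cost / millions of tokens served.
SECONDS_PER_MONTH = 720 * 3600

def gbp_per_million_tokens(monthly_gbp: float, sustained_tps: float,
                           utilisation: float = 1.0) -> float:
    tokens_per_month = sustained_tps * utilisation * SECONDS_PER_MONTH
    return monthly_gbp / (tokens_per_month / 1e6)

# Example inputs only: a £575/month card averaging 1,100 tokens/s served
print(f"{gbp_per_million_tokens(575, 1100):.3f} £/M tokens")
```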
| Workload | 4090 dedicated £/M tokens | H100 PCIe £/M tokens | H100 SXM £/M tokens | Cheaper |
|---|---|---|---|---|
| Llama 8B FP8 24/7 | £0.039 | £0.052 | £0.045 | 4090 |
| Llama 70B AWQ INT4 24/7 | £0.34 | £0.32 | £0.28 | Tied / H100 |
| Llama 70B FP8 24/7 | n/a (OOM) | £0.32 | £0.25 | H100 only |
| Mixtral 8x22B AWQ | n/a (OOM) | £0.85 | £0.65 | H100 only |
| SDXL £/image | £0.0009 | £0.0028 | £0.0026 | 4090 |
| Llama 70B FP8 burst 4hr/day | n/a | £0.21 (only paying 4hr) | £0.18 | H100 burst |
For most always-on workloads the 4090 is the better economics. The H100 is the right answer when you need the model to fit (70B FP8, Mixtral 8x22B, 100B+), when you need MIG, when you need NVLink-bridged throughput, or when you can run burst traffic that fits in <500 hours/month of cloud meter.
Three concrete scenarios
Scenario A: chat SaaS with 5M tokens/day Llama 8B FP8
5M tokens/day = 150M tokens/month. Dedicated 4090 at £0.039/M = £5.85/month of effective compute on a £575/month card (utilisation 1%). RunPod H100 at £0.052/M = £7.80/month effective on a £1,650/month card (utilisation 0.5%). Dedicated 4090 wins by £1,075/month, and the 4090 has 99% headroom for traffic growth.
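The Scenario A numbers, worked through in a few lines (all inputs are the article's own figures):

```python
# Scenario A: 5M tokens/day of Llama 8B FP8, served always-on.
TOKENS_PER_DAY = 5_000_000
million_tokens_per_month = TOKENS_PER_DAY * 30 / 1e6   # 150 M tokens/month

# (flat £/month, £ per million tokens from the cost table above)
options = {
    "Dedicated 4090": (575, 0.039),
    "RunPod H100":    (1650, 0.052),
}

for name, (flat_gbp, gbp_per_m) in options.items():
    effective = million_tokens_per_month * gbp_per_m    # compute actually consumed
    print(f"{name}: £{flat_gbp}/month flat, £{effective:.2f} effective, "
          f"{effective / flat_gbp:.1%} utilisation")
```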
Scenario B: agent backend running Llama 70B FP8
The 4090 cannot run 70B FP8 at all – the quality eval fails on AWQ INT4. The options are 2x 4090 TP=2 at £1,150/month for ~40 t/s decode (still INT4-only, and slow for an agent), 1x RTX 6000 Pro at £2,200/month for ~28 t/s with comfortable VRAM headroom, or a cloud H100 PCIe at ~£1,420/month for ~58 t/s. The cloud H100 wins on £/throughput at modest scale, but if agent traffic grows toward ~2,000 t/s aggregate the dedicated 6000 Pro starts to win on flat pricing over the metered cloud bill.
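A quick £-per-throughput comparison of the three options, using the article's price and decode-rate estimates:

```python
# Scenario B: £ per token/s of 70B decode for each always-on option.
options = [
    ("2x 4090 TP=2 (AWQ INT4 only)", 1150, 40),
    ("Dedicated RTX 6000 Pro 96GB",  2200, 28),
    ("Cloud H100 PCIe",              1420, 58),
]

for name, gbp_per_month, tokens_per_s in options:
    print(f"{name}: £{gbp_per_month / tokens_per_s:.0f} per token/s per month")
```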
Scenario C: research lab with 4hr/day FP8 training bursts
Always-on dedicated 4090 sits idle 20hr/day. Burst 4hr/day on RunPod H100 SXM at $2.99/hr = $358/month, with 80GB HBM3 and FP8 TMA-aware training kernels that the 4090 cannot match. H100 burst wins on capability and cost, and the lab pays only for wall-clock training time. The research lab guide goes deeper on this pattern.
Production gotchas
- Cold-start latency on cloud. RunPod and Lambda spin up in 30s-3min for new pods. Dedicated 4090 is always warm. For latency-sensitive workloads the cold start eats your SLA.
- H100 PCIe vs SXM confusion. Cheaper providers list “H100” without specifying. PCIe is ~15% slower on single-card and lacks the SXM5 NVLink fabric. Always confirm.
- Storage and egress fees. The cloud H100 rate covers the GPU only; storage, network egress, and snapshots are billed separately and add 10-30% to the bill. Dedicated pricing includes these.
- Spot/community instance preemption. RunPod Community can preempt with 10-second notice. Production workloads need Secure Cloud at +30% price.
- Data residency. Most cloud H100 is US/EU-Frankfurt. UK customers with strict residency need dedicated UK or AWS London H100 (rare and expensive).
- NVLink topology. Multi-card cloud H100 sometimes lacks NVLink within the rented slice (only within the SXM baseboard). Verify with `nvidia-smi nvlink -s` after provisioning – see the sketch after this list.
- FP8 kernel readiness. Older vLLM/TensorRT-LLM versions on the cloud image may not exploit Hopper TMA – rebuild on the latest CUDA 12.6+ container to get full H100 throughput.
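One way to check what a freshly provisioned instance actually gives you is the NVML Python bindings (`pip install nvidia-ml-py`). A minimal sketch – treat it as a starting point rather than a provider-specific health check:

```python
# Sanity-check a freshly provisioned instance: card name on GPU 0 and
# whether any NVLink links are actually active (0 on a PCIe card without a bridge).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("GPU 0:", pynvml.nvmlDeviceGetName(handle))

active_links = 0
for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
    try:
        if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
            active_links += 1
    except pynvml.NVMLError:
        break  # no more links on this device
print(f"Active NVLink links: {active_links}")
pynvml.nvmlShutdown()
```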
Verdict and decision criteria
| Need | Best option |
|---|---|
| Always-on Llama 8B FP8 chat | Dedicated 4090 (cost-per-token wins) |
| Always-on Llama 70B AWQ INT4 | Dedicated 4090 (cost-per-token within 10% of H100) |
| Llama 70B FP8 production quality | RunPod H100 if <500hr/month, dedicated 6000 Pro otherwise |
| Mixtral 8x22B / 100B+ models | H100 (cloud or dedicated) – 4090 cannot |
| Research/training bursts <6hr/day | RunPod or Lambda H100 hourly |
| UK data residency required | Dedicated UK 4090, 5090 or 6000 Pro |
| Multi-tenant SaaS with isolation SLA | Cloud H100 with MIG |
| Sub-100ms TTFT at concurrency 32+ | H100 SXM (no alternative) |
| Image generation always-on | Dedicated 4090 (3x cheaper than H100) |
Verdict. The 4090 wins decisively on cost-per-token for any workload that fits in 24GB and runs always-on. The H100 wins decisively on workloads the 4090 cannot run, on burst patterns <500 hours/month, on multi-card NVLink topologies, and on training. AWS p5.48xlarge ($98/hour, $70k/month) is reserved for petabyte-scale training and serious enterprise inference – if you are reading this guide it is almost certainly not the right answer for you.
Production inference at fixed cost
One Ada AD102 in the UK, no hourly meter, no preemption. Dedicated GPU hosting.
Order the RTX 4090 24GB
See also: 4090 vs H100 spec deep-dive, vs RunPod pricing, vs Lambda Labs, vs Together AI pricing, 2x 4090 pairing, when to upgrade, vs A100 80GB, ROI analysis.