
RTX 4090 24GB Dedicated vs Renting H100 in the Cloud

Hourly cloud H100 prices, the workloads that actually justify the 5-10x premium, the cost-per-token math, and when a flat-rate UK 4090 still wins.

Cloud H100 80GB instances are the gold standard for transformer inference. They have the bandwidth (3.35 TB/s HBM3), the VRAM (80GB), the NVLink fabric (900 GB/s peer), and native FP8 with TMA – and they rent for $2-4/hour from the major providers, or $98/hour for an 8x SXM5 baseboard on AWS. A dedicated RTX 4090 24GB at flat UK pricing covers the same use cases at a fraction of the cost – if your model fits. This guide draws the line between the two: when the H100 actually wins, when the 4090 is enough, the per-token economics across providers, and the workload patterns that justify a 5-10x hourly premium. Both options sit in the broader UK GPU range.


Spec sheet

| Spec | RTX 4090 24GB | H100 80GB SXM | H100 80GB PCIe |
| --- | --- | --- | --- |
| Architecture | Ada AD102 | Hopper GH100 | Hopper GH100 |
| VRAM | 24GB GDDR6X | 80GB HBM3 | 80GB HBM3 |
| Bandwidth | 1,008 GB/s | 3,350 GB/s | 2,000 GB/s |
| FP8 TFLOPS (sparse) | ~660 | ~3,958 | ~3,026 |
| NVLink | No | Yes (900 GB/s peer) | Optional bridge (600 GB/s) |
| MIG | No | Yes (7 partitions) | Yes (7 partitions) |
| TDP | 450W | 700W | 350W |
| Form factor | 3-slot consumer | SXM5 module | 2-slot PCIe |
| Native FP8 | Yes (Ada 4th-gen tensor cores) | Yes (Hopper, TMA-aware) | Yes |
| TMA (Tensor Memory Accelerator) | No | Yes | Yes |

The H100’s bandwidth advantage is the headline (3.3x the 4090), but the more important differences are the NVLink fabric for multi-card scaling, MIG for tenant isolation, and TMA, which accelerates async memory copies in Hopper-aware kernels. The full spec deep-dive is in the vs H100 post.

H100 cloud rates today

| Provider | SKU | $/hour | $/month always-on | £/month equivalent |
| --- | --- | --- | --- | --- |
| RunPod Community | H100 80GB | $2.29 | $1,649 | ~£1,300 |
| RunPod Secure Cloud | H100 80GB | $2.99 | $2,153 | ~£1,700 |
| Lambda Labs 1-Click | H100 PCIe | $2.49 | $1,793 | ~£1,420 |
| Vast.ai | H100 80GB | $1.50-3.00 | $1,080-2,160 | ~£855-1,710 |
| AWS p5.48xlarge | 8x H100 SXM | $98.32 | $70,791 | ~£56,100 |
| Azure ND H100 v5 | 1x H100 SXM | $3.65 | $2,628 | ~£2,080 |
| GCP A3 High | 1x H100 SXM | $3.40 | $2,448 | ~£1,940 |
| UK dedicated 4090 | RTX 4090 24GB | n/a | ~$700 | £550-575 |
| UK dedicated 5090 | RTX 5090 32GB | n/a | ~$1,150 | £900 |
| UK dedicated 6000 Pro | RTX 6000 Pro 96GB | n/a | ~$2,800 | £2,200 |

The headline number: RunPod Community H100 at $2.29/hr always-on is roughly $1,649/month – 2.4x the cost of a dedicated 4090. Lambda Labs' H100 PCIe is 2.6x. AWS p5.48xlarge run 24/7 is ~$71k/month, which is only justified at petabyte-scale data or sustained training. The RunPod pricing and Lambda Labs posts cover provider-specific gotchas.
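The $/month column is just the hourly rate times a 720-hour billing month; a minimal sketch of the conversion (rates taken from the table above; the £ column additionally depends on an exchange rate, which is not modelled here):

```python
HOURS_PER_MONTH = 24 * 30  # the 720-hour billing month the table assumes

def always_on_monthly(rate_usd_per_hour: float) -> float:
    """Monthly bill for a cloud GPU left running 24/7 at an hourly rate."""
    return rate_usd_per_hour * HOURS_PER_MONTH

# RunPod Community H100 at $2.29/hr
print(round(always_on_monthly(2.29)))   # ≈ $1,649/month
# AWS p5.48xlarge (8x H100 SXM) at $98.32/hr
print(round(always_on_monthly(98.32)))  # ≈ $70,790/month
```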

Throughput comparison

| Workload | 4090 t/s | H100 SXM t/s | H100 PCIe t/s | H100 advantage |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B FP8, batch 1 | 198 | 330 | 285 | 1.67-1.44x |
| Llama 3.1 8B FP8, aggregate batch 32 | 1,100 | 2,200 | 1,900 | 2.0-1.7x |
| Llama 3.1 70B AWQ INT4, batch 1 | 22 | 55 | 48 | 2.5-2.2x |
| Llama 3.1 70B FP8, batch 1 | OOM | ~70 | ~58 | n/a |
| Llama 3.1 70B FP8, concurrency 8 | OOM | ~340 aggregate | ~280 aggregate | n/a |
| 2x H100 NVLink, Llama 70B FP8 | n/a | ~110 decode | n/a | n/a |
| Mixtral 8x22B AWQ, batch 1 | OOM | ~28 | ~24 | n/a |
| SDXL 1024×1024 | 3.4s | 2.0s | 2.3s | 1.7-1.5x |
| Llama 8B FP8, t/J (efficiency) | 3.4 | 5.0 | 5.5 | 1.5x (SXM) |

The H100 SXM is roughly 1.7x the 4090 on small-model batch 1, 2x on aggregate batch, 2.5x on 70B AWQ INT4, and uniquely capable of running 70B FP8 (or larger models with NVLink). The H100 PCIe is meaningfully slower than SXM – ~14% on small-batch work, more on multi-card workloads where the absence of NVLink matters. On tokens-per-joule the H100 is about 50% more efficient than the 4090, because HBM3 moves bytes at a lower energy cost than GDDR6X.

When H100 wins decisively

  • Llama 70B FP8 native (no quantisation). 80GB of HBM3 fits the full model with an FP16 KV cache; the 4090 cannot fit it at all. Quality matters when evals are tight.
  • Mixtral 8x22B and 100B+ models. 4090 OOMs entirely. H100 fits Mixtral 8x22B AWQ and runs at ~28 t/s; NVLink-bridged H100 pair fits FP8 versions.
  • NVLink-bridged 70B FP8 production. 2x H100 NVLink runs Llama 70B FP8 at ~110 t/s decode with linear aggregate scaling – a dual-4090 setup caps at ~40 t/s and pays a PCIe coordination tax.
  • High-concurrency aggregate throughput >5,000 t/s. 3.35 TB/s bandwidth lets H100 scale far beyond the 4090’s ~1,800 t/s ceiling.
  • FP8 training with bf16 master weights. 80GB headroom and Hopper TMA make training viable; 4090 lacks the VRAM and the TMA.
  • MIG partitioning for tenant isolation. Need to slice into 7 isolated tenants on one card with hard memory and SM boundaries – only Hopper/Ampere offer this.
  • Sub-100ms TTFT at very high concurrency. No other card matches HBM3 bandwidth for prefill; 4090 TTFT climbs past 200ms at concurrency >8.
  • Burst workloads <6 hours/day. Six hours at $2.29/hour is $13.74/day (~$412/month) – cheaper than a dedicated card. The break-even sits around 250-300 metered hours/month, depending on which dedicated price you compare against.
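The break-even in the last bullet is a single division of the flat dedicated price by the hourly cloud rate; a minimal sketch, assuming the ~$700/month dedicated-4090 figure from the rates table:

```python
def break_even_hours(dedicated_usd_per_month: float, cloud_usd_per_hour: float) -> float:
    """Metered cloud hours per month at which the hourly bill matches a flat dedicated bill."""
    return dedicated_usd_per_month / cloud_usd_per_hour

# ~$700/month dedicated 4090 vs RunPod Community H100 at $2.29/hr
print(round(break_even_hours(700, 2.29)))  # ≈ 306 hours/month, roughly 10 hours/day
```

Below that many metered hours the cloud H100 is cheaper in raw spend; the exact crossover shifts with the exchange rate and with which dedicated SKU you compare against.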

When 4090 wins on TCO

  • Llama 8B / 14B FP8 chat APIs always-on. Per-token cost is dramatically lower on dedicated 4090 – £0.039/M tokens vs £0.10-0.15/M tokens on RunPod H100.
  • Llama 70B AWQ INT4. Fits on 24GB. H100 is 2.5x faster but ~3x the price – cost-per-token on dedicated 4090 still wins for steady traffic.
  • SDXL and image generation. 4090 is plenty fast at 3.4s/image; H100 wasted on this workload.
  • Always-on workloads >500 hours/month. The H100 hourly meter dominates; dedicated flat pricing wins.
  • UK data residency requirements. Most cloud H100 capacity is US-region (RunPod, Lambda, Vast.ai). UK-dedicated 4090 satisfies GDPR data residency without the cross-border transfer questions.
  • Predictable monthly billing. One invoice vs hourly meter that surprises Finance every month.
  • Single-tenant workloads. No noisy-neighbour problems, no shared SM time, no rate limits.
  • Cost-sensitive MVPs and side projects. £575/month is real money but it is committed and predictable.

Cost-per-token math across providers

Assume £575/month for a dedicated 4090 (~$730), and a cloud H100 PCIe at $2.49/hr always-on = $1,793/month. The H100 costs 2.5x more for ~2x small-model throughput – so cost-per-token is similar on Llama 8B FP8.

| Workload | 4090 dedicated £/M tokens | H100 PCIe £/M tokens | H100 SXM £/M tokens | Cheaper |
| --- | --- | --- | --- | --- |
| Llama 8B FP8, 24/7 | £0.039 | £0.052 | £0.045 | 4090 |
| Llama 70B AWQ INT4, 24/7 | £0.34 | £0.32 | £0.28 | Tied / H100 |
| Llama 70B FP8, 24/7 | n/a (OOM) | £0.32 | £0.25 | H100 only |
| Mixtral 8x22B AWQ | n/a (OOM) | £0.85 | £0.65 | H100 only |
| SDXL, £/image | £0.0009 | £0.0028 | £0.0026 | 4090 |
| Llama 70B FP8, burst 4hr/day | n/a | £0.21 (paying 4hr only) | £0.18 | H100 burst |

For most always-on workloads the 4090 is the better economics. The H100 is the right answer when you need the model to fit (70B FP8, Mixtral 8x22B, 100B+), when you need MIG, when you need NVLink-bridged throughput, or when you can run burst traffic that fits in <500 hours/month of cloud meter.
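Every £/M-token cell reduces to monthly cost divided by monthly token volume; a minimal sketch of the arithmetic (the throughput and utilisation inputs below are illustrative placeholders, not the exact assumptions behind the table):

```python
SECONDS_PER_MONTH = 720 * 3600  # 720-hour billing month

def cost_per_million_tokens(monthly_cost_gbp: float,
                            tokens_per_second: float,
                            utilisation: float = 1.0) -> float:
    """£ per million tokens for a GPU billed at a flat monthly rate."""
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH * utilisation
    return monthly_cost_gbp * 1e6 / tokens_per_month

# Dedicated 4090 at £575/month, hypothetical 1,100 t/s sustained aggregate
print(round(cost_per_million_tokens(575, 1100), 3))  # ≈ £0.202/M at 100% load
```

Cost per token falls linearly with sustained throughput, which is why the always-on columns reward whichever card you can keep busiest.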

Three concrete scenarios

Scenario A: chat SaaS with 5M tokens/day Llama 8B FP8

5M tokens/day = 150M tokens/month. A dedicated 4090 at £0.039/M consumes £5.85/month of effective compute on a £575/month card (≈1% of the monthly spend). A RunPod H100 at £0.052/M consumes £7.80/month on a ~£1,650/month pod (≈0.5%). The dedicated 4090 wins by £1,075/month on the flat rate, and keeps ~99% headroom for traffic growth.
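Scenario A is plain arithmetic on the figures above; a quick sketch using the per-M-token rates from the cost table:

```python
tokens_per_day = 5_000_000
tokens_per_month = tokens_per_day * 30           # 150M tokens/month

# Effective compute consumed at each card's £/M-token rate
effective_4090 = tokens_per_month / 1e6 * 0.039  # on a £575/month dedicated card
effective_h100 = tokens_per_month / 1e6 * 0.052  # on a ~£1,650/month RunPod H100

flat_rate_saving = 1650 - 575                    # what actually hits the invoice

print(effective_4090, effective_h100, flat_rate_saving)  # ≈ 5.85, 7.8, 1075
```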

Scenario B: agent backend running Llama 70B FP8

The 4090 cannot run 70B FP8 at all, and the quality eval fails on AWQ INT4. The options are 2x 4090 TP=2 at £1,150/month for ~40 t/s decode (slow for an agent), a single 6000 Pro at £2,200/month for a comfortable ~28 t/s, or a cloud H100 PCIe (Lambda at £1,420/month) for ~58 t/s. The cloud H100 wins on £/throughput at modest scale, but once agent traffic exceeds ~2,000 t/s aggregate the dedicated 6000 Pro starts to win because it is always-on.

Scenario C: research lab with 4hr/day FP8 training bursts

Always-on, a dedicated 4090 would sit idle 20 hours a day. Bursting 4hr/day on a RunPod H100 SXM at the $2.99/hr Secure Cloud rate comes to $2.99 × 4 × 30 ≈ $359/month, with 80GB of HBM3 and FP8 TMA-aware training kernels the 4090 cannot match. The H100 burst wins on both capability and cost, and the lab pays only for wall-clock training time. The research lab guide goes deeper on this pattern.
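Scenario C's bill is just the hourly rate times wall-clock burst hours; a minimal sketch, using the RunPod Secure Cloud rate from the table:

```python
def burst_monthly(rate_usd_per_hour: float, hours_per_day: float, days: int = 30) -> float:
    """Monthly cloud bill when you pay only for burst hours, not 24/7."""
    return rate_usd_per_hour * hours_per_day * days

# 4 hours of FP8 training a day on an H100 at $2.99/hr
print(burst_monthly(2.99, 4))  # ≈ $359/month vs £575+ for an idle dedicated card
```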

Production gotchas

  1. Cold-start latency on cloud. RunPod and Lambda spin up in 30s-3min for new pods. Dedicated 4090 is always warm. For latency-sensitive workloads the cold start eats your SLA.
  2. H100 PCIe vs SXM confusion. Cheaper providers list “H100” without specifying. PCIe is ~15% slower on single-card and lacks the SXM5 NVLink fabric. Always confirm.
  3. Storage and egress fees. Cloud H100 is the GPU rate; storage, network egress, and snapshots are billed separately and add 10-30% to the bill. Dedicated includes these.
  4. Spot/community instance preemption. RunPod Community can preempt with 10-second notice. Production workloads need Secure Cloud at +30% price.
  5. Data residency. Most cloud H100 is US/EU-Frankfurt. UK customers with strict residency need dedicated UK or AWS London H100 (rare and expensive).
  6. NVLink topology. Multi-card cloud H100 sometimes lacks NVLink within the rented slice (only within the SXM baseboard). Verify nvidia-smi nvlink -s after provisioning.
  7. FP8 kernel readiness. Older vLLM/TensorRT-LLM versions on the cloud image may not exploit Hopper TMA – rebuild on the latest CUDA 12.6+ container to get full H100 throughput.

Verdict and decision criteria

| Need | Best option |
| --- | --- |
| Always-on Llama 8B FP8 chat | Dedicated 4090 (cost-per-token wins) |
| Always-on Llama 70B AWQ INT4 | Dedicated 4090 (cost-per-token within 10% of H100) |
| Llama 70B FP8 production quality | RunPod H100 if <500hr/month, dedicated 6000 Pro otherwise |
| Mixtral 8x22B / 100B+ models | H100 (cloud or dedicated) – 4090 cannot run them |
| Research/training bursts <6hr/day | RunPod or Lambda H100 hourly |
| UK data residency required | Dedicated UK 4090, 5090 or 6000 Pro |
| Multi-tenant SaaS with isolation SLA | Cloud H100 with MIG |
| Sub-100ms TTFT at concurrency 32+ | H100 SXM (no alternative) |
| Image generation, always-on | Dedicated 4090 (3x cheaper than H100) |

Verdict. The 4090 wins decisively on cost-per-token for any workload that fits in 24GB and runs always-on. The H100 wins decisively on workloads the 4090 cannot run, on burst patterns <500 hours/month, on multi-card NVLink topologies, and on training. AWS p5.48xlarge ($98/hour, $70k/month) is reserved for petabyte-scale training and serious enterprise inference – if you are reading this guide it is almost certainly not the right answer for you.

Production inference at fixed cost

One Ada AD102 in the UK, no hourly meter, no preemption. Dedicated GPU hosting.

Order the RTX 4090 24GB

See also: 4090 vs H100 spec deep-dive, vs RunPod pricing, vs Lambda Labs, vs Together AI pricing, 2x 4090 pairing, when to upgrade, vs A100 80GB, ROI analysis.


gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
