Cloud H100 80GB instances are the gold standard for transformer inference. They have the bandwidth (3.35 TB/s HBM3), the VRAM (80GB), the NVLink fabric (900 GB/s peer), and native FP8 with TMA – and they rent for $2-4/hour per GPU from the major providers, or roughly $98/hour for a full 8x SXM5 baseboard on AWS. A dedicated RTX 4090 24GB at flat UK pricing covers the same use cases at a fraction of the cost – if your model fits in 24GB. This guide draws the line between the two: when the H100 actually wins, when the 4090 is enough, the per-token economics across providers, and the workload scenarios that justify a roughly 2.5-3.5x premium. Both options sit in the broader UK GPU range.
Contents
- Spec sheet
- H100 cloud rates today
- Throughput comparison
- When H100 wins decisively
- When 4090 wins on TCO
- Cost-per-token math across providers
- Three concrete scenarios
- Production gotchas
- Verdict and decision criteria
Spec sheet
| Spec | RTX 4090 24GB | H100 80GB SXM | H100 80GB PCIe |
|---|---|---|---|
| Architecture | Ada AD102 | Hopper GH100 | Hopper GH100 |
| VRAM | 24GB GDDR6X | 80GB HBM3 | 80GB HBM3 |
| Bandwidth | 1,008 GB/s | 3,350 GB/s | 2,000 GB/s |
| FP8 TFLOPS (sparse) | ~660 | ~3,958 | ~3,026 |
| NVLink | No | Yes (900 GB/s peer) | Optional bridge (600 GB/s) |
| MIG | No | Yes (7 partitions) | Yes (7 partitions) |
| TDP | 450W | 700W | 350W |
| Form factor | 3-slot consumer | SXM5 module | 2-slot PCIe |
| Native FP8 | Yes (Ada 4th gen tensor cores) | Yes (Hopper TMA-aware) | Yes |
| TMA (Tensor Memory Accelerator) | No | Yes | Yes |
The H100’s bandwidth advantage is the headline number (3.3x the 4090), but the more important differences are the NVLink fabric for multi-card scaling, MIG for tenant isolation, and TMA, which accelerates asynchronous memory copies in Hopper-aware kernels. The full spec deep-dive is in the vs H100 post.
H100 cloud rates today
| Provider | SKU | $/hour | $/month always-on | £/month equivalent |
|---|---|---|---|---|
| RunPod Community | H100 80GB | $2.29 | $1,649 | ~£1,300 |
| RunPod Secure Cloud | H100 80GB | $2.99 | $2,153 | ~£1,700 |
| Lambda Labs 1-Click | H100 PCIe | $2.49 | $1,793 | ~£1,420 |
| Vast.ai | H100 80GB | $1.50-3.00 | $1,080-2,160 | ~£855-1,710 |
| AWS p5.48xlarge | 8x H100 SXM | $98.32 | $70,791 | ~£56,100 |
| Azure ND H100 v5 | 1x H100 SXM | $3.65 | $2,628 | ~£2,080 |
| GCP A3 High | 1x H100 SXM | $3.40 | $2,448 | ~£1,940 |
| UK dedicated 4090 | RTX 4090 24GB | n/a | ~$700 | £550-575 |
| UK dedicated 5090 | RTX 5090 32GB | n/a | ~$1,150 | £900 |
| UK dedicated 6000 Pro | RTX 6000 Pro 96GB | n/a | ~$2,800 | £2,200 |
The headline number: RunPod Community H100 at $2.29/hr always-on is roughly $1,649/month – 2.4x the cost of a dedicated 4090. Lambda Labs H100 PCIe is 2.6x. AWS p5.48xlarge for 24/7 is $70k/month, which is only justified at petabyte-scale or sustained training. The RunPod pricing and Lambda Labs posts cover provider-specific gotchas.
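The monthly and sterling columns are simple to reproduce: always-on is taken as 720 hours/month and the £ column assumes roughly $1 ≈ £0.79. A quick sketch with a few of the rates above (the FX figure and hourly rates are the article's assumptions, not live pricing):

```python
# Reproduce the "$/month always-on" and "£/month equivalent" columns above.
GBP_PER_USD = 0.79      # approximate rate used in the table, not a live quote
HOURS_PER_MONTH = 720   # always-on: 24 h x 30 days

hourly_usd = {
    "RunPod Community H100": 2.29,
    "Lambda H100 PCIe": 2.49,
    "Azure ND H100 v5": 3.65,
}

for name, rate in hourly_usd.items():
    monthly_usd = rate * HOURS_PER_MONTH
    print(f"{name}: ${monthly_usd:,.0f}/month (~£{monthly_usd * GBP_PER_USD:,.0f})")
```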
Throughput comparison
| Workload | 4090 t/s | H100 SXM t/s | H100 PCIe t/s | H100 advantage |
|---|---|---|---|---|
| Llama 3.1 8B FP8 batch 1 | 198 | 330 | 285 | 1.67-1.44x |
| Llama 3.1 8B FP8 aggregate batch 32 | 1,100 | 2,200 | 1,900 | 2.0-1.7x |
| Llama 3.1 70B AWQ INT4 batch 1 | 22 | 55 | 48 | 2.5-2.2x |
| Llama 3.1 70B FP8 batch 1 | OOM | ~70 | ~58 | n/a |
| Llama 3.1 70B FP8 concurrency 8 | OOM | ~340 aggr | ~280 aggr | n/a |
| 2x H100 NVLink Llama 70B FP8 | n/a | ~110 t/s decode | n/a | n/a |
| Mixtral 8x22B AWQ batch 1 | OOM | ~28 | ~24 | n/a |
| SDXL 1024×1024 | 3.4s | 2.0s | 2.3s | 1.7-1.5x |
| Llama 8B FP8 t/J (efficiency) | 3.4 | 5.0 | 5.5 | 1.5x (SXM), 1.6x (PCIe) |
The H100 SXM is roughly 1.7x the 4090 on small-model batch 1, 2x on aggregate batch, 2.5x on 70B AWQ INT4, and uniquely capable of running 70B FP8 (or larger with NVLink). The H100 PCIe is meaningfully slower than SXM – 14% on small batch, more on multi-card workloads where NVLink absence matters. For tokens-per-joule the H100 is about 50% more efficient than the 4090 because HBM3 is more energy-efficient per byte transferred than GDDR6X.
When H100 wins decisively
- Llama 70B FP8 native (no INT4 quantisation needed). 80GB HBM3 fits the full model with FP16 KV; 4090 cannot at all. Quality matters when evals are tight.
- Mixtral 8x22B and 100B+ models. 4090 OOMs entirely. H100 fits Mixtral 8x22B AWQ and runs at ~28 t/s; NVLink-bridged H100 pair fits FP8 versions.
- NVLink-bridged 70B FP8 production. 2x H100 NVLink runs Llama 70B FP8 at ~110 t/s decode with linear scaling on aggregate – dual 4090 caps at ~40 t/s with PCIe coordination tax.
- High-concurrency aggregate throughput >5,000 t/s. 3.35 TB/s bandwidth lets H100 scale far beyond the 4090’s ~1,800 t/s ceiling.
- FP8 training with bf16 master weights. 80GB headroom and Hopper TMA make training viable; 4090 lacks the VRAM and the TMA.
- MIG partitioning for tenant isolation. Need to slice into 7 isolated tenants on one card with hard memory and SM boundaries – only Hopper/Ampere offer this.
- Sub-100ms TTFT at very high concurrency. No other card matches HBM3 bandwidth for prefill; 4090 TTFT climbs past 200ms at concurrency >8.
- Burst workloads <6 hours/day. Six hours at $2.29/hour is $13.74/day (~$412/month), cheaper than dedicated. The break-even sits around 250-320 hours/month depending on the hourly rate – see the worked calculation after this list.
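The break-even arithmetic referenced in the burst bullet, using the article's own figures (a dedicated 4090 at roughly $730/month and the hourly rates from the pricing table):

```python
# Break-even hours: where a metered cloud H100 bill matches the flat monthly
# cost of a dedicated 4090 (~£575, roughly $730). Rates from the table above.
DEDICATED_4090_USD_MONTH = 730

for label, usd_per_hour in [
    ("RunPod Community H100", 2.29),
    ("Lambda H100 PCIe", 2.49),
    ("RunPod Secure H100", 2.99),
]:
    hours = DEDICATED_4090_USD_MONTH / usd_per_hour
    print(f"{label}: break-even ~{hours:.0f} h/month (~{hours / 30:.1f} h/day)")
```

Below roughly 8-10 hours of use per day, the hourly meter stays cheaper than flat-rate dedicated; above it, dedicated wins.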
When 4090 wins on TCO
- Llama 8B / 14B FP8 chat APIs always-on. Per-token cost is lower on a dedicated 4090 – £0.039/M tokens vs £0.045-0.052/M tokens on a cloud H100 – and the bill is flat.
- Llama 70B AWQ INT4. Fits in 24GB. The H100 is 2.5x faster but ~3x the price, so cost-per-token is roughly tied (£0.34/M vs £0.28-0.32/M) – the flat, predictable bill tips steady traffic to the 4090.
- SDXL and image generation. 4090 is plenty fast at 3.4s/image; H100 wasted on this workload.
- Always-on workloads >500 hours/month. The H100 hourly meter dominates; dedicated flat pricing wins.
- UK data residency requirements. Most cloud H100 capacity is US-region (RunPod, Lambda, Vast.ai). UK-dedicated 4090 satisfies GDPR data residency without the cross-border transfer questions.
- Predictable monthly billing. One invoice vs hourly meter that surprises Finance every month.
- Single-tenant workloads. No noisy-neighbour problems, no shared SM time, no rate limits.
- Cost-sensitive MVPs and side projects. £575/month is real money but it is committed and predictable.
Cost-per-token math across providers
Assume £575/month for a dedicated 4090 (~$730), and a cloud H100 PCIe at $2.49/hr always-on = $1,793/month. The H100 costs ~2.5x more for roughly 1.7x the aggregate small-model throughput – so cost-per-token lands in the same ballpark on Llama 8B FP8.
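The per-million-token figures in the table below come down to one division: monthly card cost over millions of tokens actually served. A minimal sketch of that arithmetic – the throughput and utilisation inputs here are placeholders you should swap for your own measured numbers, and they drive the result at least as much as the hardware choice does:

```python
# Cost per million tokens = monthly cost / millions of tokens served.
SECONDS_PER_MONTH = 720 * 3600

def gbp_per_million_tokens(monthly_gbp: float, sustained_tps: float,
                           utilisation: float = 1.0) -> float:
    tokens_per_month = sustained_tps * utilisation * SECONDS_PER_MONTH
    return monthly_gbp / (tokens_per_month / 1e6)

# Example inputs only: a £575/month card averaging 1,100 tokens/s served
print(f"{gbp_per_million_tokens(575, 1100):.3f} £/M tokens")
```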
| Workload | 4090 dedicated £/M tokens | H100 PCIe £/M tokens | H100 SXM £/M tokens | Cheaper |
|---|---|---|---|---|
| Llama 8B FP8 24/7 | £0.039 | £0.052 | £0.045 | 4090 |
| Llama 70B AWQ INT4 24/7 | £0.34 | £0.32 | £0.28 | Tied / H100 |
| Llama 70B FP8 24/7 | n/a (OOM) | £0.32 | £0.25 | H100 only |
| Mixtral 8x22B AWQ | n/a (OOM) | £0.85 | £0.65 | H100 only |
| SDXL £/image | £0.0009 | £0.0028 | £0.0026 | 4090 |
| Llama 70B FP8 burst 4hr/day | n/a | £0.21 (only paying 4hr) | £0.18 | H100 burst |
For most always-on workloads the 4090 is the better economics. The H100 is the right answer when you need the model to fit (70B FP8, Mixtral 8x22B, 100B+), when you need MIG, when you need NVLink-bridged throughput, or when you can run burst traffic that fits in <500 hours/month of cloud meter.
Three concrete scenarios
Scenario A: chat SaaS with 5M tokens/day Llama 8B FP8
5M tokens/day = 150M tokens/month. Dedicated 4090 at £0.039/M = £5.85/month of effective compute on a £575/month card (utilisation 1%). RunPod H100 at £0.052/M = £7.80/month effective on a £1,650/month card (utilisation 0.5%). Dedicated 4090 wins by £1,075/month, and the 4090 has 99% headroom for traffic growth.
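The Scenario A numbers, worked through in a few lines (all inputs are the article's own figures):

```python
# Scenario A: 5M tokens/day of Llama 8B FP8, served always-on.
TOKENS_PER_DAY = 5_000_000
million_tokens_per_month = TOKENS_PER_DAY * 30 / 1e6   # 150 M tokens/month

# (flat £/month, £ per million tokens from the cost table above)
options = {
    "Dedicated 4090": (575, 0.039),
    "RunPod H100":    (1650, 0.052),
}

for name, (flat_gbp, gbp_per_m) in options.items():
    effective = million_tokens_per_month * gbp_per_m    # compute actually consumed
    print(f"{name}: £{flat_gbp}/month flat, £{effective:.2f} effective, "
          f"{effective / flat_gbp:.1%} utilisation")
```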
Scenario B: agent backend running Llama 70B FP8
The 4090 cannot run 70B FP8 at all – the quality eval fails on AWQ INT4. The options are 2x 4090 TP=2 at £1,150/month for ~40 t/s decode (still INT4-only, and slow for an agent), 1x RTX 6000 Pro at £2,200/month for ~28 t/s with comfortable VRAM headroom, or a cloud H100 PCIe at ~£1,420/month for ~58 t/s. The cloud H100 wins on £/throughput at modest scale, but if agent traffic grows toward ~2,000 t/s aggregate the dedicated 6000 Pro starts to win on flat pricing over the metered cloud bill.
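A quick £-per-throughput comparison of the three options, using the article's price and decode-rate estimates:

```python
# Scenario B: £ per token/s of 70B decode for each always-on option.
options = [
    ("2x 4090 TP=2 (AWQ INT4 only)", 1150, 40),
    ("Dedicated RTX 6000 Pro 96GB",  2200, 28),
    ("Cloud H100 PCIe",              1420, 58),
]

for name, gbp_per_month, tokens_per_s in options:
    print(f"{name}: £{gbp_per_month / tokens_per_s:.0f} per token/s per month")
```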
Scenario C: research lab with 4hr/day FP8 training bursts
Always-on dedicated 4090 sits idle 20hr/day. Burst 4hr/day on RunPod H100 SXM at $2.99/hr = $358/month, with 80GB HBM3 and FP8 TMA-aware training kernels that the 4090 cannot match. H100 burst wins on capability and cost, and the lab pays only for wall-clock training time. The research lab guide goes deeper on this pattern.
Production gotchas
- Cold-start latency on cloud. RunPod and Lambda spin up in 30s-3min for new pods. Dedicated 4090 is always warm. For latency-sensitive workloads the cold start eats your SLA.
- H100 PCIe vs SXM confusion. Cheaper providers list “H100” without specifying. PCIe is ~15% slower on single-card and lacks the SXM5 NVLink fabric. Always confirm.
- Storage and egress fees. The cloud H100 rate covers the GPU only; storage, network egress, and snapshots are billed separately and add 10-30% to the bill. Dedicated pricing includes these.
- Spot/community instance preemption. RunPod Community can preempt with 10-second notice. Production workloads need Secure Cloud at +30% price.
- Data residency. Most cloud H100 is US/EU-Frankfurt. UK customers with strict residency need dedicated UK or AWS London H100 (rare and expensive).
- NVLink topology. Multi-card cloud H100 sometimes lacks NVLink within the rented slice (only within the SXM baseboard). Verify with `nvidia-smi nvlink -s` after provisioning – see the sketch after this list.
- FP8 kernel readiness. Older vLLM/TensorRT-LLM versions on the cloud image may not exploit Hopper TMA – rebuild on the latest CUDA 12.6+ container to get full H100 throughput.
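One way to check what a freshly provisioned instance actually gives you is the NVML Python bindings (`pip install nvidia-ml-py`). A minimal sketch – treat it as a starting point rather than a provider-specific health check:

```python
# Sanity-check a freshly provisioned instance: card name on GPU 0 and
# whether any NVLink links are actually active (0 on a PCIe card without a bridge).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("GPU 0:", pynvml.nvmlDeviceGetName(handle))

active_links = 0
for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
    try:
        if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
            active_links += 1
    except pynvml.NVMLError:
        break  # no more links on this device
print(f"Active NVLink links: {active_links}")
pynvml.nvmlShutdown()
```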
Verdict and decision criteria
| Need | Best option |
|---|---|
| Always-on Llama 8B FP8 chat | Dedicated 4090 (cost-per-token wins) |
| Always-on Llama 70B AWQ INT4 | Dedicated 4090 (cost-per-token within 10% of H100) |
| Llama 70B FP8 production quality | RunPod H100 if <500hr/month, dedicated 6000 Pro otherwise |
| Mixtral 8x22B / 100B+ models | H100 (cloud or dedicated) – 4090 cannot |
| Research/training bursts <6hr/day | RunPod or Lambda H100 hourly |
| UK data residency required | Dedicated UK 4090, 5090 or 6000 Pro |
| Multi-tenant SaaS with isolation SLA | Cloud H100 with MIG |
| Sub-100ms TTFT at concurrency 32+ | H100 SXM (no alternative) |
| Image generation always-on | Dedicated 4090 (3x cheaper than H100) |
Verdict. The 4090 wins decisively on cost-per-token for any workload that fits in 24GB and runs always-on. The H100 wins decisively on workloads the 4090 cannot run, on burst patterns <500 hours/month, on multi-card NVLink topologies, and on training. AWS p5.48xlarge ($98/hour, $70k/month) is reserved for petabyte-scale training and serious enterprise inference – if you are reading this guide it is almost certainly not the right answer for you.
Production inference at fixed cost
One Ada AD102 in the UK, no hourly meter, no preemption. Dedicated GPU hosting.
Order the RTX 4090 24GB
See also: 4090 vs H100 spec deep-dive, vs RunPod pricing, vs Lambda Labs, vs Together AI pricing, 2x 4090 pairing, when to upgrade, vs A100 80GB, ROI analysis.