NVIDIA’s H100 80GB SXM is the datacentre Hopper part: 16,896 CUDA cores, 80 GB of HBM3 at 3.35 TB/s, NVLink at 900 GB/s, 7-way MIG partitioning, and the first-generation Transformer Engine that pioneered FP8 inference. The RTX 4090 24GB is the consumer Ada part with the same FP8 capability but roughly a third of the bandwidth and a fraction of the VRAM. For UK GPU hosting the comparison reveals where datacentre features earn their premium and where the 4090 quietly competes — including the surprising case where the 4090 wins on £/token despite being a much smaller card.
Contents
- Spec sheet side by side
- HBM3 vs GDDR6X — bandwidth physics
- MIG, NVLink and datacentre features
- Throughput across nine workloads
- Per-token economics
- Per-workload winner table
- vLLM serving examples
- Production gotchas with H100
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | H100 80GB SXM (Hopper GH100) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC 4N | Same |
| Transistors | 76.3 billion | 80 billion | +5% |
| SM count | 128 | 132 | +3% |
| CUDA cores | 16,384 | 16,896 | +3% |
| Tensor cores | 512 (4th gen) | 528 (4th gen Hopper) | Equivalent |
| FP16 dense TFLOPS | 165 | ~989 | 6x |
| FP8 TFLOPS | 660 (sparse) | ~1979 | 3x |
| VRAM | 24 GB GDDR6X | 80 GB HBM3 | 3.3x |
| Memory bandwidth | 1008 GB/s | 3.35 TB/s | 3.32x |
| L2 cache | 72 MB | 50 MB | -31% (4090 larger) |
| NVLink | None | NVLink 900 GB/s | Datacentre |
| MIG | None | 7-way partitioning | Multi-tenant |
| TDP | 450W | 700W | +56% |
| Form factor | 3.5-slot consumer | SXM5 module | HGX-only |
| Approx UK price (2026) | £1,300 | £25,000+ | 19x |
The H100 is a different category of accelerator: 3.3x the bandwidth, 3x the FP8 throughput, NVLink for multi-card scaling, MIG for multi-tenancy. It also costs ~19x more and requires an HGX chassis. The H100 80GB PCIe variant exists at lower bandwidth (2 TB/s) and slightly lower price, but the comparison points are similar.
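If you want to confirm what a host actually exposes — useful when renting rather than buying — nvidia-smi reports the headline figures directly. A minimal check, assuming the NVIDIA driver is installed on the host:
# Name, VRAM and board power limit as the driver sees them
nvidia-smi --query-gpu=name,memory.total,power.limit --format=csv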
HBM3 vs GDDR6X — bandwidth physics
Decode-bound LLM inference scales almost linearly with bandwidth at batch 1. The H100’s 3.35 TB/s gives a theoretical FP8 8B decode ceiling around 480 t/s; in practice it sustains ~330 t/s. The 4090 at 1008 GB/s sustains ~198 t/s. The ratio (1.67x) is well below the bandwidth ratio (3.3x), partly because the 4090’s larger 72 MB L2 cache absorbs more weight reads than the H100’s 50 MB L2, and partly because NVIDIA tunes Hopper for batched inference where compute matters more.
At batch 32, the H100 pulls away decisively: ~3300 aggregate t/s vs the 4090’s 1100 — a 3x ratio that tracks bandwidth more closely once L2 hit rates equalise. The H100 is the better choice as concurrency grows; the 4090 is the better choice when you serve a single user at very high speed.
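To sanity-check numbers like these on your own hardware, vLLM ships benchmark scripts in its source tree. A minimal sketch, assuming a vLLM checkout and that the script path and flags match your release (they move between versions):
# Batch-1 decode for Llama 3.1 8B FP8 — approximates the b1 rows below
python benchmarks/benchmark_latency.py \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --input-len 128 --output-len 512 --batch-size 1
# Re-run with --batch-size 32 to approximate the aggregate batch-32 figures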
MIG, NVLink and datacentre features
MIG (Multi-Instance GPU) lets a single H100 present as 7 isolated GPU partitions, each with its own VRAM slice, compute, and address space. For multi-tenant inference (separate customers on a shared card), this is genuinely useful — partition isolation prevents one tenant’s memory leak from killing another’s. The 4090 has no MIG; you isolate via container memory limits and accept the lack of hardware enforcement.
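As a rough illustration, partitioning is driven through nvidia-smi’s MIG commands. A minimal sketch, assuming a single H100 at index 0; profile names vary by driver version, so list them before creating instances:
# Enable MIG mode and carve the card into seven 1g.10gb slices
sudo nvidia-smi -i 0 -mig 1          # may need a GPU reset before it takes effect
sudo nvidia-smi mig -lgip            # list the GPU instance profiles this driver offers
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
nvidia-smi -L                        # each MIG slice now appears with its own UUID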
NVLink at 900 GB/s lets two H100s present as effectively one 160 GB unified accelerator with bandwidth in the multi-TB/s range. For 175B+ models, this is the only sane way to serve at low latency. The 4090 has no NVLink; multi-card goes over PCIe Gen 4 x16 at 28 GB/s, which is fine for tensor-parallel inference of 70B models but a poor fit for training. See multi-card pairing.
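Before relying on NVLink for tensor parallelism, confirm the cards actually reach each other over NVLink rather than PCIe; two quick checks on the host:
# GPU-to-GPU links should show NV* entries, not PIX/PHB (PCIe-only paths)
nvidia-smi topo -m
# Per-link NVLink status and speed
nvidia-smi nvlink -s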
Throughput across nine workloads
| Workload | RTX 4090 | H100 80GB | H100 / 4090 |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | 330 t/s | 1.67x |
| Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 3300 t/s | 3.00x |
| Llama 3.1 70B AWQ b1 | 22-24 t/s | ~80 t/s | 3.40x |
| Llama 3.1 70B FP8 b1 | OOM | ~110 t/s | H100 only |
| Llama 3.1 70B FP8 NVLink pair (2x) | n/a | ~150 t/s | H100 only |
| Qwen 2.5 72B FP8 | OOM | ~62 t/s | H100 only |
| SDXL 1024×1024 | 2.0s | ~1.2s | 1.67x |
| FLUX.1-dev FP8 | 4.1s | ~2.2s | 1.86x |
| QLoRA Llama 8B (steps/s) | 2.6 | ~7.0 | 2.69x |
The H100 is 1.7-3.4x faster on workloads both can run. For workloads only the H100 can run (70B FP8, 72B FP8, big-context multi-tenant), the 4090 is not in the conversation.
Per-token economics
| Metric | RTX 4090 | H100 80GB SXM |
|---|---|---|
| TDP | 450W | 700W |
| Sustained draw, LLM batch 32 | 360W | 580W |
| UK list price (2026) | £1,300 | £25,000+ |
| HGX chassis required | No (4U) | Yes (£40k+ for 8x) |
| Effective £/card landed (8x HGX) | £1,300 | ~£30,000 |
| £/aggregate t/s b32 (Llama 8B) | £1.18 | £9.09 |
| £/decode t/s b1 (Llama 8B) | £6.57 | £90.91 |
| UK cloud rental (typical) | £0.70-1.20/hr | £3.50-5.00/hr |
| Annual electricity @ 24/7 £0.18/kWh | £568 | £915 |
For Llama 8B served at batch 32, the 4090 wins on £/token by roughly 8x. For Llama 70B FP8, the comparison is moot — the 4090 cannot run it. See the vs cloud H100 rental analysis.
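The cost-per-throughput and electricity rows are straight arithmetic from the price, throughput and power figures above; a quick check, taking the table’s numbers as given:
# £ per token/s and annual electricity, derived from the table above
awk 'BEGIN {
  printf "4090 £/t/s (b32): %.2f    H100: %.2f\n", 1300/1100, 30000/3300
  printf "4090 £/t/s (b1) : %.2f    H100: %.2f\n", 1300/198, 30000/330
  printf "4090 electricity/yr: £%.0f    H100: £%.0f\n", 0.360*24*365*0.18, 0.580*24*365*0.18
}'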
Per-workload winner table
| Workload | Winner | Why |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B | 4090 | 8x cheaper, throughput sufficient |
| 12-engineer Qwen 32B AWQ | 4090 | Fits, H100 overkill |
| Llama 70B FP8 production endpoint | H100 | 4090 cannot fit FP8 |
| Multi-tenant 70B endpoint, isolated | H100 | MIG required |
| 500+ concurrent 8B sessions | H100 | Bandwidth and VRAM |
| Production fine-tuning at scale | H100 | NVLink for multi-card |
| FLUX.1-dev hobby studio | 4090 | H100 overkill |
| Llama 405B inference | H100 (cluster) | Single 4090 can’t fit |
| Capex-bounded SaaS under £20k/yr | 4090 | Only option |
| Regulated workloads (audit, ECC) | H100 | HBM3 has ECC, MIG isolation |
vLLM serving examples
# RTX 4090 — Llama 3.1 8B FP8, 32-way batch, 16k context
docker run --rm --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--quantization fp8 --kv-cache-dtype fp8_e4m3 \
--max-model-len 16384 --max-num-seqs 32 \
--gpu-memory-utilization 0.92
# H100 — Llama 3.1 70B FP8 single-card, 64-way batching
docker run --rm --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 --kv-cache-dtype fp8_e4m3 \
--max-model-len 32768 --max-num-seqs 64 \
--gpu-memory-utilization 0.92
# HGX H100 node — Llama 3.1 405B FP8 across 8 NVLinked cards (~405 GB of weights needs 8x 80 GB)
docker run --rm --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 8 \
--quantization fp8 \
--max-model-len 8192 --max-num-seqs 8
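Whichever card serves it, the container exposes the same OpenAI-compatible API, so a smoke test looks identical everywhere (swap in the model name you launched with):
# Query the endpoint once the server reports it is ready
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Say hello in five words."}]}'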
Production gotchas with H100
- SXM-only is HGX-only. You cannot drop an H100 SXM into a PCIe slot. The PCIe variant exists, but at 2 TB/s instead of 3.35 TB/s.
- UK SXM capacity is rationed. Most UK H100 SXM lives in hyperscale clouds; on-prem deployment requires £40k+ chassis and waiting lists.
- NCCL tuning matters. Default NCCL settings work but for top performance you need topology-aware all-reduce configuration. Budget engineering time.
- Cooling: liquid often required. 700W per card x 8 in an HGX is 5.6 kW. Air-cooled chassis exist but expect throttling under sustained load.
- MIG repartitioning is disruptive. Toggling MIG mode needs a GPU reset, and instances can only be destroyed and recreated while the card is idle. Plan tenant slicing in advance.
- Driver pinning is non-optional. H100 production stacks pin specific CUDA + driver + NCCL combinations, and updates need full validation; a version-snapshot sketch follows this list.
- Reservation pricing dominates. Spot/on-demand H100 is expensive; 1-3 year reserved pricing brings hourly cost down 50-70%.
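A minimal sketch of recording the validated stack, assuming you deploy via the vLLM container; the release tag shown is illustrative, so pin whichever version you actually validated:
# Record the host driver the stack was validated against
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# Deploy an explicit image tag, never :latest (example tag shown, not a recommendation)
docker pull vllm/vllm-openai:v0.6.3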
Verdict
- Pick the RTX 4090 24GB if your model fits in 24GB; you serve fewer than ~100 concurrent users; you are price-sensitive; or you need UK-located on-prem hosting.
- Pick the H100 80GB if you need 70B+ at FP8, MIG isolation for multi-tenant SaaS, NVLink for multi-card model parallelism, or thousands of concurrent sessions on smaller models.
- Pick neither if you need 192GB on a single card — go to MI300X 192GB.
For a 200-MAU SaaS RAG, the 4090 is the right answer. For a regional bank running a Llama 70B FP8 audit-grade endpoint with 200+ concurrent sessions and ECC requirements, the H100 is the only credible choice.
Don’t pay datacentre prices for consumer-tier workloads
GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB at a fraction of H100 cost — perfectly sized for the workloads that don’t actually need HBM3.
Order the RTX 4090 24GB
See also: vs A100 80GB, vs MI300X 192GB, vs RTX 6000 Pro 96GB, vs cloud H100, RTX 4090 spec breakdown, FP8 tensor cores on Ada, 2026 tier positioning.