NVIDIA’s H100 80GB SXM is the datacentre Hopper part: 16,896 CUDA cores, 80 GB of HBM3 at 3.35 TB/s, NVLink at 900 GB/s, 7-way MIG partitioning, and the first-generation Transformer Engine that pioneered FP8 inference. The RTX 4090 24GB is the consumer Ada part with the same FP8 capability but roughly a third of the bandwidth and a fraction of the VRAM. For UK GPU hosting the comparison reveals where datacentre features earn their premium and where the 4090 quietly competes — including the surprising case where the 4090 wins on £/token despite being a much smaller card.
Contents
- Spec sheet side by side
- HBM3 vs GDDR6X — bandwidth physics
- MIG, NVLink and datacentre features
- Throughput across nine workloads
- Per-token economics
- Per-workload winner table
- vLLM serving examples
- Production gotchas with H100
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | H100 80GB SXM (Hopper GH100) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC 4N | Same |
| Transistors | 76.3 billion | 80 billion | +5% |
| SM count | 128 | 132 | +3% |
| CUDA cores | 16,384 | 16,896 | +3% |
| Tensor cores | 512 (4th gen) | 528 (4th gen Hopper) | Equivalent |
| FP16 dense TFLOPS | 165 | ~989 | 6x |
| FP8 TFLOPS | 660 (sparse) | ~1979 | 3x |
| VRAM | 24 GB GDDR6X | 80 GB HBM3 | 3.3x |
| Memory bandwidth | 1008 GB/s | 3.35 TB/s | 3.32x |
| L2 cache | 72 MB | 50 MB | -31% (4090 larger) |
| NVLink | None | NVLink 900 GB/s | Datacentre |
| MIG | None | 7-way partitioning | Multi-tenant |
| TDP | 450W | 700W | +56% |
| Form factor | 3.5-slot consumer | SXM5 module | HGX-only |
| Approx UK price (2026) | £1,300 | £25,000+ | 19x |
The H100 is a different category of accelerator: 3.3x the bandwidth, 3x the FP8 throughput, NVLink for multi-card scaling, MIG for multi-tenancy. It also costs ~19x more and requires an HGX chassis. The H100 80GB PCIe variant exists at lower bandwidth (2 TB/s) and slightly lower price, but the comparison points are similar.
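If you want to confirm what a host actually exposes — useful when renting rather than buying — nvidia-smi reports the headline figures directly. A minimal check, assuming the NVIDIA driver is installed on the host:
# Name, VRAM and board power limit as the driver sees them
nvidia-smi --query-gpu=name,memory.total,power.limit --format=csv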
HBM3 vs GDDR6X — bandwidth physics
Decode-bound LLM inference scales almost linearly with bandwidth at batch 1. The H100’s 3.35 TB/s gives a theoretical FP8 8B decode ceiling around 480 t/s; in practice it sustains ~330 t/s. The 4090 at 1008 GB/s sustains ~198 t/s. The ratio (1.67x) is well below the bandwidth ratio (3.3x), partly because the 4090’s larger 72 MB L2 cache absorbs more weight reads than the H100’s 50 MB L2, and partly because NVIDIA tunes Hopper for batched inference where compute matters more.
At batch 32, the H100 pulls away decisively: ~3300 aggregate t/s vs the 4090’s 1100 — a 3x ratio that tracks bandwidth more closely once L2 hit rates equalise. The H100 is the better choice as concurrency grows; the 4090 is the better choice when you serve a single user at very high speed.
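To sanity-check numbers like these on your own hardware, vLLM ships benchmark scripts in its source tree. A minimal sketch, assuming a vLLM checkout and that the script path and flags match your release (they move between versions):
# Batch-1 decode for Llama 3.1 8B FP8 — approximates the b1 rows below
python benchmarks/benchmark_latency.py \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --input-len 128 --output-len 512 --batch-size 1
# Re-run with --batch-size 32 to approximate the aggregate batch-32 figures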
MIG, NVLink and datacentre features
MIG (Multi-Instance GPU) lets a single H100 present as 7 isolated GPU partitions, each with its own VRAM slice, compute, and address space. For multi-tenant inference (separate customers on a shared card), this is genuinely useful — partition isolation prevents one tenant’s memory leak from killing another’s. The 4090 has no MIG; you isolate via container memory limits and accept the lack of hardware enforcement.
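As a rough illustration, partitioning is driven through nvidia-smi’s MIG commands. A minimal sketch, assuming a single H100 at index 0; profile names vary by driver version, so list them before creating instances:
# Enable MIG mode and carve the card into seven 1g.10gb slices
sudo nvidia-smi -i 0 -mig 1          # may need a GPU reset before it takes effect
sudo nvidia-smi mig -lgip            # list the GPU instance profiles this driver offers
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
nvidia-smi -L                        # each MIG slice now appears with its own UUID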
NVLink at 900 GB/s lets two H100s present as effectively one 160 GB unified accelerator with bandwidth in the multi-TB/s range. For 175B+ models, this is the only sane way to serve at low latency. The 4090 has no NVLink; multi-card goes over PCIe Gen 4 x16 at 28 GB/s, which is fine for tensor-parallel inference of 70B models but a poor fit for training. See multi-card pairing.
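Before relying on NVLink for tensor parallelism, confirm the cards actually reach each other over NVLink rather than PCIe; two quick checks on the host:
# GPU-to-GPU links should show NV* entries, not PIX/PHB (PCIe-only paths)
nvidia-smi topo -m
# Per-link NVLink status and speed
nvidia-smi nvlink -s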
Throughput across nine workloads
| Workload | RTX 4090 | H100 80GB | H100 / 4090 |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | 330 t/s | 1.67x |
| Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 3300 t/s | 3.00x |
| Llama 3.1 70B AWQ b1 | 22-24 t/s | ~80 t/s | 3.40x |
| Llama 3.1 70B FP8 b1 | OOM | ~110 t/s | H100 only |
| Llama 3.1 70B FP8 NVLink pair (2x) | n/a | ~150 t/s | H100 only |
| Qwen 2.5 72B FP8 | OOM | ~62 t/s | H100 only |
| SDXL 1024×1024 | 2.0s | ~1.2s | 1.67x |
| FLUX.1-dev FP8 | 4.1s | ~2.2s | 1.86x |
| QLoRA Llama 8B (steps/s) | 2.6 | ~7.0 | 2.69x |
The H100 is 1.7-3.4x faster on workloads both can run. For workloads only the H100 can run (70B FP8, 72B FP8, big-context multi-tenant), the 4090 is not in the conversation.
Per-token economics
| Metric | RTX 4090 | H100 80GB SXM |
|---|---|---|
| TDP | 450W | 700W |
| Sustained draw, LLM batch 32 | 360W | 580W |
| UK list price (2026) | £1,300 | £25,000+ |
| HGX chassis required | No (4U) | Yes (£40k+ for 8x) |
| Effective £/card landed (8x HGX) | £1,300 | ~£30,000 |
| £/aggregate t/s b32 (Llama 8B) | £1.18 | £9.09 |
| £/decode t/s b1 (Llama 8B) | £6.57 | £90.91 |
| UK cloud rental (typical) | £0.70-1.20/hr | £3.50-5.00/hr |
| Annual electricity @ 24/7 £0.18/kWh | £568 | £915 |
For Llama 8B served at batch 32, the 4090 wins on £/token by roughly 8x. For Llama 70B FP8, the comparison is moot — the 4090 cannot run it. See the vs cloud H100 rental analysis.
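The cost-per-throughput and electricity rows are straight arithmetic from the price, throughput and power figures above; a quick check, taking the table’s numbers as given:
# £ per token/s and annual electricity, derived from the table above
awk 'BEGIN {
  printf "4090 £/t/s (b32): %.2f    H100: %.2f\n", 1300/1100, 30000/3300
  printf "4090 £/t/s (b1) : %.2f    H100: %.2f\n", 1300/198, 30000/330
  printf "4090 electricity/yr: £%.0f    H100: £%.0f\n", 0.360*24*365*0.18, 0.580*24*365*0.18
}'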
Per-workload winner table
| Workload | Winner | Why |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B | 4090 | 8x cheaper, throughput sufficient |
| 12-engineer Qwen 32B AWQ | 4090 | Fits, H100 overkill |
| Llama 70B FP8 production endpoint | H100 | 4090 cannot fit FP8 |
| Multi-tenant 70B endpoint, isolated | H100 | MIG required |
| 500+ concurrent 8B sessions | H100 | Bandwidth and VRAM |
| Production fine-tuning at scale | H100 | NVLink for multi-card |
| FLUX.1-dev hobby studio | 4090 | H100 overkill |
| Llama 405B inference | H100 (cluster) | Single 4090 can’t fit |
| Capex-bounded SaaS under £20k/yr | 4090 | Only option |
| Regulated workloads (audit, ECC) | H100 | HBM3 has ECC, MIG isolation |
vLLM serving examples
# RTX 4090 — Llama 3.1 8B FP8, 32-way batch, 16k context
docker run --rm --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--quantization fp8 --kv-cache-dtype fp8_e4m3 \
--max-model-len 16384 --max-num-seqs 32 \
--gpu-memory-utilization 0.92
# H100 — Llama 3.1 70B FP8 single-card, 64-way batching
docker run --rm --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 --kv-cache-dtype fp8_e4m3 \
--max-model-len 32768 --max-num-seqs 64 \
--gpu-memory-utilization 0.92
# HGX H100 node — Llama 3.1 405B FP8 across 8 NVLinked cards (~405 GB of weights needs 8x 80 GB)
docker run --rm --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 8 \
--quantization fp8 \
--max-model-len 8192 --max-num-seqs 8
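Whichever card serves it, the container exposes the same OpenAI-compatible API, so a smoke test looks identical everywhere (swap in the model name you launched with):
# Query the endpoint once the server reports it is ready
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Say hello in five words."}]}'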
Production gotchas with H100
- SXM-only is HGX-only. You cannot drop an H100 SXM into a PCIe slot. The PCIe variant exists, but at 2 TB/s instead of 3.35 TB/s.
- UK SXM capacity is rationed. Most UK H100 SXM lives in hyperscale clouds; on-prem deployment requires £40k+ chassis and waiting lists.
- NCCL tuning matters. Default NCCL settings work but for top performance you need topology-aware all-reduce configuration. Budget engineering time.
- Cooling: liquid often required. 700W per card x 8 in an HGX is 5.6 kW. Air-cooled chassis exist but expect throttling under sustained load.
- MIG repartitioning is disruptive. Toggling MIG mode needs a GPU reset, and instances can only be destroyed and recreated while the card is idle. Plan tenant slicing in advance.
- Driver pinning is non-optional. H100 production stacks pin specific CUDA + driver + NCCL combinations, and updates need full validation; a version-snapshot sketch follows this list.
- Reservation pricing dominates. Spot/on-demand H100 is expensive; 1-3 year reserved pricing brings hourly cost down 50-70%.
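A minimal sketch of recording the validated stack, assuming you deploy via the vLLM container; the release tag shown is illustrative, so pin whichever version you actually validated:
# Record the host driver the stack was validated against
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# Deploy an explicit image tag, never :latest (example tag shown, not a recommendation)
docker pull vllm/vllm-openai:v0.6.3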
Verdict
- Pick the RTX 4090 24GB if your model fits in 24GB; you serve fewer than ~100 concurrent users; you are price-sensitive; or you need UK-located on-prem hosting.
- Pick the H100 80GB if you need 70B+ at FP8, MIG isolation for multi-tenant SaaS, NVLink for multi-card model parallelism, or thousands of concurrent sessions on smaller models.
- Pick neither if you need 192GB on a single card — go to MI300X 192GB.
For a 200-MAU SaaS RAG, the 4090 is the right answer. For a regional bank running a Llama 70B FP8 audit-grade endpoint with 200+ concurrent sessions and ECC requirements, the H100 is the only credible choice.
Don’t pay datacentre prices for consumer-tier workloads
GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB at a fraction of H100 cost — perfectly sized for the workloads that don’t actually need HBM3.
Order the RTX 4090 24GB
See also: vs A100 80GB, vs MI300X 192GB, vs RTX 6000 Pro 96GB, vs cloud H100, RTX 4090 spec breakdown, FP8 tensor cores on Ada, 2026 tier positioning.