
RTX 4090 24GB vs H100 80GB SXM: Consumer FP8 vs Datacentre FP8

How the consumer RTX 4090 stacks up against NVIDIA's datacentre H100 — Hopper FP8 throughput, HBM3 bandwidth, NVLink, MIG support, and per-workload economics.

NVIDIA’s H100 80GB SXM is the datacentre Hopper part: 14,592 CUDA cores, 80 GB of HBM3 at 3.35 TB/s, NVLink at 900 GB/s, MIG 7-way partitioning, and the first-generation Transformer Engine that pioneered FP8 inference. The RTX 4090 24GB is the consumer Ada part with the same FP8 capability but a third of the bandwidth and a fraction of the VRAM. For UK GPU hosting the comparison reveals where datacentre features earn their premium and where the 4090 quietly competes — including the surprising case where the 4090 wins on £/token despite being a much smaller card.


Spec sheet side by side

Spec | RTX 4090 (Ada AD102) | H100 80GB SXM (Hopper GH100) | Delta
Process | TSMC 4N | TSMC 4N | Same
Transistors | 76.3 billion | 80 billion | +5%
SM count | 128 | 132 | +3%
CUDA cores | 16,384 | 14,592 (FP32-dedicated) | -11%
Tensor cores | 512 (4th gen) | 528 (4th gen Hopper) | Equivalent
FP16 dense TFLOPS | 165 | ~989 | 6x
FP8 dense TFLOPS | 660 | ~1979 | 3x
VRAM | 24 GB GDDR6X | 80 GB HBM3 | 3.3x
Memory bandwidth | 1008 GB/s | 3.35 TB/s | 3.3x
L2 cache | 72 MB | 50 MB | Ada wins
NVLink | None | NVLink 900 GB/s | Datacentre
MIG | None | 7-way partitioning | Multi-tenant
TDP | 450W | 700W | +56%
Form factor | 3.5-slot consumer | SXM5 module | HGX-only
Approx UK price (2026) | £1,300 | £25,000+ | ~19x

The H100 is a different category of accelerator: 3.3x the bandwidth, 3x the FP8 throughput, NVLink for multi-card scaling, MIG for multi-tenancy. It also costs ~19x more and requires an HGX chassis. The H100 80GB PCIe variant exists at lower bandwidth (2 TB/s) and slightly lower price, but the comparison points are similar.

HBM3 vs GDDR6X — bandwidth physics

Decode-bound LLM inference scales almost linearly with memory bandwidth at batch 1. The H100’s 3.35 TB/s gives a theoretical FP8 8B decode ceiling of around 480 t/s; in practice it sustains ~330 t/s. The 4090 at 1008 GB/s sustains ~198 t/s. The ratio (1.67x) is well below the bandwidth ratio (3.3x), partly because the 4090’s larger 72 MB L2 cache absorbs more weight reads than the H100’s 50 MB L2, and partly because NVIDIA tunes Hopper for batched inference, where compute matters more.
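The gap between raw bandwidth and observed batch-1 speedup can be checked directly from the figures above:

```shell
# Raw bandwidth ratio: HBM3 (3350 GB/s) vs GDDR6X (1008 GB/s)
awk 'BEGIN { printf "bandwidth ratio: %.2fx\n", 3350/1008 }'
# Observed batch-1 decode speedup (Llama 8B FP8, sustained t/s)
awk 'BEGIN { printf "b1 speedup:      %.2fx\n", 330/198 }'
```

3.32x of bandwidth buys only 1.67x of single-stream decode — the headroom the H100 banks for batched serving.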

At batch 32, the H100 pulls away decisively: ~3300 aggregate t/s vs the 4090’s 1100 — a 3x ratio that tracks bandwidth more closely once L2 hit rates equalise. The H100 is the better choice as concurrency grows; the 4090 is the better choice when you serve a single user at very high speed.

MIG, NVLink and datacentre features

MIG (Multi-Instance GPU) lets a single H100 present as 7 isolated GPU partitions, each with its own VRAM slice, compute, and address space. For multi-tenant inference (separate customers on a shared card), this is genuinely useful — partition isolation prevents one tenant’s memory leak from killing another’s. The 4090 has no MIG; you isolate via container memory limits and accept the lack of hardware enforcement.
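A minimal provisioning sketch for the MIG workflow, assuming an H100 in an HGX host with admin access; the profile mix here is illustrative (valid combinations vary by driver version):

```shell
# Enable MIG mode on GPU 0 (takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# Carve the card into one 3g.40gb and three 1g.10gb GPU instances,
# then create the matching compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 3g.40gb,1g.10gb,1g.10gb,1g.10gb -C

# List the resulting partitions; each MIG device gets its own UUID
nvidia-smi -L
```

Containers are then pinned to a slice via its MIG UUID (e.g. `NVIDIA_VISIBLE_DEVICES=MIG-<uuid>`), giving each tenant a hardware-enforced VRAM and compute budget.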

NVLink at 900 GB/s lets two H100s present as effectively one 160 GB unified accelerator with bandwidth in the multi-TB/s range. For 175B+ models, this is the only sane way to serve at low latency. The 4090 has no NVLink; multi-card goes over PCIe Gen 4 x16 at 28 GB/s, which is fine for tensor-parallel inference of 70B models but a poor fit for training. See multi-card pairing.

Throughput across nine workloads

Workload | RTX 4090 | H100 80GB | H100 / 4090
Llama 3.1 8B FP8 decode b1 | 198 t/s | 330 t/s | 1.67x
Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 3300 t/s | 3.00x
Llama 3.1 70B AWQ b1 | 22-24 t/s | ~80 t/s | 3.40x
Llama 3.1 70B FP8 b1 | OOM | ~110 t/s | H100 only
Llama 3.1 70B FP8 NVLink pair (2x) | n/a | ~150 t/s | H100 only
Qwen 2.5 72B FP8 | OOM | ~62 t/s | H100 only
SDXL 1024×1024 | 2.0s | ~1.2s | 1.67x
FLUX.1-dev FP8 | 4.1s | ~2.2s | 1.86x
QLoRA Llama 8B (steps/s) | 2.6 | ~7.0 | 2.69x

The H100 is 1.7-3.4x faster on workloads both can run. For workloads only the H100 can run (70B FP8, 72B FP8, big-context multi-tenant), the 4090 is not in the conversation.
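The OOM rows follow from simple arithmetic: FP8 stores roughly one byte per parameter, so a 70B model's weights alone exceed the 4090's VRAM before the KV cache is even considered. A back-of-envelope check (the headroom figure is an estimate, not a measurement):

```shell
# FP8 ≈ 1 byte/param → a 70B model needs ~70 GB for weights alone
PARAMS_B=70
WEIGHTS_GB=$PARAMS_B
echo "70B FP8 weights: ~${WEIGHTS_GB} GB"
echo "RTX 4090 (24 GB): OOM before the weights finish loading"
echo "H100 (80 GB): ~$((80 - WEIGHTS_GB)) GB left for KV cache and activations"
```

The same arithmetic explains why the 4090 needs 4-bit AWQ (~35 GB → still OOM as a single card, hence the offload-assisted 22-24 t/s) while the H100 runs 70B FP8 natively.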

Per-token economics

Metric | RTX 4090 | H100 80GB SXM
TDP | 450W | 700W
Sustained draw, LLM b32 | 360W | 580W
UK list price (2026) | £1,300 | £25,000+
HGX chassis required | No (4U) | Yes (£40k+ for 8x)
Effective £/card landed (8x HGX) | £1,300 | ~£30,000
£/aggregate t/s b32 (Llama 8B) | £1.18 | £9.09
£/decode t/s b1 (Llama 8B) | £6.57 | £90.91
UK cloud rental (typical) | £0.70-1.20/hr | £3.50-5.00/hr
Annual electricity (24/7 @ £0.18/kWh) | £568 | £915

For Llama 8B, the 4090 wins on £/token by a factor of 8x. For Llama 70B FP8, the comparison is moot — the 4090 cannot run it. See the vs cloud H100 rental analysis.
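The £/throughput rows come straight from dividing landed card cost by aggregate tokens per second; reproducing them from the table's own figures:

```shell
# £ per aggregate t/s at batch 32, Llama 8B (landed cost / throughput)
awk 'BEGIN { printf "RTX 4090: £%.2f per t/s\n",  1300/1100 }'
awk 'BEGIN { printf "H100:     £%.2f per t/s\n", 30000/3300 }'
```

That yields £1.18 vs £9.09 — the ~8x capex gap that survives even after the H100's 3x throughput advantage is priced in.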

Per-workload winner table

Workload | Winner | Why
200-MAU SaaS RAG on Llama 8B | 4090 | 8x cheaper, throughput sufficient
12-engineer Qwen 32B AWQ | 4090 | Fits, H100 overkill
Llama 70B FP8 production endpoint | H100 | 4090 cannot fit FP8
Multi-tenant 70B endpoint, isolated | H100 | MIG required
500+ concurrent 8B sessions | H100 | Bandwidth and VRAM
Production fine-tuning at scale | H100 | NVLink for multi-card
FLUX.1-dev hobby studio | 4090 | H100 overkill
Llama 405B inference | H100 (cluster) | Single 4090 can't fit
Capex-bounded SaaS under £20k/yr | 4090 | Only option
Regulated workloads (audit, ECC) | H100 | HBM3 ECC, MIG isolation

vLLM serving examples

# RTX 4090 — Llama 3.1 8B FP8, 32-way batch, 16k context
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 32 \
  --gpu-memory-utilization 0.92
# H100 — Llama 70B FP8 single-card, 64-way batching
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92
# H100 NVLink pair — Llama 70B FP8, tensor-parallel across 2 cards
# (405B FP8 needs ~405 GB of weights — a cluster job, not a pair)
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --tensor-parallel-size 2 \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 --max-num-seqs 128

Production gotchas with H100

  • SXM-only is HGX-only. You cannot drop an H100 SXM into a PCIe slot. The PCIe variant exists, but at 2 TB/s instead of 3.35 TB/s.
  • UK SXM capacity is rationed. Most UK H100 SXM lives in hyperscale clouds; on-prem deployment requires £40k+ chassis and waiting lists.
  • NCCL tuning matters. Default NCCL settings work but for top performance you need topology-aware all-reduce configuration. Budget engineering time.
  • Cooling: liquid often required. 700W per card x 8 in an HGX is 5.6 kW. Air-cooled chassis exist but expect throttling under sustained load.
  • MIG repartitioning means downtime. Enabling or disabling MIG mode requires a GPU reset, and changing the partition layout means draining every tenant first. Plan tenant slicing in advance.
  • Driver pinning is non-optional. H100 production stacks pin specific CUDA + driver + NCCL combinations. Updates need full validation.
  • Reservation pricing dominates. Spot/on-demand H100 is expensive; 1-3 year reserved pricing brings hourly cost down 50-70%.
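For the NCCL and topology points above, a starting point looks like this; the values are illustrative knobs to validate per deployment, not a tuned production config:

```shell
# Inspect the NVLink/PCIe topology before tuning anything
nvidia-smi topo -m

# Common NCCL knobs to pin once validated (illustrative values)
export NCCL_DEBUG=INFO       # log transport/algorithm selection at init
export NCCL_P2P_LEVEL=NVL    # prefer NVLink for peer-to-peer transfers
export NCCL_IB_DISABLE=0     # keep InfiniBand enabled on multi-node HGX
```

Whatever combination passes validation should then be pinned alongside the CUDA and driver versions, per the driver-pinning point above.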

Verdict

  • Pick the RTX 4090 24GB if your model fits in 24GB; you serve fewer than ~100 concurrent users; you are price-sensitive; or you need UK-located on-prem hosting.
  • Pick the H100 80GB if you need 70B+ at FP8, MIG isolation for multi-tenant SaaS, NVLink for multi-card model parallelism, or thousands of concurrent sessions on smaller models.
  • Pick neither if you need 192GB on a single card — go to MI300X 192GB.

For a 200-MAU SaaS RAG, the 4090 is the right answer. For a regional bank running a Llama 70B FP8 audit-grade endpoint with 200+ concurrent sessions and ECC requirements, the H100 is the only credible choice.

Don’t pay datacentre prices for consumer-tier workloads

GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB at a fraction of H100 cost — perfectly sized for the workloads that don’t actually need HBM3.

Order the RTX 4090 24GB

See also: vs A100 80GB, vs MI300X 192GB, vs RTX 6000 Pro 96GB, vs cloud H100, RTX 4090 spec breakdown, FP8 tensor cores on Ada, 2026 tier positioning.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
