H100 80 GB SXM is the king of the datacenter. RTX 5090 32 GB is the king of consumer Blackwell. For inference workloads, the choice is more nuanced than the spec sheet suggests.
H100 wins on raw FP16 throughput (~5×) and HBM bandwidth (~2×). For 7B-13B inference, the RTX 5090 delivers roughly half the throughput at roughly an eighth of the rental cost, so its cost per token is far better. The H100 wins on training and on 70B+ inference at scale.
Specs
| Spec | RTX 5090 | H100 80 GB SXM5 |
|---|---|---|
| VRAM | 32 GB GDDR7 | 80 GB HBM3 |
| Memory bandwidth | 1,792 GB/s | 3,350 GB/s |
| FP16 TFLOPS | ~210 | ~989 |
| FP8 TFLOPS | ~838 | ~3,958 |
| Monthly (rental) | £399 | POA (~£3,000+) |
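A useful sanity check on these figures: single-stream decode is usually memory-bandwidth-bound, so tokens per second is roughly effective bandwidth divided by the bytes read per token (about the size of the weights). Below is a back-of-envelope sketch under that assumption; the bandwidth figures come from the table, while the 70% efficiency factor is an assumption, not a measured value.

```python
# Back-of-envelope: bandwidth-bound single-stream decode.
# tok/s ~= effective bandwidth / bytes read per token (~ weight size).

EFFICIENCY = 0.7  # assumption: kernels reach ~70% of peak bandwidth

def decode_toks_per_sec(bandwidth_gb_s: float, params_b: float,
                        bytes_per_param: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9 * EFFICIENCY) / weight_bytes

# Mistral 7B at FP8 (1 byte/param), bandwidths from the spec table
for name, bw in [("RTX 5090", 1792), ("H100 SXM", 3350)]:
    print(f"{name}: ~{decode_toks_per_sec(bw, 7, 1):.0f} tok/s single-stream")
```

This predicts roughly 180 and 335 tok/s for a single stream; the much higher numbers in the table below come from batched serving, where each weight read is amortised across many concurrent requests.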
Inference comparison
| Workload | RTX 5090 | H100 | Notes |
|---|---|---|---|
| Mistral 7B FP8 | 1,920 tok/s | ~3,500 tok/s | H100 1.8× faster, 8× cost |
| Llama 3 70B FP8 | doesn't fit (FP8 weights ~70 GB vs 32 GB VRAM) | ~600 tok/s | H100 wins decisively |
| Cost per 1M tokens (7B) | £0.12 | ~£0.50 | 5090 4× cheaper |
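The cost-per-token row falls out of the monthly rental and sustained throughput. A minimal sketch, assuming a 30-day month and ~65% average utilisation (both figures are assumptions, chosen so the output roughly reproduces the table):

```python
# Cost per 1M tokens = monthly rental / millions of tokens generated per month.

SECONDS_PER_MONTH = 30 * 24 * 3600
UTILISATION = 0.65  # assumption: average fraction of peak throughput sustained

def cost_per_1m_tokens(monthly_gbp: float, toks_per_sec: float) -> float:
    tokens_per_month = toks_per_sec * UTILISATION * SECONDS_PER_MONTH
    return monthly_gbp / (tokens_per_month / 1e6)

print(f"RTX 5090: £{cost_per_1m_tokens(399, 1920):.2f} per 1M tokens")   # ~£0.12
print(f"H100:     £{cost_per_1m_tokens(3000, 3500):.2f} per 1M tokens")  # ~£0.51
```

The result is dominated by utilisation: an idle GPU generates no tokens but still costs the full rental, so these figures only hold if you keep the card busy.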
Verdict
For 7B-13B inference, RTX 5090 dominates on cost-per-token. For 70B+ or large-cluster training, H100 is the right card.
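The "doesn't fit" entry above is just arithmetic on weight size versus VRAM. Here is a rough fit check, assuming weights dominate memory use; the 10% headroom for KV cache and runtime is an assumption, and long contexts need considerably more:

```python
# Rough VRAM fit check: weight bytes plus headroom for KV cache and runtime.

HEADROOM = 1.1  # assumption: ~10% extra; long-context serving needs more

def fits(params_b: float, bytes_per_param: float, vram_gb: float) -> bool:
    return params_b * bytes_per_param * HEADROOM <= vram_gb

print(fits(7, 1, 32))    # True:  Mistral 7B FP8 needs ~7.7 GB
print(fits(70, 1, 32))   # False: Llama 3 70B FP8 needs ~77 GB
print(fits(70, 1, 80))   # True:  fits one H100, with little KV headroom
```

Treat this as a floor, not a guarantee: at FP8 a 70B model leaves an 80 GB H100 only a few gigabytes for KV cache, which caps batch size and context length.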
Bottom line
Match the GPU to the workload size: an H100 is overkill for an 8B chatbot. See RTX 5090 hosting.