The RTX 5090 32GB pairs Blackwell’s GB202 die with GDDR7 and 1.79 TB/s of bandwidth, making it the most aggressive consumer-class GPU NVIDIA has shipped to date. The RTX 4090 24GB is a known quantity with mature tooling, native 4th-gen FP8, and a price point UK teams already budget for. The 5090 wins on raw throughput and VRAM headroom; the 4090 wins on £/throughput and tooling stability. This decision guide pits them head to head with concrete throughput numbers, watts-per-token efficiency, a per-pound metric, and a 10-workload winner table, anchored to dedicated 4090 hosting at GigaGPU. Both cards live in the wider UK GPU range.
Contents
- Spec delta
- Throughput comparison
- VRAM headroom: 24GB vs 32GB
- Per-pound and per-watt performance
- Per-workload winner (10 workloads)
- FP4 native: what Blackwell unlocks
- Production gotchas
- Verdict and when each card wins
Spec delta
| Spec | RTX 4090 24GB | RTX 5090 32GB | Delta |
|---|---|---|---|
| Architecture | Ada AD102 | Blackwell GB202 | 1 generation |
| CUDA cores | 16,384 | 21,760 | +33% |
| Tensor cores | 512 (4th gen) | 680 (5th gen) | +33%, FP4 native |
| VRAM | 24GB GDDR6X | 32GB GDDR7 | +33% |
| Bandwidth | 1,008 GB/s | 1,792 GB/s | +78% |
| TDP | 450W | 575W | +28% |
| FP8 generation | 4th gen | 5th gen | Faster matmul + FP4 |
| FP4 native | No | Yes | New format support |
| NVLink | No | No (PCIe Gen5 x16) | 5090 has Gen5 (2x lane bandwidth) |
| FP16 TFLOPS dense | 165 | ~210 | +27% |
| Approx UK dedicated £/mo | £550 | £900 | +£350/mo, +64% |
Throughput comparison
Sustained vLLM throughput at batch 1 and at typical concurrency. LLM rows are tokens per second (aggregate where marked); the image and audio rows are wall-clock time per job, so lower is better. The 5090’s bigger memory bandwidth does most of the work for inference – LLM decode is bandwidth-bound. The extra 33% of CUDA cores helps batching and prefill, and the 5th-gen tensor cores bring small per-op efficiency gains on FP8. A measurement sketch follows the table.
| Workload | RTX 4090 | RTX 5090 | Speedup |
|---|---|---|---|
| Llama 3.1 8B FP8 batch 1 | 198 | ~280 | 1.41x |
| Llama 3.1 8B FP8 concurrency 8 | ~1,100 aggr | ~1,650 aggr | 1.50x |
| Llama 3.1 8B FP8 concurrency 32 | ~1,800 aggr | ~2,800 aggr | 1.56x |
| Llama 3.1 70B AWQ INT4 batch 1 | 22 | ~36 | 1.64x |
| Llama 3.1 70B AWQ INT4 conc 4 | ~110 aggr | ~180 aggr | 1.64x |
| Llama 3.1 70B FP8 batch 1 | OOM at 24GB | ~30 (32GB tight) | n/a |
| Qwen 2.5 32B FP8 | OOM tight | ~85 | n/a |
| SDXL 1024×1024, 30 steps | 3.4s | 2.1s | 1.62x |
| Flux.1 Dev 1024×1024 | 14s | 8.5s | 1.65x |
| Whisper Large v3, 1hr audio | 22s | 14s | 1.57x |
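These numbers shift with driver, engine, and model revisions, so treat them as indicative and re-measure on your own stack. A minimal sketch of how the batch-1 LLM rows could be reproduced with vLLM’s offline API is below; the model id and prompt are placeholders, and `quantization="fp8"` assumes on-the-fly FP8 weight quantisation rather than a pre-quantised checkpoint.

```python
# Minimal batch-1 decode-throughput probe using vLLM's offline API.
# Assumes a CUDA-visible GPU and access to the (placeholder) model weights.
import time

from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder model id
PROMPT = "Explain how a transistor works, in detail."

llm = LLM(model=MODEL, quantization="fp8", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.0, max_tokens=512)

llm.generate([PROMPT], params)               # warm-up so compilation doesn't skew timing

start = time.perf_counter()
outputs = llm.generate([PROMPT], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} t/s")
```

The concurrency rows come from the same idea with a list of prompts (or the OpenAI-compatible server plus a load generator), taking the aggregate token count divided by wall-clock time.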
VRAM headroom: 24GB vs 32GB
The extra 8GB on the 5090 unlocks specific workloads. Most importantly: Llama 70B FP8 (~38GB nominal but with paged KV and tight quantisation it can squeeze on 32GB at low concurrency) becomes marginally feasible. Qwen 32B in FP8 fits with full KV. Mixtral 8x7B AWQ moves from “tight” on 4090 to “comfortable” on 5090. 128k context windows on 8B become realistic.
| Model | 4090 24GB | 5090 32GB |
|---|---|---|
| Llama 8B FP8 (4k context) | 16GB free for KV | 24GB free for KV (1.5x context) |
| Llama 8B FP8 (128k context) | Tight, FP8 KV needed | Comfortable |
| Llama 70B AWQ INT4 | Fits, FP8 KV needed | Comfortable, FP16 KV OK |
| Llama 70B FP8 | OOM (~38GB) | Tight but possible at conc 1-2 |
| Qwen 32B FP8 | ~32GB – OOM | Fits |
| Mixtral 8x7B AWQ | ~25GB – swap risk | Fits cleanly |
| Flux.1 Dev BF16 | Fits with offload | Fits cleanly |
| Llama 405B AWQ INT4 | OOM | OOM |
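The 8B rows follow from straightforward KV-cache arithmetic. A back-of-envelope sketch, assuming Llama 3.1 8B’s published config (32 layers, 8 KV heads via GQA, head dim 128) and ignoring activation and workspace overhead:

```python
# KV-cache sizing for Llama 3.1 8B (GQA): 32 layers, 8 KV heads, head_dim 128.
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V are each stored once per layer, per KV head, per head_dim element.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_context_tokens(free_vram_gb, bytes_per_elem=2):
    return int(free_vram_gb * 1024**3 // kv_bytes_per_token(bytes_per_elem=bytes_per_elem))

for card, free_gb in (("4090, ~16GB free", 16), ("5090, ~24GB free", 24)):
    fp16 = max_context_tokens(free_gb, bytes_per_elem=2)   # FP16 KV cache
    fp8 = max_context_tokens(free_gb, bytes_per_elem=1)    # FP8 KV cache
    print(f"{card}: ~{fp16:,} tokens at FP16 KV, ~{fp8:,} at FP8 KV")
```

At FP16 the KV cache costs 128KB per token, so a single 128k-token sequence almost exactly fills the 4090’s ~16GB of free VRAM – hence “FP8 KV needed” – while the 5090’s ~24GB leaves room for the same sequence plus batching headroom.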
Per-pound and per-watt performance
Assume £550/month for a dedicated 4090 and ~£900/month for a dedicated 5090. The 5090 gives ~1.5x the inference throughput at ~1.64x the price – so per pound the 4090 still leads on raw t/s/£ for bandwidth-bound workloads. The 5090 wins on per-watt efficiency at sustained throughput: roughly 1.5x the tokens for 1.28x the TDP, helped by GDDR7 being meaningfully more efficient per byte transferred. The per-pound rows below use the concurrency-8 aggregates; the per-watt rows use the batch-1 figures.
| Metric | 4090 | 5090 | Winner |
|---|---|---|---|
| Llama 8B FP8 t/s per £/mo | 2.00 | 1.83 | 4090 |
| Llama 70B INT4 t/s per £/mo | 0.20 | 0.20 | Tied |
| SDXL £/image (24/7 queue) | £0.0009 | £0.0010 | 4090 |
| Llama 8B FP8 t/s per W (TDP) | 0.44 | 0.49 | 5090 |
| Llama 70B INT4 t/s per W | 0.049 | 0.063 | 5090 |
| Workloads that fit on one card | Most 24GB-class | 32GB-class too | 5090 |
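The rows above are simple ratios of the throughput table against price and TDP. A sketch of the arithmetic, using this guide’s own numbers (the £550 and £900 monthly prices are the assumptions stated earlier, not vendor quotes):

```python
# Derive the per-pound and per-watt rows from the throughput table above.
cards = {
    "4090": {"price_gbp_mo": 550, "tdp_w": 450},
    "5090": {"price_gbp_mo": 900, "tdp_w": 575},
}
llama8b_fp8 = {
    "4090": {"batch1_tps": 198, "conc8_tps": 1100},
    "5090": {"batch1_tps": 280, "conc8_tps": 1650},
}

for name, card in cards.items():
    tps = llama8b_fp8[name]
    per_pound = tps["conc8_tps"] / card["price_gbp_mo"]   # aggregate t/s per £/month
    per_watt = tps["batch1_tps"] / card["tdp_w"]          # batch-1 t/s per watt of TDP
    print(f"{name}: {per_pound:.2f} t/s per £/mo, {per_watt:.2f} t/s per W")
```

Note these use nameplate TDP rather than measured wall power; per-watt figures at partial load will differ on both cards.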
Per-workload winner (10 workloads)
| Workload | 4090 wins | 5090 wins | Why |
|---|---|---|---|
| Llama 8B FP8 chat API conc 8 | Yes | No | £/token favours 4090 |
| Llama 70B AWQ INT4 single card | Yes | No | £/token tied; 4090 has the lower monthly cost |
| Llama 70B FP8 single card | No | Yes | Only 5090 fits |
| Qwen 32B FP8 | No | Yes | Only 5090 fits |
| SDXL 24/7 image queue | Yes | No | £/image favours 4090 |
| Flux.1 Dev image gen | Marginal | Yes | 5090 1.65x faster, less offload needed |
| Sub-100ms TTFT chat at high conc | No | Yes | 5090 latency advantage |
| FP4 quantised inference | No | Yes | 4090 cannot do native FP4 |
| 128k context Llama 8B | Tight | Yes | 5090 KV headroom |
| 3-year deployment, future-proof | No | Yes | Blackwell longer support window |
FP4 native: what Blackwell unlocks
The 5090 has hardware support for FP4 (E2M1, the element format behind block-scaled schemes like NVFP4 and MXFP4). For inference, FP4 weights halve memory footprint vs FP8 with quality loss roughly comparable to AWQ INT4 – but with much faster matmul because the tensor cores execute it natively. Llama 70B in FP4 fits in ~22GB of weights, well within the 5090’s 32GB. The 4090 has no native FP4 path – emulation through INT4 kernels exists but loses the architectural advantage.
As of 2026 the FP4 ecosystem is still maturing. vLLM, TensorRT-LLM and SGLang have FP4 paths but with more limited model coverage than FP8. If your roadmap includes adopting FP4 in 2026-2027, the 5090 is the obvious target. If you’re optimising for known-good FP8 today, the 4090 is the safer bet.
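To make the format concrete, here is a toy block-scaled quantiser over the E2M1 value set – an illustration of the idea behind NVFP4/MXFP4-style formats, not the actual Blackwell kernel path or any particular library’s API:

```python
# Toy FP4 (E2M1) block quantiser. Illustrative only -- real NVFP4/MXFP4 use
# hardware scale formats (FP8 / E8M0) and run on the tensor cores directly.
import numpy as np

# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bits).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantise_block_e2m1(weights: np.ndarray, block: int = 32) -> np.ndarray:
    """Round each block of weights to the nearest E2M1 value after per-block scaling."""
    out = np.empty_like(weights, dtype=np.float32)
    for i in range(0, len(weights), block):
        w = weights[i:i + block].astype(np.float32)
        block_max = float(np.abs(w).max())
        scale = block_max / float(E2M1_GRID[-1]) if block_max > 0 else 1.0
        idx = np.abs(np.abs(w / scale)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
        out[i:i + block] = np.sign(w) * E2M1_GRID[idx] * scale
    return out

w = np.random.randn(4096).astype(np.float32) * 0.02
w_q = quantise_block_e2m1(w)
print("mean abs quantisation error:", float(np.abs(w - w_q).mean()))
# Storage: 4 bits per weight plus one scale per block -- roughly half the bytes
# of the same model's FP8 checkpoint.
```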
Production gotchas
- 5090 needs CUDA 12.8+ and recent inference stacks: vLLM 0.6+, TensorRT-LLM 0.13+, SGLang 0.4+. Older container images will not detect sm_120 correctly – a quick pre-flight check is sketched after this list.
- 575W power envelope. 5090 host PSU needs 1000W+ headroom. Many 1U/2U dedicated chassis cannot accommodate it – confirm with hosting provider.
- 5090 supply remains tight in 2026. Capacity at most providers is rationed. Lead times for dedicated 5090 hosting can exceed a week.
- FP4 tooling immaturity. The format is supported but not all model variants ship with FP4 quantised weights yet. Check model availability before betting on FP4 throughput.
- Cooling at 575W in dense racks. Sustained inference loads pull peak TDP. Inadequate airflow throttles to ~480W and loses 15-20% throughput.
- 4090 mature but EOL approaching. NVIDIA driver support continues but new architectural features ship Blackwell-first. Plan for a 2-3 year operational window on Ada.
- PCIe Gen5 advantage rarely matters. Most inference workloads do not saturate PCIe Gen4 x16 (32 GB/s each direction). Gen5’s doubling helps mainly multi-GPU training and very large prefill batches.
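A minimal pre-flight check catches the first two gotchas before anything ships. The thresholds below follow the bullets above, and the power-limit query is plain `nvidia-smi`:

```python
# Pre-flight check for a 5090 host: compute capability, CUDA build, power limit.
import subprocess

import torch

assert torch.cuda.is_available(), "No CUDA device visible"

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}, "
      f"PyTorch built against CUDA {torch.version.cuda}")

if (major, minor) >= (12, 0):  # consumer Blackwell reports sm_120
    cuda_build = tuple(int(x) for x in torch.version.cuda.split("."))
    assert cuda_build >= (12, 8), "Blackwell needs a PyTorch build against CUDA 12.8+"

# Driver-reported power limit -- a 5090 capped well below 575W will throttle under load.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=power.limit", "--format=csv,noheader"],
    text=True).strip())
```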
Verdict and when each card wins
For pure cost-efficiency on workloads that already fit in 24GB (Llama 8B FP8 chat APIs, Qwen 14B, Mistral 7B, Llama 70B AWQ INT4, SDXL), the 4090 stays the better buy in 2026 – it wins on cost per token by roughly 8-10%. For 32GB-class models (Llama 70B FP8 single-card, Qwen 32B FP8, Mixtral 8x7B with real headroom, 128k context windows), FP4 experimentation, sub-100ms TTFT requirements, or a 2-3 year deployment that wants Blackwell’s longer support window, the 5090 justifies the £350/mo premium. If you cannot decide, the 4090 is the safer financial choice today and the 5090 is the better long-term bet. Many teams run both: 4090s for the cost-bound 8B chat fleet and a 5090 or two for the larger-model tier. Order via GigaGPU dedicated hosting.
Proven Ada workhorse
24GB GDDR6X, native 4th-gen FP8, mature tooling, best £/token for 24GB-class workloads. UK dedicated hosting.
Order the RTX 4090 24GB
See also: 4090 vs 5090 spec deep-dive, upgrade path to 5090, when to upgrade, FP8 tensor cores, spec breakdown, tier positioning 2026, Llama 8B benchmark, Llama 70B INT4 benchmark, tokens per watt, or 3090 decision, or 5080 decision, or 5060 Ti decision, multi-card pairing, vs cloud H100, 5090 vs 3090, power draw efficiency, 70B INT4 VRAM.