The RTX 5090 is Blackwell’s consumer flagship and the natural successor to the RTX 4090. If you are choosing between the two for AI inference on UK GPU hosting, you are weighing 33% more VRAM, 78% more memory bandwidth, native FP4, 5th-generation tensor cores and PCIe Gen 5 against a more proven, lower-power Ada platform with two extra years of toolchain maturity. The RTX 4090 24GB remains the value pick for many production workloads — this post walks through exactly which workloads pay the 5090 premium back, which do not, and where 32GB unlocks behaviour the 4090 simply cannot match without sharding.
Contents
- Spec sheet side by side
- 5th-gen vs 4th-gen tensor cores and FP4
- Bandwidth, GDDR7 and the cache hierarchy
- Throughput uplift across nine workloads
- When 32GB unlocks new workloads
- Power, economics and tokens-per-watt
- Per-workload winner table
- vLLM serving on each card
- Production gotchas with the 5090
- Which to pick
Spec sheet side by side
Blackwell is not a clean rebuild of Ada in the way Ada was of Ampere. NVIDIA reused the basic SM block diagram and concentrated its silicon budget on the tensor cores, the memory subsystem and a more aggressive Transformer Engine. The result is a card that scales primarily with bandwidth and tensor-core count rather than CUDA-core count.
| Spec | RTX 4090 (Ada AD102) | RTX 5090 (Blackwell GB202) | Delta |
|---|---|---|---|
| Process node | TSMC 4N | TSMC 4NP | Refined node |
| SM count | 128 | 170 | +33% |
| CUDA cores | 16,384 | 21,760 | +33% |
| Tensor cores | 512 (4th gen) | 680 (5th gen) | +33%, FP4 added |
| RT cores | 128 (3rd gen) | 170 (4th gen) | +33% |
| Boost clock | 2.52 GHz | 2.41 GHz | -4% (more SMs) |
| VRAM | 24 GB GDDR6X (21 Gbps) | 32 GB GDDR7 (28 Gbps) | +33% capacity |
| Memory bandwidth | 1008 GB/s | 1792 GB/s | +78% |
| Memory bus | 384-bit | 512-bit | +33% |
| L2 cache | 72 MB | ~96 MB | +33% |
| FP16 dense TFLOPS | 165 | ~209 | +27% |
| FP8 dense TFLOPS | 660 | ~838 | +27% |
| FP4 dense TFLOPS | None | ~1676 | New format |
| TDP | 450W | 575W | +28% |
| PCIe | Gen 4 x16 | Gen 5 x16 | 2x effective |
| NVENC / NVDEC | 2x 8th + 1x 5th | 3x 9th + 2x 6th | More streams |
Bandwidth is the headline. Going from 21 Gbps GDDR6X on a 384-bit bus to 28 Gbps GDDR7 on a 512-bit bus pushes the 5090 to 1.79 TB/s — close to A100 PCIe territory and almost double the 4090. For decode-bound LLM inference where the bottleneck is reading weights from VRAM into the tensor cores, that uplift translates directly into single-stream tokens per second. Compare against the RTX 4090 spec breakdown for full context.
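On a rented node it is worth confirming you actually got the card you ordered before benchmarking. A minimal check using standard nvidia-smi query fields (exact MiB totals vary slightly by VBIOS):

```bash
# Sanity-check a freshly provisioned host: card, VRAM, power limit, PCIe gen.
nvidia-smi --query-gpu=name,memory.total,power.limit,pcie.link.gen.max \
  --format=csv
# Expect roughly 24 GB / 450 W / Gen 4 on a 4090, 32 GB / 575 W / Gen 5 on a 5090.
```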
5th-gen vs 4th-gen tensor cores and FP4
4th-gen tensor cores on Ada introduced FP8 (E4M3 and E5M2). 5th-gen on Blackwell keeps both and adds FP4 (E2M1 and a microscaling MX-FP4 variant), doubling the theoretical throughput once again for any operator that tolerates 4-bit precision. The catch is that FP4 inference quality is sensitive to calibration: Llama 3 8B in MX-FP4 holds within 0.5 points of FP8 on MMLU, but Qwen 2.5 32B drops 1.5-2.0 points. For coding workloads where exactness matters (Qwen Coder, DeepSeek Coder), FP8 remains the safer default even on Blackwell. NVIDIA's Transformer Engine on Blackwell (which needs CUDA 12.8+, the first release with Blackwell support) can silently mix FP8 and FP4 per layer, so validate against your eval suite rather than trusting the default.
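One way to run that eval pass, sketched under assumptions: the model is served at localhost:8000 by a vLLM OpenAI-compatible endpoint like the ones shown later in this post, and the lm-evaluation-harness is installed. Score each quantization separately and diff:

```bash
# Hedged sketch: score a running endpoint on MMLU with lm-evaluation-harness.
pip install lm-eval
lm_eval --model local-completions \
  --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=8 \
  --tasks mmlu
# Run once against the FP8 deployment, once against FP4, then compare scores.
```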
Operationally, FP8 on Ada and FP8 on Blackwell are numerically equivalent for the same model: the 5090 is faster mostly because it has more tensor cores and far more bandwidth, not because Blackwell does FP8 differently. The FP4 path is the genuinely new capability, useful chiefly for the largest models where 8-bit weights cannot fit. See FP8 tensor cores on Ada for the architectural counterpart.
Bandwidth, GDDR7 and the cache hierarchy
Single-stream LLM decode is bandwidth-bound: at batch 1, almost every generated token requires streaming the full model weights out of VRAM. On Llama 3 8B FP8 the 4090's 1008 GB/s sustains ~198 t/s and the 5090's 1792 GB/s sustains ~280 t/s, a 1.41x uplift. That falls short of the raw 1.78x bandwidth ratio because decode is not purely weight-streaming: KV-cache reads, kernel overheads and compute-bound layers absorb part of the headline bandwidth advantage. On batched inference where compute matters more, the gap closes further: at batch 32, aggregate throughput is roughly 1100 t/s on the 4090 vs 1500 t/s on the 5090 (1.36x), and on prefill-heavy RAG it can drop to 1.25x.
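A back-of-envelope roofline makes the bandwidth argument concrete: at batch 1, decode speed is bounded above by bandwidth divided by the weight bytes streamed per token. Using the 70B AWQ row from the table in the next section (roughly 35 GB of weights), the bound sits sensibly above the measured figures:

```bash
# Roofline bound: tokens/s <= memory bandwidth / weight bytes read per token.
# Llama 70B AWQ INT4 weights ~ 35 GB; real decode lands 20-35% under the bound.
echo "4090 bound: $(echo "scale=1; 1008/35" | bc) t/s  (measured 22-24)"
echo "5090 bound: $(echo "scale=1; 1792/35" | bc) t/s  (measured 34-37)"
```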
Throughput uplift across nine workloads
| Workload | RTX 4090 | RTX 5090 | Uplift |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | 280 t/s | 1.41x |
| Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 1500 t/s | 1.36x |
| Llama 3.1 70B AWQ INT4 decode b1 | 22-24 t/s | 34-37 t/s | 1.55x |
| Llama 3.1 70B FP8 (does not fit on 4090) | n/a | 27 t/s | n/a |
| Qwen 2.5 32B AWQ decode b1 | 65 t/s | 92 t/s | 1.42x |
| SDXL 1024×1024 30-step | 2.0s | 1.4s | 1.43x |
| FLUX.1-dev FP8 30-step | 4.1s | 2.7s | 1.52x |
| Whisper large-v3-turbo INT8 | 80x RT | 130x RT | 1.63x |
| QLoRA Llama 8B (steps/s) | 2.6 | 3.7 | 1.42x |
Across these nine workloads the 5090 is consistently 1.35-1.65x the 4090. The big exception is anything that needs 70B at FP8: the 4090 simply cannot fit it, while the 5090 can with 8k context. See the Llama 70B INT4 benchmark for the AWQ baseline and the Llama 3 8B benchmark for the 8B numbers.
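To reproduce these figures on your own node, vLLM ships a serving benchmark. The invocation below is a sketch against the 8B deployment shown later in this post; flag names occasionally shift between vLLM releases, so check your checkout's --help before relying on it verbatim:

```bash
# Sketch: load-test the running OpenAI-compatible endpoint with random prompts.
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dataset-name random --random-input-len 512 --random-output-len 256 \
  --num-prompts 512
# Reports aggregate token throughput plus latency percentiles (TTFT/TPOT).
```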
When 32GB unlocks new workloads
| Workload | RTX 4090 24GB | RTX 5090 32GB |
|---|---|---|
| Llama 3.1 70B AWQ INT4 + 16k FP8 KV | Tight, --max-num-seqs 4 | Comfortable, 32k context |
| Llama 3.1 70B FP8 (35 GB weights) | OOM | Fits with 8k context |
| Mixtral 8x22B AWQ (74 GB) | OOM | OOM |
| Qwen 2.5 72B AWQ INT4 | Tight, 4k context | Fits, 32k context |
| SDXL training LoRA + Refiner cached | Tight | Comfortable |
| FLUX.1-dev FP16 (22 GB peak) | Risky | Comfortable |
| Llama Vision 11B + KV at 32k | Tight | Comfortable |
The four workloads where 32GB genuinely matters are: Llama 70B at FP8 (instead of AWQ), Qwen 72B at long context, FLUX in full FP16 with caching, and any large-context vision model. For everything else, the 24GB on the 4090 is sufficient.
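The tight-versus-comfortable verdicts follow from simple KV-cache arithmetic: per token, the cache costs 2 × layers × KV heads × head dim × bytes. A sketch for Llama 3.1 70B's published shape (80 layers, 8 KV heads via GQA, head dim 128, FP8 KV):

```bash
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes.
per_token=$((2 * 80 * 8 * 128 * 1))   # 163,840 B = 160 KiB per token at FP8
per_seq=$((per_token * 32768))        # one full 32k-context sequence
echo "KV per token:        ${per_token} B"
echo "KV per 32k sequence: $((per_seq / 1024 / 1024)) MiB"   # 5120 MiB = 5 GiB
```

Five gigabytes per long-context sequence is precisely the margin the extra 8 GB buys.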
Power, economics and tokens-per-watt
| Metric | RTX 4090 | RTX 5090 |
|---|---|---|
| TDP | 450W | 575W |
| Sustained LLM batch 32 power | 360W | 460W |
| Aggregate t/s on Llama 3 8B FP8 b32 | 1100 | 1500 |
| Tokens/Joule | 3.05 | 3.26 |
| UK price (typical 2026) | £1,300 | £2,100 |
| £/aggregate t/s (b32) | £1.18 | £1.40 |
| £/decode t/s (b1) | £6.57 | £7.50 |
| Annual electricity @ 24/7 £0.18/kWh | £568 | £725 |
The per-pound numbers favour the 4090. Per-watt slightly favours the 5090. For a fleet of ten cards serving a multi-tenant SaaS, the 4090 is currently 18-25% cheaper per delivered token. The 5090 wins decisively only when you need the VRAM. See the tokens-per-watt and vs OpenAI API cost analyses.
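For transparency, every derived row in that table falls straight out of the raw inputs; a quick shell check:

```bash
# Reproduce the derived rows from the raw inputs (bc arithmetic).
echo "4090 tokens/J: $(echo "scale=2; 1100/360"  | bc)"        # 3.05
echo "5090 tokens/J: $(echo "scale=2; 1500/460"  | bc)"        # 3.26
echo "4090 £/t/s:    $(echo "scale=2; 1300/1100" | bc)"        # 1.18
echo "5090 £/t/s:    $(echo "scale=2; 2100/1500" | bc)"        # 1.40
echo "4090 elec/yr:  £$(echo "scale=2; 0.36*8760*0.18" | bc)"  # ~568
echo "5090 elec/yr:  £$(echo "scale=2; 0.46*8760*0.18" | bc)"  # ~725
```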
Per-workload winner table
| Workload | Winner | Why |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B | 4090 | Better £/perf, plenty of headroom |
| 12-engineer Qwen Coder 32B AWQ team | 4090 | 65 t/s exceeds typing speed already |
| Single-tenant Llama 70B at FP8 | 5090 | 4090 cannot fit FP8 |
| Llama 70B AWQ INT4 with 32k context | 5090 | 4090 caps at 16k |
| SDXL studio, 500 imgs/day | 4090 | 2.0s suffices, lower power |
| FLUX.1-dev production at scale | 5090 | 1.5x faster, 32GB headroom |
| Voice agent (Whisper + TTS) | 4090 | 80x RT is overkill already |
| Multi-tenant 70B FP8 endpoint | 5090 | VRAM unlocks single-card serve |
| Tokens-per-watt-optimised hosting | 5090 | Marginally better t/J |
| Capex-constrained startup | 4090 | 40% cheaper, 70% the throughput |
vLLM serving on each card
The configurations look almost identical because vLLM abstracts the differences. The 5090 commands let you crank --max-num-seqs up and use longer contexts.
```bash
# RTX 4090 — Llama 3.1 8B FP8, 16k context, 32-way batching
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

# RTX 5090 — same model, 64k context, 64-way batching
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 65536 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92

# RTX 5090 only — Llama 3.1 70B at FP8 (does not fit on 4090)
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 8192 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94
```
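Whichever configuration you launch, a quick smoke test against the OpenAI-compatible endpoint confirms the container is serving before it goes behind a load balancer. The model name must match the one you launched:

```bash
# Minimal completion request against the vLLM OpenAI-compatible server.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "prompt": "The capital of the UK is",
       "max_tokens": 8}'
```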
Production gotchas with the 5090
- 575W power draw is a chassis problem. Not every 4U server with 4090 cooling can keep a 5090 under thermal limits at sustained load. Expect to see throttling on any chassis without dedicated 5090 airflow design.
- 12V-2×6 connectors only. The older 12VHPWR standard is technically compatible, but the seating issues that plagued early 4090 builds can recur. Use 12V-2×6 cables and seat them firmly.
- PCIe Gen 5 only matters for multi-card. Single-card inference does not saturate Gen 4 x16. The benefit appears only when you do tensor-parallel inference or NCCL all-reduce across cards.
- FP4 calibration is non-trivial. Do not just flip --quantization fp4 in production without an eval pass. Some models lose 2-3 MMLU points.
- Driver 570+ required. R570 is the first branch with Blackwell support; older branches do not expose 5th-gen tensor instructions correctly. Pin your container base image accordingly (a pre-flight check follows this list).
- NVENC 9th gen is strong on paper, but FFmpeg release builds lag behind it. If you do video transcoding alongside inference, check that your FFmpeg build includes 9th-gen NVENC support.
- Resale risk. The 5090’s high price means depreciation is steeper than the 4090, which is already at its second-hand floor. Consider this when sizing capex.
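A minimal pre-flight for the driver and thermal items above, using standard nvidia-smi query fields:

```bash
# Driver branch check: Blackwell needs R570 or newer.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Watch sustained draw, temperature and SM clocks under load, sampled every 5 s.
# Falling clocks at a steady 575 W draw means the chassis is throttling.
nvidia-smi --query-gpu=power.draw,temperature.gpu,clocks.sm \
  --format=csv -l 5
```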
Which to pick
- Pick the 4090 24GB if your model fits in 24GB at FP8 or AWQ; you are price-sensitive; you serve fewer than 100 concurrent users; or you want the lowest £/delivered-token. See the 4090 or 5090 decision guide.
- Pick the 5090 32GB if you need 70B+ at FP8 on a single card; you need 32k+ context on Qwen 72B; you want FP4 for the largest models; or you are building a long-tenure rack where the 1.4x throughput compounds.
- Pick neither if you actually need 96GB on one card — go to RTX 6000 Pro 96GB or an H100 80GB.
For a 200-MAU SaaS RAG on Llama 8B FP8, the 4090 is the clear choice. For a 70B-FP8 internal endpoint where capex is amortised over 24 months, the 5090 wins. For a 12-engineer coding team on Qwen 32B AWQ, both work, but the 4090 saves £800 per node.
Provision a 4090 today, evaluate the 5090 in three months
GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB on properly cooled, FP8-ready images. Start serving today; migrate to a 5090 when you actually need the VRAM.
Order the RTX 4090 24GB

See also: RTX 4090 spec breakdown, 2026 tier positioning, FP8 tensor cores on Ada, 4090 or 5090 decision, when to upgrade, Llama 70B INT4 deployment, RTX 4090 vs RTX 3090.