If your RTX 4090 24GB deployment is hitting throughput or VRAM ceilings, the RTX 5090 32GB is the natural next step on the consumer-derived inference ladder. Same form factor, same operational pattern, dramatically more bandwidth, native FP4 tensor cores, and a 33% bigger VRAM ceiling that turns several “tight” workloads into “comfortable” ones. This guide lays out the spec delta, the real-world throughput uplift across LLM and diffusion workloads, the price differential, the payback timeline, and the migration checklist – in the context of the wider UK GPU range.
Contents
- Spec delta: Ada AD102 vs Blackwell GB202
- Throughput uplift across workloads
- VRAM headroom: 24GB to 32GB
- Cost differential and per-token economics
- When the upgrade justifies itself
- Things to try before upgrading
- Migration checklist
- Payback timeline and verdict
Spec delta: Ada AD102 vs Blackwell GB202
The 5090 is two architectural generations on from the 4090. Most of the inference uplift comes from memory bandwidth, with FP4 native tensor cores and PCIe Gen5 thrown in for forward-compatibility.
| Spec | RTX 4090 24GB | RTX 5090 32GB | Change |
|---|---|---|---|
| Architecture | Ada AD102 | Blackwell GB202 | +2 generations |
| CUDA cores | 16,384 | 21,760 | +33% |
| Tensor cores | 512 (4th gen) | 680 (5th gen) | +33%, FP4 native |
| VRAM | 24GB GDDR6X | 32GB GDDR7 | +8GB (+33%) |
| Bandwidth | 1,008 GB/s | 1,792 GB/s | +78% |
| TDP | 450W | 575W | +125W (+28%) |
| PCIe | Gen4 x16 | Gen5 x16 | 2x lane bandwidth |
| FP16 TFLOPS | 165 | ~280 | +70% |
| FP8 TFLOPS (sparse) | ~660 | ~1,100 | +67% |
| FP4 support | No | Yes | New format path |
| NVLink | No | No | Unchanged |
The bandwidth jump is the headline number for inference. Decode is bandwidth-bound on every transformer LLM, so the 78% wider pipe to GDDR7 feeds directly into per-token throughput – and the gain grows with batch size, because the 4090 saturates its memory bus first.
Throughput uplift across workloads
Across LLM, diffusion, and audio workloads the 5090 lands between 1.4x and 1.7x the 4090. The exact uplift depends on whether the workload is bandwidth-bound (LLM decode), compute-bound (SDXL, prefill), or mixed.
| Workload | 4090 t/s or s/img | 5090 t/s or s/img | Uplift |
|---|---|---|---|
| Llama 3.1 8B FP8 batch 1 | 198 t/s | 280 t/s | +41% |
| Llama 3.1 8B FP8 aggregate batch 32 | 1,100 t/s | 1,700 t/s | +55% |
| Llama 3.1 70B AWQ INT4 batch 1 | 22 t/s | 36 t/s | +64% |
| Llama 3.1 70B AWQ INT4 concurrency 4 | ~75 t/s aggregate | ~125 t/s aggregate | +67% |
| Llama 3.1 70B FP8 batch 1 | OOM | OOM | n/a |
| Qwen 2.5 14B FP8 batch 1 | 120 t/s | 175 t/s | +46% |
| Mixtral 8x7B AWQ batch 1 | 78 t/s | 120 t/s | +54% |
| SDXL 1024×1024 30 steps | 3.4s | 2.1s | +62% throughput |
| Flux.1 Dev 1024×1024 | 14s | 8.5s | +65% |
| Whisper Large v3 1hr audio | 22s | 14s | +57% |
| Llama 8B FP8 t/J (efficiency) | 3.4 | 3.4 | Tied |
Tokens-per-joule stays roughly flat – the 5090's ~40-55% throughput gain on small models is largely paid for in extra watts, so efficiency is close to a wash. This matters for the tokens-per-watt calculation but rarely changes the upgrade decision.
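Benchmark figures like these shift with driver, engine version, and prompt mix, so reproduce them on your own hardware before committing. A minimal sketch using vLLM's bundled offline benchmark – it assumes a checkout of the vLLM repository, and the script path and flags reflect a 0.6.x-era CLI:

```bash
# Offline decode throughput for Llama 3.1 8B FP8 – run on both cards and compare.
# benchmark_throughput.py lives in the benchmarks/ directory of the vLLM repo.
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype fp8 \
  --input-len 1024 --output-len 256 \
  --num-prompts 256
```

Vary `--num-prompts` to approximate batch-1 versus aggregate serving load.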
VRAM headroom: 24GB to 32GB
The extra 8GB looks modest on paper but unlocks specific workloads at the boundary. Llama 70B FP8 (~70GB of weights at one byte per parameter) still does not fit on a single 5090, but everything in the 24-32GB band – Qwen 32B FP8, Mixtral 8x7B AWQ with full KV, 128k-context 8B – moves from “tight or OOM” to “comfortable”.
| Workload | 4090 24GB | 5090 32GB |
|---|---|---|
| Llama 8B FP8 + KV | 16GB free for KV (~32k context) | 24GB free for KV (~96k context) |
| Llama 70B AWQ INT4 | FP8 KV needed, batch 1-2 | FP16 KV OK, batch 4 comfortable |
| Llama 70B FP8 | OOM (~70GB of weights) | OOM (weights alone exceed 32GB) |
| Qwen 2.5 32B FP8 | Tight, may OOM at concurrency | Fits with KV headroom |
| Mixtral 8x7B AWQ INT4 | ~25GB – swap risk | Fits comfortably |
| 128k context Llama 3.1 8B | Tight – paged-attention pressure | Comfortable |
| SDXL + ControlNet + IP-Adapter | Tight | Comfortable |
If your symptom is purely VRAM (“the model does not fit”), the 5090 only solves it for the 24-32GB tier. Anything north of 32GB is RTX 6000 Pro 96GB or H100 80GB territory – covered in the 6000 Pro upgrade post.
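If you are unsure which side of the boundary a workload sits on, measure rather than guess: vLLM logs its KV cache allocation at startup, and a one-second VRAM poll during warm-up shows the high-water mark. For example:

```bash
# Poll total vs used VRAM once a second while the model loads and serves
# a representative request – the peak tells you how much KV headroom remains.
nvidia-smi --query-gpu=memory.total,memory.used --format=csv -l 1
```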
Cost differential and per-token economics
Indicative UK dedicated hosting in 2026:
| Card | £/month | £/year | Throughput uplift | £/M tokens (8B FP8) |
|---|---|---|---|---|
| RTX 4090 24GB | £575 | £6,900 | baseline | £0.039 |
| RTX 5090 32GB | £900 | £10,800 | +44-55% | £0.041 |
| Delta | +£325/mo | +£3,900/yr | +50% headline | +5% |
The 5090 costs ~57% more for ~50% more throughput on bandwidth-bound workloads, so cost-per-token actually rises by a few percent. The upgrade does not pay for itself on raw economics. It pays for itself on indirect benefits: extra VRAM enables previously-impossible workloads, lower TTFT helps user-facing UX metrics, and FP4 readiness future-proofs the deployment for 2026-era models. See the monthly hosting cost guide for the full TCO breakdown and the 4090 vs 5090 spec deep-dive.
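To put your own traffic through the same arithmetic, the calculation is just monthly price over tokens actually served – the £/M-token column above is highly sensitive to sustained throughput and utilisation. A sketch with illustrative placeholder inputs, not measurements:

```bash
# £/M tokens = monthly price / (aggregate t/s × seconds per month × utilisation / 1e6)
price=900    # £/month (placeholder)
tps=1700     # sustained aggregate decode t/s (placeholder)
util=0.5     # fraction of the month spent at that load (placeholder)
awk -v p="$price" -v t="$tps" -v u="$util" \
  'BEGIN { printf "£%.3f per million tokens\n", p / (t * 3600 * 24 * 30 * u / 1e6) }'
```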
When the upgrade justifies itself
- You need Qwen 2.5 32B FP8 with full KV at production batch sizes
- You need 128k-context inference at low latency
- You serve consumer-facing chat where TTFT below 250ms is a hard SLA
- Your 4090 hits aggregate throughput ceiling at p95 traffic and you want headroom not horizontal scale
- You are evaluating FP4-quantised models (native on Blackwell; Ada has no FP4 path)
- You are sizing for a 2-3 year deployment and want bandwidth headroom for the next model release
- Your image generation pipeline (SDXL, Flux, ComfyUI) blocks the LLM during peak load
- You run Mixtral 8x7B and need full KV without paged-attention pressure
If none of these apply and your workload is purely 8B FP8 chat or 70B AWQ INT4, stay on the 4090 – cost-per-token is meaningfully better.
Things to try before upgrading
Before you spend an extra £3,900/year, run through this list – each item is one config change away, and a combined launch sketch follows the list.
- FP8 KV cache. Set `--kv-cache-dtype fp8`. Doubles your effective KV memory at sub-1% quality cost. Documented in the vLLM setup guide.
- Speculative decoding. A 1B draft model in front of an 8B target model can lift decode 1.5-2x at under 5% extra VRAM.
- Prefix caching. RAG workloads with stable system prompts see 30-60% TTFT reductions.
- Move embeddings off the LLM card. A £75/month 5060 Ti hosts BGE-M3 – see the hybrid pairing.
- Reduce `--max-model-len`. A 128k limit reserves worst-case memory even when most requests are 4k.
- Increase `--gpu-memory-utilization`. The default of 0.90 leaves ~2.4GB unused on a 4090.
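As promised, here is what several of those knobs look like combined in a single launch. A sketch against a 0.6.x-era vLLM CLI – the Llama 3.2 1B draft model is an illustrative choice, and some versions restrict which features can be enabled together, so validate the combination on your stack:

```bash
# Squeeze more from the existing 4090 before paying for a 5090:
# FP8 KV cache + prefix caching + speculative decoding + tighter memory settings.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```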
Migration checklist
```bash
# vLLM launch on Blackwell GB202 (5090)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.94
```
- Confirm vLLM/SGLang/TensorRT-LLM versions support Blackwell sm_120 (vLLM 0.6.4+, TensorRT-LLM 0.13+, SGLang 0.4+)
- Ensure CUDA 12.8 or later in your container images – 12.4 will boot but run slow paths
- Re-quantise FP4 models if you want to use the new format – existing FP8 weights still run
- Re-benchmark KV cache settings – 32GB allows a larger `--gpu-memory-utilization` and a bigger `--max-num-seqs`
- Update PCIe Gen5-aware NIC drivers if your host supports them
- Validate power: the host PSU needs headroom for 575W sustained plus host overhead – 1000W minimum, 1200W recommended
- Verify chassis cooling – the 5090 reference design exhausts hotter than the 4090 and benefits from rear-mounted exhaust fans (post-swap sanity checks follow this list)
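Once the card is in, two one-liners confirm the stack actually sees Blackwell before traffic returns – the expected values are inferred from the requirements above:

```bash
# Driver branch, power cap, and compute capability after the swap.
nvidia-smi --query-gpu=name,driver_version,power.limit --format=csv
# GB202 should report compute capability (12, 0), i.e. sm_120, under CUDA 12.8+.
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"
```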
Production gotchas
- Driver branch confusion. Blackwell needs NVIDIA driver 570+ (the branch that ships CUDA 12.8). Some LTS distros pin to 535. Check with `nvidia-smi` before deploying.
- FP4 weight format is not interchangeable. An FP4-quantised Llama 8B will not load on a 4090. Keep both checkpoints during a phased rollout.
- vLLM 0.6.3 silent fallback. Earlier vLLM versions run but fall back to sm_90 kernel paths on Blackwell; throughput lands ~30% below spec. Always pin 0.6.4+.
- Power excursions. 575W TDP is sustained, transients hit 600W+. Cheap PSUs trip on the spike even with adequate average headroom.
- FlashAttention 3 required. FA2 works but FA3 unlocks the Blackwell speed-ups. Pin `flash-attn>=3.0`.
- Container CUDA mismatch. NGC containers on CUDA 12.4 install fine but cuBLAS Blackwell kernels are missing. Rebuild on 12.8.
- KV reservation surprise. 32GB cards encourage `--max-num-seqs 32` defaults that worked on H100 but starve activation buffers on a single 5090 (a version-pin sketch follows this list).
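Most of these gotchas reduce to pinning versions explicitly instead of floating latest. A sketch of the pins the list implies – the exact version numbers are this article's claims, so substitute whatever your stack has validated:

```bash
# Pin the engine and attention kernels called out above.
pip install "vllm>=0.6.4" "flash-attn>=3.0"
# Confirm the driver branch supports CUDA 12.8 before re-enabling traffic.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```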
Payback timeline and verdict
For pure-throughput workloads the 5090 does not pay for itself – cost-per-token is slightly worse. The upgrade earns its keep on indirect economics:
- Capacity unlock – serving traffic the 4090 turned away. Payback is immediate at the moment you unblock revenue.
- UX threshold – if a sub-250ms TTFT is contractually required, the 5090 is the cheapest way to hit it on a single card.
- Roadmap insurance – 2026-era frontier models will increasingly target FP4 and 32k-128k native context. The 5090 carries the deployment forward 18-24 months without another forklift.
Verdict. Upgrade if VRAM pressure, TTFT SLA, or FP4 roadmap pressure is real. Stay on the 4090 if cost-per-token is the dominant metric and your model menu is stable in the 8B-14B FP8 / 70B AWQ range. The decision logic is split out in the 4090-or-5090 decision post.
The proven workhorse, before you scale up
The RTX 4090 24GB still delivers best-in-class cost-per-token for FP8 chat on UK dedicated hosting.
Order the RTX 4090 24GB

See also: 4090 vs 5090 decision, spec deep-dive, when to upgrade, upgrade to 6000 Pro, spec breakdown, tier positioning 2026, FP8 Llama deployment.