If your RTX 4090 24GB deployment is hitting throughput or VRAM ceilings, the RTX 5090 32GB is the natural next step on the consumer-derived inference ladder. Same form factor, same operational pattern, dramatically more bandwidth, native FP4 tensor cores, and a 33% bigger VRAM ceiling that turns several “tight” workloads into “comfortable” ones. This guide lays out the spec delta, the real-world throughput uplift across LLM and diffusion workloads, the price differential, the payback timeline, and the migration checklist – in the context of the wider UK GPU range.
Contents
- Spec delta: Ada AD102 vs Blackwell GB202
- Throughput uplift across workloads
- VRAM headroom: 24GB to 32GB
- Cost differential and per-token economics
- When the upgrade justifies itself
- Things to try before upgrading
- Migration checklist
- Payback timeline and verdict
Spec delta: Ada AD102 vs Blackwell GB202
The 5090 is two architectural generations on from the 4090. Most of the inference uplift comes from memory bandwidth, with FP4 native tensor cores and PCIe Gen5 thrown in for forward-compatibility.
| Spec | RTX 4090 24GB | RTX 5090 32GB | Change |
|---|---|---|---|
| Architecture | Ada AD102 | Blackwell GB202 | +2 generations |
| CUDA cores | 16,384 | 21,760 | +33% |
| Tensor cores | 512 (4th gen) | 680 (5th gen) | +33%, FP4 native |
| VRAM | 24GB GDDR6X | 32GB GDDR7 | +8GB (+33%) |
| Bandwidth | 1,008 GB/s | 1,792 GB/s | +78% |
| TDP | 450W | 575W | +125W (+28%) |
| PCIe | Gen4 x16 | Gen5 x16 | 2x lane bandwidth |
| FP16 TFLOPS | 165 | ~280 | +70% |
| FP8 TFLOPS (sparse) | ~660 | ~1,100 | +67% |
| FP4 support | No | Yes | New format path |
| NVLink | No | No | Unchanged |
The bandwidth jump is the headline number for inference. Decode is bandwidth-bound on every transformer LLM, so the 78% wider pipe to GDDR7 feeds directly into per-token throughput – and the gain grows with batch size, because the 4090 saturates its memory bus first.
Throughput uplift across workloads
Across LLM, diffusion, and audio workloads the 5090 lands between 1.4x and 1.7x the 4090. The exact uplift depends on whether the workload is bandwidth-bound (LLM decode), compute-bound (SDXL, prefill), or mixed.
| Workload | 4090 t/s or s/img | 5090 t/s or s/img | Uplift |
|---|---|---|---|
| Llama 3.1 8B FP8 batch 1 | 198 t/s | 280 t/s | +41% |
| Llama 3.1 8B FP8 aggregate batch 32 | 1,100 t/s | 1,700 t/s | +55% |
| Llama 3.1 70B AWQ INT4 batch 1 | 22 t/s | 36 t/s | +64% |
| Llama 3.1 70B AWQ INT4 concurrency 4 | ~75 t/s aggregate | ~125 t/s aggregate | +67% |
| Llama 3.1 70B FP8 batch 1 | OOM | OOM | n/a |
| Qwen 2.5 14B FP8 batch 1 | 120 t/s | 175 t/s | +46% |
| Mixtral 8x7B AWQ batch 1 | 78 t/s | 120 t/s | +54% |
| SDXL 1024×1024 30 steps | 3.4s | 2.1s | +62% throughput |
| Flux.1 Dev 1024×1024 | 14s | 8.5s | +65% |
| Whisper Large v3 1hr audio | 22s | 14s | +57% |
| Llama 8B FP8 t/J (efficiency) | 3.4 | 3.4 | Tied |
Tokens-per-joule stays roughly flat – the 5090's ~40-55% throughput gain on small models is largely paid for in extra watts, so efficiency is close to a wash. This matters for the tokens-per-watt calculation but rarely changes the upgrade decision.
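Benchmark figures like these shift with driver, engine version, and prompt mix, so reproduce them on your own hardware before committing. A minimal sketch using vLLM's bundled offline benchmark – it assumes a checkout of the vLLM repository, and the script path and flags reflect a 0.6.x-era CLI:

```bash
# Offline decode throughput for Llama 3.1 8B FP8 – run on both cards and compare.
# benchmark_throughput.py lives in the benchmarks/ directory of the vLLM repo.
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype fp8 \
  --input-len 1024 --output-len 256 \
  --num-prompts 256
```

Vary `--num-prompts` to approximate batch-1 versus aggregate serving load.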
VRAM headroom: 24GB to 32GB
The extra 8GB looks modest on paper but unlocks specific workloads at the boundary. Llama 70B FP8 (~70GB of weights at one byte per parameter) still does not fit on a single 5090, but everything in the 24-32GB band – Qwen 32B FP8, Mixtral 8x7B AWQ with full KV, 128k-context 8B – moves from “tight or OOM” to “comfortable”.
| Workload | 4090 24GB | 5090 32GB |
|---|---|---|
| Llama 8B FP8 + KV | 16GB free for KV (~32k context) | 24GB free for KV (~96k context) |
| Llama 70B AWQ INT4 | FP8 KV needed, batch 1-2 | FP16 KV OK, batch 4 comfortable |
| Llama 70B FP8 | OOM (~70GB of weights) | OOM (weights alone exceed 32GB) |
| Qwen 2.5 32B FP8 | Tight, may OOM at concurrency | Fits with KV headroom |
| Mixtral 8x7B AWQ INT4 | ~25GB – swap risk | Fits comfortably |
| 128k context Llama 3.1 8B | Tight – paged-attention pressure | Comfortable |
| SDXL + ControlNet + IP-Adapter | Tight | Comfortable |
If your symptom is purely VRAM (“the model does not fit”), the 5090 only solves it for the 24-32GB tier. Anything north of 32GB is RTX 6000 Pro 96GB or H100 80GB territory – covered in the 6000 Pro upgrade post.
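If you are unsure which side of the boundary a workload sits on, measure rather than guess: vLLM logs its KV cache allocation at startup, and a one-second VRAM poll during warm-up shows the high-water mark. For example:

```bash
# Poll total vs used VRAM once a second while the model loads and serves
# a representative request – the peak tells you how much KV headroom remains.
nvidia-smi --query-gpu=memory.total,memory.used --format=csv -l 1
```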
Cost differential and per-token economics
Indicative UK dedicated hosting in 2026:
| Card | £/month | £/year | Throughput uplift | £/M tokens (8B FP8) |
|---|---|---|---|---|
| RTX 4090 24GB | £575 | £6,900 | baseline | £0.039 |
| RTX 5090 32GB | £900 | £10,800 | +44-55% | £0.041 |
| Delta | +£325/mo | +£3,900/yr | +50% headline | +5% |
The 5090 costs ~57% more for ~50% more throughput on bandwidth-bound workloads, so cost-per-token actually rises by a few percent. The upgrade does not pay for itself on raw economics. It pays for itself on indirect benefits: extra VRAM enables previously-impossible workloads, lower TTFT helps user-facing UX metrics, and FP4 readiness future-proofs the deployment for 2026-era models. See the monthly hosting cost guide for the full TCO breakdown and the 4090 vs 5090 spec deep-dive.
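To put your own traffic through the same arithmetic, the calculation is just monthly price over tokens actually served – the £/M-token column above is highly sensitive to sustained throughput and utilisation. A sketch with illustrative placeholder inputs, not measurements:

```bash
# £/M tokens = monthly price / (aggregate t/s × seconds per month × utilisation / 1e6)
price=900    # £/month (placeholder)
tps=1700     # sustained aggregate decode t/s (placeholder)
util=0.5     # fraction of the month spent at that load (placeholder)
awk -v p="$price" -v t="$tps" -v u="$util" \
  'BEGIN { printf "£%.3f per million tokens\n", p / (t * 3600 * 24 * 30 * u / 1e6) }'
```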
When the upgrade justifies itself
- You need Qwen 2.5 32B FP8 with full KV at production batch sizes
- You need 128k-context inference at low latency
- You serve consumer-facing chat where TTFT below 250ms is a hard SLA
- Your 4090 hits aggregate throughput ceiling at p95 traffic and you want headroom not horizontal scale
- You are evaluating FP4-quantised models (native on Blackwell; Ada has no FP4 path)
- You are sizing for a 2-3 year deployment and want bandwidth headroom for the next model release
- Your image generation pipeline (SDXL, Flux, ComfyUI) blocks the LLM during peak load
- You run Mixtral 8x7B and need full KV without paged-attention pressure
If none of these apply and your workload is purely 8B FP8 chat or 70B AWQ INT4, stay on the 4090 – cost-per-token is meaningfully better.
Things to try before upgrading
Before you spend an extra £3,900/year, run through this list – each item is one config change away, and a combined launch sketch follows the list.
- FP8 KV cache. Set `--kv-cache-dtype fp8`. Doubles your effective KV memory at sub-1% quality cost. Documented in the vLLM setup guide.
- Speculative decoding. A 1B draft model in front of an 8B target model can lift decode 1.5-2x at under 5% extra VRAM.
- Prefix caching. RAG workloads with stable system prompts see 30-60% TTFT reductions.
- Move embeddings off the LLM card. A £75/month 5060 Ti hosts BGE-M3 – see the hybrid pairing.
- Reduce `--max-model-len`. A 128k limit reserves worst-case memory even when most requests are 4k.
- Increase `--gpu-memory-utilization`. The default of 0.90 leaves ~2.4GB unused on a 4090.
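As promised, here is what several of those knobs look like combined in a single launch. A sketch against a 0.6.x-era vLLM CLI – the Llama 3.2 1B draft model is an illustrative choice, and some versions restrict which features can be enabled together, so validate the combination on your stack:

```bash
# Squeeze more from the existing 4090 before paying for a 5090:
# FP8 KV cache + prefix caching + speculative decoding + tighter memory settings.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```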
Migration checklist
```bash
# vLLM launch on Blackwell GB202 (5090)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.94
```
- Confirm vLLM/SGLang/TensorRT-LLM versions support Blackwell sm_120 (vLLM 0.6.4+, TensorRT-LLM 0.13+, SGLang 0.4+)
- Ensure CUDA 12.8 or later in your container images – 12.4 will boot but run slow paths
- Re-quantise FP4 models if you want to use the new format – existing FP8 weights still run
- Re-benchmark KV cache settings – 32GB allows a larger `--gpu-memory-utilization` and a bigger `--max-num-seqs`
- Update PCIe Gen5-aware NIC drivers if your host supports them
- Validate power: the host PSU needs headroom for 575W sustained plus host overhead – 1000W minimum, 1200W recommended
- Verify chassis cooling – the 5090 reference design exhausts hotter than the 4090 and benefits from rear-mounted exhaust fans (post-swap sanity checks follow this list)
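Once the card is in, two one-liners confirm the stack actually sees Blackwell before traffic returns – the expected values are inferred from the requirements above:

```bash
# Driver branch, power cap, and compute capability after the swap.
nvidia-smi --query-gpu=name,driver_version,power.limit --format=csv
# GB202 should report compute capability (12, 0), i.e. sm_120, under CUDA 12.8+.
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"
```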
Production gotchas
- Driver branch confusion. Blackwell needs NVIDIA driver 570+ (the branch that ships CUDA 12.8). Some LTS distros pin to 535. Check with `nvidia-smi` before deploying.
- FP4 weight format is not interchangeable. An FP4-quantised Llama 8B will not load on a 4090. Keep both checkpoints during a phased rollout.
- vLLM 0.6.3 silent fallback. Earlier vLLM versions run but fall back to sm_90 kernel paths on Blackwell; throughput lands ~30% below spec. Always pin 0.6.4+.
- Power excursions. 575W TDP is sustained, transients hit 600W+. Cheap PSUs trip on the spike even with adequate average headroom.
- FlashAttention 3 required. FA2 works but FA3 unlocks the Blackwell speed-ups. Pin `flash-attn>=3.0`.
- Container CUDA mismatch. NGC containers on CUDA 12.4 install fine but cuBLAS Blackwell kernels are missing. Rebuild on 12.8.
- KV reservation surprise. 32GB cards encourage `--max-num-seqs 32` defaults that worked on H100 but starve activation buffers on a single 5090 (a version-pin sketch follows this list).
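Most of these gotchas reduce to pinning versions explicitly instead of floating latest. A sketch of the pins the list implies – the exact version numbers are this article's claims, so substitute whatever your stack has validated:

```bash
# Pin the engine and attention kernels called out above.
pip install "vllm>=0.6.4" "flash-attn>=3.0"
# Confirm the driver branch supports CUDA 12.8 before re-enabling traffic.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```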
Payback timeline and verdict
For pure-throughput workloads the 5090 does not pay for itself – cost-per-token is slightly worse. The upgrade earns its keep on indirect economics:
- Capacity unlock – serving traffic the 4090 turned away. Payback is immediate at the moment you unblock revenue.
- UX threshold – if a sub-250ms TTFT is contractually required, the 5090 is the cheapest way to hit it on a single card.
- Roadmap insurance – 2026-era frontier models will increasingly target FP4 and 32k-128k native context. The 5090 carries the deployment forward 18-24 months without another forklift.
Verdict. Upgrade if VRAM pressure, TTFT SLA, or FP4 roadmap pressure is real. Stay on the 4090 if cost-per-token is the dominant metric and your model menu is stable in the 8B-14B FP8 / 70B AWQ range. The decision logic is split out in the 4090-or-5090 decision post.
The proven workhorse, before you scale up
The RTX 4090 24GB still delivers best-in-class cost-per-token for FP8 chat on UK dedicated hosting.
Order the RTX 4090 24GB

See also: 4090 vs 5090 decision, spec deep-dive, when to upgrade, upgrade to 6000 Pro, spec breakdown, tier positioning 2026, FP8 Llama deployment.