
Upgrading From RTX 4090 24GB to RTX 5090 32GB

Spec delta, real-world inference uplift, VRAM headroom, cost differential, payback timeline, and the migration checklist for moving from Ada AD102 to Blackwell GB202.

If your RTX 4090 24GB deployment is hitting throughput or VRAM ceilings, the RTX 5090 32GB is the natural next step on the consumer-derived inference ladder. Same form factor, same operational pattern, dramatically more bandwidth, native FP4 tensor cores, and a 33% bigger VRAM ceiling that turns several “tight” workloads into “comfortable” ones. This guide lays out the spec delta, the real-world throughput uplift across LLM and diffusion workloads, the price differential, the payback timeline, and the migration checklist – in the context of the wider UK GPU range.

Spec delta: Ada AD102 vs Blackwell GB202

The 5090 is two architectural generations on from the 4090. Most of the inference uplift comes from memory bandwidth, with FP4 native tensor cores and PCIe Gen5 thrown in for forward-compatibility.

| Spec | RTX 4090 24GB | RTX 5090 32GB | Change |
|---|---|---|---|
| Architecture | Ada AD102 | Blackwell GB202 | +2 generations |
| CUDA cores | 16,384 | 21,760 | +33% |
| Tensor cores | 512 (4th gen) | 680 (5th gen) | +33%, FP4 native |
| VRAM | 24GB GDDR6X | 32GB GDDR7 | +8GB (+33%) |
| Bandwidth | 1,008 GB/s | 1,792 GB/s | +78% |
| TDP | 450W | 575W | +125W (+28%) |
| PCIe | Gen4 x16 | Gen5 x16 | 2x lane bandwidth |
| FP16 TFLOPS | 165 | ~280 | +70% |
| FP8 TFLOPS (sparse) | ~660 | ~1,100 | +67% |
| FP4 support | No | Yes | New format path |
| NVLink | No | No | Unchanged |

The bandwidth jump is the headline number for inference. Decode is bandwidth-bound on every transformer LLM, so a 78% bigger pipe to GDDR7 translates almost linearly into per-token throughput on small models and scales further on aggregate batch.
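A quick sanity check of that claim, using only the spec-table numbers — this is a back-of-envelope roofline, not a measurement:

```python
# Decode streams the full weight set from VRAM for every generated token,
# so the best-case decode speed-up from the upgrade is simply the ratio
# of memory bandwidths. Figures are taken from the spec table above.

bw_4090 = 1008.0  # GB/s, GDDR6X
bw_5090 = 1792.0  # GB/s, GDDR7

ceiling = bw_5090 / bw_4090 - 1.0  # theoretical decode uplift
print(f"bandwidth ceiling: +{ceiling:.0%}")  # +78%

# Observed batch-1 uplifts sit below the ceiling because attention,
# sampling, and kernel-launch overhead are not pure weight streaming.
observed_8b_fp8 = 0.41
print(f"Llama 8B FP8 batch 1: +{observed_8b_fp8:.0%} of a possible +{ceiling:.0%}")
```

Batch-1 numbers land under the ceiling; aggregate-batch numbers get closer to it because the weight traffic is amortised across more tokens.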

Throughput uplift across workloads

Across LLM, diffusion, and audio workloads the 5090 lands between 1.4x and 1.7x the 4090. The exact uplift depends on whether the workload is bandwidth-bound (LLM decode), compute-bound (SDXL, prefill), or mixed.

| Workload | 4090 t/s or s/img | 5090 t/s or s/img | Uplift |
|---|---|---|---|
| Llama 3.1 8B FP8, batch 1 | 198 t/s | 280 t/s | +41% |
| Llama 3.1 8B FP8, aggregate batch 32 | 1,100 t/s | 1,700 t/s | +55% |
| Llama 3.1 70B AWQ INT4, batch 1 | 22 t/s | 36 t/s | +64% |
| Llama 3.1 70B AWQ INT4, concurrency 4 | ~75 aggregate | ~125 aggregate | +67% |
| Llama 3.1 70B FP8, batch 1 | OOM | ~30 (tight) | n/a |
| Qwen 2.5 14B FP8, batch 1 | 120 t/s | 175 t/s | +46% |
| Mixtral 8x7B AWQ, batch 1 | 78 t/s | 120 t/s | +54% |
| SDXL 1024×1024, 30 steps | 3.4s | 2.1s | +62% throughput |
| Flux.1 Dev 1024×1024 | 14s | 8.5s | +65% |
| Whisper Large v3, 1hr audio | 22s | 14s | +57% |
| Llama 8B FP8 t/J (efficiency) | 3.4 | 3.4 | Tied |

Tokens-per-joule stays roughly flat – the 5090's extra throughput is matched by proportionally higher power draw under load, so efficiency is a wash. This matters for the tokens-per-watt calculation but rarely changes the upgrade decision.

VRAM headroom: 24GB to 32GB

The extra 8GB looks modest on paper but unlocks specific workloads at the boundary. Llama 70B FP8 (38GB) still does not fit on a single 5090, but everything in the 24-32GB band – Qwen 32B FP8, Mixtral 8x7B AWQ with full KV, 128k-context 8B – moves from “tight or OOM” to “comfortable”.

| Workload | 4090 24GB | 5090 32GB |
|---|---|---|
| Llama 8B FP8 + KV | 16GB free for KV (~32k context) | 24GB free for KV (~96k context) |
| Llama 70B AWQ INT4 | FP8 KV needed, batch 1-2 | FP16 KV OK, batch 4 comfortable |
| Llama 70B FP8 | OOM (38GB needed) | OOM (still 6GB short) |
| Qwen 2.5 32B FP8 | Tight, may OOM at concurrency | Fits with KV headroom |
| Mixtral 8x7B AWQ INT4 | ~25GB – swap risk | Fits comfortably |
| 128k context Llama 3.1 8B | Tight – paged-attention pressure | Comfortable |
| SDXL + ControlNet + IP-Adapter | Tight | Comfortable |

If your symptom is purely VRAM (“the model does not fit”), the 5090 only solves it for the 24-32GB tier. Anything north of 32GB is RTX 6000 Pro 96GB or H100 80GB territory – covered in the 6000 Pro upgrade post.
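To reason about what 8GB of extra headroom buys in KV cache terms, the per-token cost can be sketched from the model architecture. This assumes Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head dim 128) and ignores runtime overheads such as activations and CUDA graphs, so treat the token counts as upper bounds:

```python
# Per-token KV cache cost for a GQA transformer: K and V, per layer,
# per KV head, per head dim. Config values are Llama 3.1 8B's published
# architecture; runtime overheads are ignored.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_tok_fp16 = kv_bytes_per_token(32, 8, 128, 2)   # 131,072 B = 128 KiB
per_tok_fp8  = kv_bytes_per_token(32, 8, 128, 1)   # 64 KiB — FP8 KV halves it

extra_vram = 8 * 1024**3  # the 5090's additional 8 GiB
print(f"extra FP16 KV tokens from +8GB: {extra_vram // per_tok_fp16:,}")
```

The same formula explains why 70B-class models with wider KV footprints feel the 8GB difference even more at batch sizes above 1.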

Cost differential and per-token economics

Indicative UK dedicated hosting in 2026:

| Card | £/month | £/year | Throughput uplift | £/M tokens (8B FP8) |
|---|---|---|---|---|
| RTX 4090 24GB | £575 | £6,900 | baseline | £0.039 |
| RTX 5090 32GB | £900 | £10,800 | +44-55% | £0.041 |
| Delta | +£325/mo | +£3,900/yr | +50% headline | +5% |

The 5090 costs ~57% more for ~50% more throughput on bandwidth-bound workloads, so cost-per-token actually rises by a few percent. The upgrade does not pay for itself on raw economics. It pays for itself on indirect benefits: extra VRAM enables previously-impossible workloads, lower TTFT helps user-facing UX metrics, and FP4 readiness future-proofs the deployment for 2026-era models. See the monthly hosting cost guide for the full TCO breakdown and the vs 5090 spec deep-dive.
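The per-token arithmetic is one division, using the indicative prices above and the headline mid-point uplift:

```python
# Per-token economics: spend ratio divided by throughput ratio.
# Prices are the indicative UK figures from the table above; 1.5x is
# the headline mid-point uplift for bandwidth-bound LLM workloads.

cost_4090, cost_5090 = 575.0, 900.0      # £/month
spend_ratio = cost_5090 / cost_4090      # ~1.57 → +57% spend
throughput_ratio = 1.50                  # +50% tokens out

per_token_ratio = spend_ratio / throughput_ratio
print(f"cost per token changes by {per_token_ratio - 1:+.0%}")  # ~+4%
```

Any uplift below ~56% leaves per-token cost higher on the 5090, which is why the raw economics alone do not close the case.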

When the upgrade justifies itself

  • You need Qwen 2.5 32B FP8 with full KV at production batch sizes
  • You need 128k-context inference at low latency
  • You serve consumer-facing chat where TTFT below 250ms is a hard SLA
  • Your 4090 hits its aggregate throughput ceiling at p95 traffic and you want headroom, not horizontal scale
  • You are evaluating FP4-quantised models (native on Blackwell; Ada has no FP4 path)
  • You are sizing for a 2-3 year deployment and want bandwidth headroom for the next model release
  • Your image generation pipeline (SDXL, Flux, ComfyUI) blocks the LLM during peak load
  • You run Mixtral 8x7B and need full KV without paged-attention pressure

If none of these apply and your workload is purely 8B FP8 chat or 70B AWQ INT4, stay on the 4090 – cost-per-token is meaningfully better.

Things to try before upgrading

Before you spend an extra £3,900/year, run through this list. Each item is one config change away.

  1. FP8 KV cache. Set --kv-cache-dtype fp8. Doubles your effective KV memory at sub-1% quality cost. Documented in the vLLM setup guide.
  2. Speculative decoding. A 1B draft model in front of an 8B target model can lift decode 1.5-2x at <5% extra VRAM.
  3. Prefix caching. RAG workloads with stable system prompts see 30-60% TTFT reduction.
  4. Move embeddings off the LLM card. A £75/month 5060 Ti hosts BGE-M3 – see the hybrid pairing.
  5. Reduce --max-model-len. 128k allocations leak even when most requests are 4k.
  6. Increase --gpu-memory-utilization. Default 0.90 leaves ~2.4GB unused on a 4090.
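Items 1 and 6 compound, and the combined effect is easy to budget. A rough sketch for a 24GB 4090 serving Llama 8B FP8, assuming ~8GB of weights and ignoring activation and CUDA-graph overheads (so these are upper bounds, not vLLM's exact accounting):

```python
# Rough vLLM memory budget: the engine reserves total * --gpu-memory-utilization,
# weights come out of that reservation, and approximately the rest is KV cache.

def kv_budget_gb(total_gb: float, utilization: float, weights_gb: float) -> float:
    return total_gb * utilization - weights_gb

default = kv_budget_gb(24, 0.90, 8.0)   # 13.6 GB at the 0.90 default
tuned   = kv_budget_gb(24, 0.94, 8.0)   # ~14.56 GB — roughly 1GB reclaimed
fp8_kv  = tuned * 2                     # FP8 KV cache doubles effective capacity

print(f"{default:.1f} GB -> {tuned:.2f} GB, {fp8_kv:.2f} GB effective with FP8 KV")
```

Two config flags recover more than half of the 8GB the upgrade would buy — which is exactly why this list comes before the order form.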

Migration checklist

```bash
# vLLM launch on Blackwell GB202 (5090)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.94
```
  1. Confirm vLLM/SGLang/TensorRT-LLM versions support Blackwell sm_120 (vLLM 0.6.4+, TensorRT-LLM 0.13+, SGLang 0.4+)
  2. Ensure CUDA 12.8 or later in your container images – 12.4 will boot but run slow paths
  3. Re-quantise FP4 models if you want to use the new format – existing FP8 weights still run
  4. Re-benchmark KV cache settings – 32GB allows larger --gpu-memory-utilization and bigger --max-num-seqs
  5. Update PCIe Gen5-aware NIC drivers if your host supports them
  6. Validate power: the host PSU needs headroom for 575W sustained plus host overhead – 1000W minimum, 1200W recommended
  7. Verify chassis cooling – the 5090 reference design exhausts hotter than the 4090 and benefits from rear-mounted exhaust fans
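For checklist item 6, a rule-of-thumb PSU sizing calculation. The transient factor, host overhead, and 80% load ceiling are assumptions for illustration, not measurements:

```python
# PSU sizing rule of thumb: GPU TDP with a transient margin, plus host
# overhead (CPU, drives, fans), with sustained load kept at or under
# ~80% of the PSU rating. All three factors are assumed values.

def recommended_psu_w(gpu_tdp_w: float, transient_factor: float = 1.1,
                      host_overhead_w: float = 250.0,
                      max_load_fraction: float = 0.8) -> float:
    sustained = gpu_tdp_w * transient_factor + host_overhead_w
    return sustained / max_load_fraction

print(f"5090: {recommended_psu_w(575):.0f}W")  # ~1103W → a 1200W unit
print(f"4090: {recommended_psu_w(450):.0f}W")  # ~931W  → a 1000W unit
```

The outputs line up with the 1000W-minimum / 1200W-recommended guidance above; tighten the transient factor if your PSU vendor publishes ATX 3.x excursion ratings.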

Production gotchas

  1. Driver branch confusion. Blackwell needs NVIDIA driver 555+. Some LTS distros pin to 535. Check with nvidia-smi before deploying.
  2. FP4 weight format is not interchangeable. An FP4-quantised Llama 8B will not load on a 4090. Keep both checkpoints during a phased rollout.
  3. vLLM 0.6.3 silent fallback. Earlier vLLM runs but uses sm_90 paths on Blackwell. Throughput is ~30% below spec. Always pin 0.6.4+.
  4. Power excursions. 575W TDP is sustained, transients hit 600W+. Cheap PSUs trip on the spike even with adequate average headroom.
  5. FlashAttention 3 required. FA2 works but FA3 unlocks the Blackwell speed-ups. Pin flash-attn>=3.0.
  6. Container CUDA mismatch. NGC containers on CUDA 12.4 install fine but cuBLAS Blackwell kernels are missing. Rebuild on 12.8.
  7. KV reservation surprise. 32GB cards encourage --max-num-seqs 32 defaults that worked on H100 but starve activation buffers on a single 5090.
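Gotcha 1 is cheap to guard against in a deploy script. A minimal sketch that parses the driver branch out of the `nvidia-smi` banner line — the sample strings are illustrative, not captured output; in practice feed it `subprocess.run(["nvidia-smi"], capture_output=True)`:

```python
import re

MIN_BLACKWELL_DRIVER = 555  # Blackwell needs driver branch 555+

def driver_ok(smi_banner: str, minimum: int = MIN_BLACKWELL_DRIVER) -> bool:
    """Check the 'Driver Version: NNN.xx.xx' field in nvidia-smi's banner."""
    m = re.search(r"Driver Version:\s*(\d+)\.", smi_banner)
    if m is None:
        raise ValueError("no driver version found in nvidia-smi output")
    return int(m.group(1)) >= minimum

# Illustrative banner lines — an LTS distro pinned to 535 fails the check:
assert driver_ok("| NVIDIA-SMI 560.35.03   Driver Version: 560.35.03   CUDA Version: 12.6 |")
assert not driver_ok("| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |")
print("driver check OK")
```

The same pattern extends to the CUDA version field for gotcha 6.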

Payback timeline and verdict

For pure-throughput workloads the 5090 does not pay for itself – cost-per-token is slightly worse. The upgrade earns its keep on indirect economics:

  • Capacity unlock – serving traffic the 4090 turned away. Payback is immediate at the moment you unblock revenue.
  • UX threshold – if a sub-250ms TTFT is contractually required, the 5090 is the cheapest way to hit it on a single card.
  • Roadmap insurance – 2026-era frontier models will increasingly target FP4 and 32k-128k native context. The 5090 carries the deployment forward 18-24 months without another forklift.

Verdict. Upgrade if VRAM pressure, TTFT SLA, or FP4 roadmap pressure is real. Stay on the 4090 if cost-per-token is the dominant metric and your model menu is stable in the 8B-14B FP8 / 70B AWQ range. The decision logic is split out in the 4090-or-5090 decision post.


See also: 4090 vs 5090 decision, spec deep-dive, when to upgrade, upgrade to 6000 Pro, spec breakdown, tier positioning 2026, FP8 Llama deployment.
