
RTX 4090 24GB or RTX 5090 32GB: Decision Guide

Newer Blackwell with 32GB GDDR7 versus the proven 24GB Ada workhorse: per-pound performance, per-watt efficiency, and a per-workload winner table for choosing the right card.

The RTX 5090 32GB brings Blackwell’s GB202 die, GDDR7, and 1.79 TB/s of bandwidth – the most aggressive consumer-class GPU NVIDIA has shipped to date. The RTX 4090 24GB is a known quantity with mature tooling, native 4th-gen FP8, and a price point UK teams already budget for. The 5090 wins on raw throughput and on VRAM headroom; the 4090 wins on £/throughput and on tooling stability. This decision guide pits them head to head with concrete throughput numbers, watts-per-token efficiency, a per-pound metric, and a 10-workload winner table, anchored to dedicated 4090 hosting at GigaGPU. Both cards live in the wider UK GPU range.

Contents

- Spec delta
- Throughput comparison
- VRAM headroom: 24GB vs 32GB
- Per-pound and per-watt performance
- Per-workload winner (10 workloads)
- FP4 native: what Blackwell unlocks
- Production gotchas
- Verdict and when each card wins

Spec delta

| Spec | RTX 4090 24GB | RTX 5090 32GB | Delta |
|---|---|---|---|
| Architecture | Ada AD102 | Blackwell GB202 | +1 generation |
| CUDA cores | 16,384 | 21,760 | +33% |
| Tensor cores | 512 (4th gen) | 680 (5th gen) | +33%, FP4 native |
| VRAM | 24GB GDDR6X | 32GB GDDR7 | +33% |
| Memory bandwidth | 1,008 GB/s | 1,792 GB/s | +78% |
| TDP | 450W | 575W | +28% |
| FP8 generation | 4th gen | 5th gen | Faster matmul, plus FP4 |
| FP4 native | No | Yes | New format support |
| NVLink | No | No (PCIe Gen5 x16) | 5090 gets Gen5 (2x lane bandwidth) |
| FP16 TFLOPS (dense) | 165 | ~280 | +70% |
| Approx UK dedicated £/mo | £550 | £900 | +£350/mo (+64%) |

Throughput comparison

Sustained vLLM throughput at batch 1 and at typical concurrency. The 5090's bigger memory bandwidth does most of the work for inference, since LLM decode is bandwidth-bound. The extra 33% of CUDA cores helps batching and prefill, and the 5th-gen tensor cores bring small per-op efficiency gains on FP8.

| Workload | RTX 4090 | RTX 5090 | Speedup |
|---|---|---|---|
| Llama 3.1 8B FP8, batch 1 | 198 t/s | ~280 t/s | 1.41x |
| Llama 3.1 8B FP8, concurrency 8 | ~1,100 t/s aggregate | ~1,650 t/s aggregate | 1.50x |
| Llama 3.1 8B FP8, concurrency 32 | ~1,800 t/s aggregate | ~2,800 t/s aggregate | 1.56x |
| Llama 3.1 70B AWQ INT4, batch 1 | 22 t/s | ~36 t/s | 1.64x |
| Llama 3.1 70B AWQ INT4, concurrency 4 | ~110 t/s aggregate | ~180 t/s aggregate | 1.64x |
| Llama 3.1 70B FP8, batch 1 | OOM at 24GB | ~30 t/s (tight on 32GB) | n/a |
| Qwen 2.5 32B FP8 | OOM | ~85 t/s | n/a |
| SDXL 1024×1024, 30 steps | 3.4s/image | 2.1s/image | 1.62x |
| Flux.1 Dev 1024×1024 | 14s/image | 8.5s/image | 1.65x |
| Whisper Large v3, 1hr audio | 22s | 14s | 1.57x |
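A quick sanity check on the bandwidth-bound claim: for single-stream decode, the best-case speedup is bounded by the memory-bandwidth ratio, and observed speedups should sit below that ceiling. A minimal sketch using the spec-sheet numbers above:

```python
# Decode re-reads the weights for every generated token, so for a
# bandwidth-bound workload the 5090/4090 speedup ceiling is the
# memory-bandwidth ratio, not the CUDA-core or TFLOPS ratio.
BW_4090_GBPS = 1008   # GDDR6X
BW_5090_GBPS = 1792   # GDDR7

bandwidth_ceiling = BW_5090_GBPS / BW_4090_GBPS   # ~1.78x upper bound

# Observed decode speedups from the table above all land below the ceiling,
# consistent with bandwidth doing most of the work (compute, scheduling and
# kernel overheads eat the remainder).
observed = {"8B FP8 batch 1": 1.41, "8B FP8 conc 32": 1.56, "70B INT4 batch 1": 1.64}
assert all(s < bandwidth_ceiling for s in observed.values())
print(f"bandwidth ceiling: {bandwidth_ceiling:.2f}x")
```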

VRAM headroom: 24GB vs 32GB

The extra 8GB on the 5090 unlocks specific workloads. Most importantly, Llama 70B FP8 (~38GB nominal, but with paged KV and tight quantisation it can squeeze onto 32GB at low concurrency) becomes marginally feasible. Qwen 32B in FP8 fits with a full KV cache. Mixtral 8x7B AWQ moves from "tight" on the 4090 to "comfortable" on the 5090, and 128k context windows on 8B models become realistic.

| Model | 4090 24GB | 5090 32GB |
|---|---|---|
| Llama 8B FP8 (4k context) | 16GB free for KV | 24GB free for KV (1.5x headroom) |
| Llama 8B FP8 (128k context) | Tight, FP8 KV needed | Comfortable |
| Llama 70B AWQ INT4 | Fits, FP8 KV needed | Comfortable, FP16 KV OK |
| Llama 70B FP8 | OOM (~38GB) | Tight but possible at concurrency 1-2 |
| Qwen 32B FP8 | ~32GB – OOM | Fits |
| Mixtral 8x7B AWQ | ~25GB – swap risk | Fits cleanly |
| Flux.1 Dev BF16 | Fits with offload | Fits cleanly |
| Llama 405B AWQ INT4 | OOM | OOM |
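The KV budgets above can be sanity-checked from model shape. A minimal sketch using Llama 3.1 8B's published architecture (32 layers, 8 KV heads via GQA, head dim 128); treat the figures as rough, since inference runtimes add their own overheads:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # One K and one V entry per layer, per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 KV entries.
per_tok_fp16 = kv_bytes_per_token(32, 8, 128, 2)   # 131,072 B = 128 KB/token
kv_128k_gb = per_tok_fp16 * 128_000 / 1e9          # ~16.8 GB

# On a 4090, FP8 weights (~8GB) leave ~16GB free: a 128k FP16 KV cache barely
# fits ("tight"), while FP8 KV halves it to ~8.4GB. A 5090 has ~24GB free.
print(f"{per_tok_fp16 / 1024:.0f} KB/token; 128k context: {kv_128k_gb:.1f} GB FP16")
```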

Per-pound and per-watt performance

Assume £550/month for a dedicated 4090 and ~£900/month for a dedicated 5090. The 5090 gives ~1.5x the inference throughput at ~1.64x the price – so per-pound the 4090 still leads on raw t/s/£ on bandwidth-bound workloads. The 5090 wins on per-watt efficiency at sustained throughput because the GDDR7 is meaningfully more efficient per byte transferred.

| Metric | 4090 | 5090 | Winner |
|---|---|---|---|
| Llama 8B FP8 t/s per £/mo | 2.00 | 1.83 | 4090 |
| Llama 70B INT4 t/s per £/mo | 0.20 | 0.20 | Tied |
| SDXL £/image (24/7 queue) | £0.0009 | £0.0010 | 4090 |
| Llama 8B FP8 t/s per W (TDP) | 0.44 | 0.49 | 5090 |
| Llama 70B INT4 t/s per W | 0.049 | 0.063 | 5090 |
| Workloads that fit on one card | Most 24GB-class | 32GB-class too | 5090 |
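The table rows are simple divisions of the throughput, pricing, and TDP figures quoted earlier; a minimal sketch reproducing them:

```python
# Inputs from the spec and throughput sections above.
PRICE = {"4090": 550, "5090": 900}              # £/month, dedicated
TDP = {"4090": 450, "5090": 575}                # watts
TPS_8B_CONC8 = {"4090": 1100, "5090": 1650}     # aggregate t/s, Llama 8B FP8
TPS_8B_BATCH1 = {"4090": 198, "5090": 280}
TPS_70B_BATCH1 = {"4090": 22, "5090": 36}

for gpu in ("4090", "5090"):
    per_pound = TPS_8B_CONC8[gpu] / PRICE[gpu]       # t/s per £/mo
    per_watt_8b = TPS_8B_BATCH1[gpu] / TDP[gpu]      # t/s per TDP watt
    per_watt_70b = TPS_70B_BATCH1[gpu] / TDP[gpu]
    print(f"{gpu}: {per_pound:.2f} t/s per £/mo, "
          f"{per_watt_8b:.2f} t/s/W (8B), {per_watt_70b:.3f} t/s/W (70B)")
```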

Per-workload winner (10 workloads)

| Workload | 4090 wins | 5090 wins | Why |
|---|---|---|---|
| Llama 8B FP8 chat API, concurrency 8 | Yes | No | £/token favours 4090 |
| Llama 70B AWQ INT4, single card | Yes | No | Tied per token, 4090 cheaper |
| Llama 70B FP8, single card | No | Yes | Only the 5090 fits it |
| Qwen 32B FP8 | No | Yes | Only the 5090 fits it |
| SDXL 24/7 image queue | Yes | No | £/image favours 4090 |
| Flux.1 Dev image gen | Marginal | Yes | 5090 1.65x faster, less offload needed |
| Sub-100ms TTFT chat at high concurrency | No | Yes | 5090 latency advantage |
| FP4 quantised inference | No | Yes | 4090 has no native FP4 |
| 128k context Llama 8B | Tight | Yes | 5090 KV headroom |
| 3-year deployment, future-proofing | No | Yes | Blackwell's longer support window |

FP4 native: what Blackwell unlocks

The 5090 has hardware support for FP4 (the E2M1 format). For inference, FP4 weights halve the memory footprint vs FP8 with quality loss roughly comparable to AWQ INT4 – but with much faster matmul, because the tensor cores execute the format natively. Llama 70B in FP4 fits in ~22GB of weights, well within the 5090's 32GB. The 4090 has no native FP4 path: emulation through INT4 kernels exists, but it loses the architectural advantage.

As of 2026 the FP4 ecosystem is still maturing. vLLM, TensorRT-LLM and SGLang have FP4 paths but with more limited model coverage than FP8. If your roadmap includes adopting FP4 in 2026-2027, the 5090 is the obvious target. If you’re optimising for known-good FP8 today, the 4090 is the safer bet.
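The footprint halving is simple arithmetic on parameter count and bits per weight. A minimal sketch for an 8B-class model (raw weight storage only; quantisation scales and runtime overhead add a few percent on top):

```python
def weight_gb(params_billion, bits_per_weight):
    # Raw weight bytes only; excludes scales, zero-points and runtime overhead.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama-8B-class model across formats: each bit-width halving halves storage.
for fmt, bits in (("FP16", 16), ("FP8", 8), ("FP4/INT4", 4)):
    print(f"{fmt:>8}: {weight_gb(8, bits):.0f} GB")
```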

Production gotchas

  1. 5090 needs CUDA 12.8+ and recent inference stacks. vLLM 0.6+, TensorRT-LLM 0.13+, SGLang 0.4+. Older container images will not detect sm_120 correctly.
  2. 575W power envelope. 5090 host PSU needs 1000W+ headroom. Many 1U/2U dedicated chassis cannot accommodate it – confirm with hosting provider.
  3. 5090 supply remains tight in 2026. Capacity at most providers is rationed. Lead times for dedicated 5090 hosting can exceed a week.
  4. FP4 tooling immaturity. The format is supported but not all model variants ship with FP4 quantised weights yet. Check model availability before betting on FP4 throughput.
  5. Cooling at 575W in dense racks. Sustained inference loads pull peak TDP. Inadequate airflow throttles to ~480W and loses 15-20% throughput.
  6. 4090 mature but EOL approaching. NVIDIA driver support continues but new architectural features ship Blackwell-first. Plan for a 2-3 year operational window on Ada.
  7. PCIe Gen5 advantage rarely matters. Most inference workloads do not saturate PCIe Gen4 x16 (32 GB/s each direction). Gen5’s doubling helps mainly multi-GPU training and very large prefill batches.
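Gotcha 1's version floors are easy to gate on at deploy time. A minimal sketch, assuming the package names and minimum versions listed above (check your provider's image for the actual pins):

```python
# Minimum inference-stack versions for Blackwell (sm_120), per the gotchas above.
MIN_VERSIONS = {"vllm": (0, 6, 0), "tensorrt-llm": (0, 13, 0), "sglang": (0, 4, 0)}

def parse_version(v):
    # Keep only the leading numeric components ("0.6.3.post1" -> (0, 6, 3)).
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def blackwell_ready(installed):
    """Map package -> bool: does the installed version meet the floor?"""
    return {pkg: parse_version(ver) >= MIN_VERSIONS[pkg]
            for pkg, ver in installed.items() if pkg in MIN_VERSIONS}

print(blackwell_ready({"vllm": "0.6.3", "sglang": "0.3.9"}))
```

Here vllm 0.6.3 clears the 0.6 floor while sglang 0.3.9 falls below 0.4 and should block the rollout.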

Verdict and when each card wins

For pure cost-efficiency on workloads that already fit in 24GB (Llama 8B FP8 chat APIs, Qwen 14B, Mistral 7B, Llama 70B AWQ INT4, SDXL), the 4090 stays the better buy in 2026 – it wins on cost per token by roughly 8-10%. For 32GB-class work (single-card Llama 70B FP8, Qwen 32B FP8, a comfortable Mixtral 8x7B, 128k context windows), FP4 experimentation, sub-100ms TTFT requirements, or a 2-3 year deployment that wants Blackwell's longer support window, the 5090 justifies the £350/mo premium. If you cannot decide, the 4090 is the safer financial choice today and the 5090 is the better long-term bet. Many teams run both: 4090s for the cost-bound 8B chat fleet and a 5090 or two for the larger-model tier. Order via GigaGPU dedicated hosting.
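The 8-10% figure follows directly from the pricing and aggregate-throughput numbers used throughout this guide; a minimal sketch of the cost-per-million-tokens arithmetic (assumes 24/7 saturation, which real fleets rarely hit):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000

def pounds_per_million_tokens(price_per_month, tokens_per_second):
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH
    return price_per_month / (tokens_per_month / 1e6)

# Llama 8B FP8 at concurrency 8 (aggregate t/s from the throughput table).
cost_4090 = pounds_per_million_tokens(550, 1100)   # ~£0.193 / Mtok
cost_5090 = pounds_per_million_tokens(900, 1650)   # ~£0.211 / Mtok

premium = cost_5090 / cost_4090 - 1                # ~9%, inside the 8-10% range
print(f"4090: £{cost_4090:.3f}/Mtok, 5090: £{cost_5090:.3f}/Mtok, +{premium:.0%}")
```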

Proven Ada workhorse

24GB GDDR6X, native 4th-gen FP8, mature tooling, best £/token for 24GB-class workloads. UK dedicated hosting.

Order the RTX 4090 24GB

See also: 4090 vs 5090 spec deep-dive, upgrade path to 5090, when to upgrade, FP8 tensor cores, spec breakdown, tier positioning 2026, Llama 8B benchmark, Llama 70B INT4 benchmark, tokens per watt, 3090 decision, 5080 decision, 5060 Ti decision, multi-card pairing, vs cloud H100, 5090 vs 3090, power draw efficiency, 70B INT4 VRAM.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
