RTX 4090 24GB vs RTX 5090 32GB: Ada vs Blackwell Throughput Uplift

How Blackwell's 5th-gen tensor cores, 1.79 TB/s GDDR7 bandwidth, FP4 path and 32GB VRAM compare against the Ada-based RTX 4090 24GB across LLM, diffusion and fine-tuning workloads — with per-workload winners, per-pound and per-watt tables.

The RTX 5090 is Blackwell’s consumer flagship and the natural successor to the RTX 4090. If you are choosing between the two for AI inference on UK GPU hosting, you are weighing 33% more VRAM, 78% more memory bandwidth, native FP4, 5th-generation tensor cores and PCIe Gen 5 against a more proven, lower-power Ada platform with two extra years of toolchain maturity. The RTX 4090 24GB remains the value pick for many production workloads — this post walks through exactly which workloads pay the 5090 premium back, which do not, and where 32GB unlocks behaviour the 4090 simply cannot match without sharding.

Spec sheet side by side

Blackwell is not a clean rebuild of Ada in the way Ada was of Ampere. NVIDIA reused the basic SM block diagram and concentrated its silicon budget on the tensor cores, the memory subsystem and a more aggressive Transformer Engine. The result is a card that scales primarily with bandwidth and tensor-core count rather than CUDA-core count.

Spec | RTX 4090 (Ada AD102) | RTX 5090 (Blackwell GB202) | Delta
Process node | TSMC 4N | TSMC 4NP | Refined node
SM count | 128 | 170 | +33%
CUDA cores | 16,384 | 21,760 | +33%
Tensor cores | 512 (4th gen) | 680 (5th gen) | +33%, FP4 added
RT cores | 128 (3rd gen) | 170 (4th gen) | +33%
Boost clock | 2.52 GHz | 2.41 GHz | -4% (more SMs)
VRAM | 24 GB GDDR6X (21 Gbps) | 32 GB GDDR7 (28 Gbps) | +33% capacity
Memory bandwidth | 1008 GB/s | 1792 GB/s | +78%
Memory bus | 384-bit | 512-bit | +33%
L2 cache | 72 MB | ~96 MB | +33%
FP16 dense TFLOPS | 165 | ~209 | +27%
FP8 dense TFLOPS | 660 | ~838 | +27%
FP4 dense TFLOPS | None | ~1676 | New format
TDP | 450W | 575W | +28%
PCIe | Gen 4 x16 | Gen 5 x16 | 2x effective
NVENC / NVDEC | 2x 8th + 1x 5th | 3x 9th + 2x 6th | More streams

Bandwidth is the headline. Going from 21 Gbps GDDR6X on a 384-bit bus to 28 Gbps GDDR7 on a 512-bit bus pushes the 5090 to 1.79 TB/s — close to A100 PCIe territory and almost double the 4090. For decode-bound LLM inference where the bottleneck is reading weights from VRAM into the tensor cores, that uplift translates directly into single-stream tokens per second. Compare against the RTX 4090 spec breakdown for full context.
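The headline bandwidth figures follow directly from bus width and per-pin data rate; a quick sanity check of the spec-table numbers:

```shell
# Peak bandwidth (GB/s) = bus width in bits / 8 * per-pin data rate in Gbps.
# Figures from the spec table above; awk handles the arithmetic.
bw_4090=$(awk 'BEGIN { printf "%.0f", 384 / 8 * 21 }')   # GDDR6X, 384-bit
bw_5090=$(awk 'BEGIN { printf "%.0f", 512 / 8 * 28 }')   # GDDR7, 512-bit
echo "RTX 4090: ${bw_4090} GB/s, RTX 5090: ${bw_5090} GB/s"
```

The 512-bit bus contributes +33% and the faster GDDR7 pins the other +33%, which compound to the +78% delta.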

5th-gen vs 4th-gen tensor cores and FP4

4th-gen tensor cores on Ada introduced FP8 (E4M3 and E5M2). 5th-gen on Blackwell keeps both and adds FP4 (E2M1 and a microscaling MX-FP4 variant), doubling the theoretical throughput once again for any operator that tolerates 4-bit precision. The catch is that FP4 inference quality is sensitive to calibration — Llama 3 8B in MX-FP4 holds within 0.5 points of FP8 on MMLU, but Qwen 2.5 32B drops 1.5-2.0 points. For coding workloads where exactness matters (Qwen Coder, DeepSeek Coder) FP8 remains the safer default even on Blackwell. The Transformer Engine in CUDA 12.5+ on Blackwell silently mixes FP8 and FP4 per layer, but you should validate against your eval suite rather than trust the default.
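For intuition on why FP4 is calibration-sensitive: an E2M1 value has one sign bit, two exponent bits (bias 1) and one mantissa bit, so each sign covers only eight magnitudes. A quick enumeration (a sketch of the raw grid; the MX-FP4 variant adds a shared per-block scale on top of it):

```shell
# Enumerate the 8 non-negative E2M1 magnitudes: exponent e in 0..3 (bias 1,
# e=0 is subnormal), mantissa bit m in 0..1.
fp4_vals=$(awk 'BEGIN {
  for (e = 0; e <= 3; e++)
    for (m = 0; m <= 1; m++) {
      v = (e == 0) ? m * 0.5 : (1 + m * 0.5) * 2 ^ (e - 1)
      printf "%s%g", (e || m) ? " " : "", v
    }
}')
echo "$fp4_vals"   # 0 0.5 1 1.5 2 3 4 6
```

With only {0, 0.5, 1, 1.5, 2, 3, 4, 6} available per block scale, outlier weights dominate the quantisation error, which is exactly what per-model calibration has to manage.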

Operationally, FP8 on Ada and FP8 on Blackwell produce equivalent numerics for the same model — the 5090 is faster mostly because it has more tensor cores and far more bandwidth, not because Blackwell does FP8 differently. The FP4 path is the genuinely new capability, useful chiefly for the largest models where 8-bit weights cannot fit. See FP8 tensor cores on Ada for the architectural counterpart.

Bandwidth, GDDR7 and the cache hierarchy

Single-stream LLM decode is bandwidth-bound: at batch 1, almost every token requires reading the model weights from VRAM. On Llama 3 8B FP8 the 4090’s 1008 GB/s sustains ~198 t/s; the 5090’s 1792 GB/s sustains ~280 t/s — a 1.41x uplift, short of the raw 1.78x bandwidth ratio because the larger L2 absorbs part of the weight traffic. On batched inference where compute matters more, the gap closes further: at batch 32, aggregate throughput is roughly 1100 t/s on the 4090 vs 1500 t/s on the 5090 (1.36x), and on prefill-heavy RAG it can drop to 1.25x.
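Those ratios can be reproduced in a couple of lines (figures as quoted in the paragraph above):

```shell
# Raw bandwidth ratio vs observed batch-1 decode uplift on Llama 3 8B FP8.
bw_ratio=$(awk 'BEGIN { printf "%.2f", 1792 / 1008 }')
tok_ratio=$(awk 'BEGIN { printf "%.2f", 280 / 198 }')
echo "bandwidth ratio ${bw_ratio}x, observed decode uplift ${tok_ratio}x"
```

Decode lands at roughly 80% of the raw bandwidth ratio; the remainder is soaked up by L2 hits and non-weight overheads such as KV-cache reads.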

Throughput uplift across nine workloads

Workload | RTX 4090 | RTX 5090 | Uplift
Llama 3.1 8B FP8 decode b1 | 198 t/s | 280 t/s | 1.41x
Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 1500 t/s | 1.36x
Llama 3.1 70B AWQ INT4 decode b1 | 22-24 t/s | 34-37 t/s | 1.55x
Llama 3.1 70B FP8 | n/a (does not fit) | 27 t/s | n/a
Qwen 2.5 32B AWQ decode b1 | 65 t/s | 92 t/s | 1.42x
SDXL 1024×1024 30-step | 2.0s | 1.4s | 1.43x
FLUX.1-dev FP8 30-step | 4.1s | 2.7s | 1.52x
Whisper large-v3-turbo INT8 | 80x RT | 130x RT | 1.63x
QLoRA Llama 8B (steps/s) | 2.6 | 3.7 | 1.42x

Across these nine workloads the 5090 is consistently 1.35-1.65x the 4090. The big exception is anything that needs 70B at FP8: the 4090 simply cannot fit it, while the 5090 can with 8k context. See the Llama 70B INT4 benchmark for the AWQ baseline and the Llama 3 8B benchmark for the 8B numbers.

When 32GB unlocks new workloads

Workload | RTX 4090 24GB | RTX 5090 32GB
Llama 3.1 70B AWQ INT4 + 16k FP8 KV | Tight, --max-num-seqs 4 | Comfortable, 32k context
Llama 3.1 70B FP8 (35 GB weights) | OOM | Fits with 8k context
Mixtral 8x22B AWQ (74 GB) | OOM | OOM
Qwen 2.5 72B AWQ INT4 | Tight, 4k context | Fits, 32k context
SDXL training LoRA + Refiner cached | Tight | Comfortable
FLUX.1-dev FP16 (22 GB peak) | Risky | Comfortable
Llama Vision 11B + KV at 32k | Tight | Comfortable

The four workloads where 32GB genuinely matters are: Llama 70B at FP8 (instead of AWQ), Qwen 72B at long context, FLUX in full FP16 with caching, and any large-context vision model. For everything else, the 24GB on the 4090 is sufficient.

Power, economics and tokens-per-watt

Metric | RTX 4090 | RTX 5090
TDP | 450W | 575W
Sustained LLM batch 32 power | 360W | 460W
Aggregate t/s on Llama 3 8B FP8 b32 | 1100 | 1500
Tokens/Joule | 3.06 | 3.26
UK price (typical 2026) | £1,300 | £2,100
£/aggregate t/s (b32) | £1.18 | £1.40
£/decode t/s (b1) | £6.57 | £7.50
Annual electricity @ 24/7, £0.18/kWh | £568 | £725
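The derived rows in the table follow from the raw inputs (prices, sustained power and throughput as quoted above; £0.18/kWh assumed):

```shell
# £ per aggregate t/s at batch 32 = UK price / aggregate throughput.
ppt_4090=$(awk 'BEGIN { printf "%.2f", 1300 / 1100 }')
ppt_5090=$(awk 'BEGIN { printf "%.2f", 2100 / 1500 }')
# Annual electricity at 24/7 sustained draw: W * hours/year / 1000 * £/kWh.
elec_4090=$(awk 'BEGIN { printf "%.0f", 360 * 24 * 365 / 1000 * 0.18 }')
elec_5090=$(awk 'BEGIN { printf "%.0f", 460 * 24 * 365 / 1000 * 0.18 }')
echo "4090: £${ppt_4090} per t/s, £${elec_4090}/yr; 5090: £${ppt_5090} per t/s, £${elec_5090}/yr"
```

Swap in your own electricity tariff and street prices; the ranking between the two cards is fairly insensitive to both.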

The per-pound numbers favour the 4090. Per-watt slightly favours the 5090. For a fleet of ten cards serving a multi-tenant SaaS, the 4090 is currently 18-25% cheaper per delivered token. The 5090 wins decisively only when you need the VRAM. See the tokens-per-watt and vs OpenAI API cost analyses.

Per-workload winner table

Workload | Winner | Why
200-MAU SaaS RAG on Llama 8B | 4090 | Better £/perf, plenty of headroom
12-engineer Qwen Coder 32B AWQ team | 4090 | 65 t/s exceeds typing speed already
Single-tenant Llama 70B at FP8 | 5090 | 4090 cannot fit FP8
Llama 70B AWQ INT4 with 32k context | 5090 | 4090 caps at 16k
SDXL studio, 500 imgs/day | 4090 | 2.0s suffices, lower power
FLUX.1-dev production at scale | 5090 | 1.5x faster, 32GB headroom
Voice agent (Whisper + TTS) | 4090 | 80x RT is overkill already
Multi-tenant 70B FP8 endpoint | 5090 | VRAM unlocks single-card serve
Tokens-per-watt-optimised hosting | 5090 | Marginally better t/J
Capex-constrained startup | 4090 | 40% cheaper, 70% the throughput

vLLM serving on each card

The configurations look almost identical because vLLM abstracts most of the hardware differences; the 5090 commands simply raise --max-num-seqs and extend the context length.

# RTX 4090 — Llama 3.1 8B FP8, 16k context, 32-way batching
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

# RTX 5090 — same model, 64k context, 64-way batching
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 65536 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92

# RTX 5090 only — Llama 3.1 70B at FP8 (does not fit on 4090)
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 8192 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94
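The --max-model-len and --max-num-seqs choices are ultimately bounded by KV-cache size. A quick estimate, using Llama 3.1 8B's published config (32 transformer layers, 8 KV heads under GQA, head dim 128) and 1 byte per element for fp8_e4m3:

```shell
# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
kv_per_tok=$(( 2 * 32 * 8 * 128 * 1 ))
# Cost of one fully extended 16k-token sequence, in GiB.
kv_16k_gib=$(awk -v b="$kv_per_tok" 'BEGIN { printf "%.1f", b * 16384 / 1073741824 }')
echo "${kv_per_tok} B/token -> ${kv_16k_gib} GiB for one 16k sequence"
```

At 64 KiB per token, a single full 16k sequence costs 1 GiB of KV cache. vLLM allocates the pool dynamically across sequences, so raising --max-model-len mostly trades against achievable concurrency rather than reserving the worst case up front.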

Production gotchas with the 5090

  • 575W power draw is a chassis problem. Not every 4U server with 4090 cooling can keep a 5090 under thermal limits at sustained load. Expect to see throttling on any chassis without dedicated 5090 airflow design.
  • 12V-2×6 connectors only. The older 12VHPWR connector is technically compatible, but the seating issues that plagued early 4090 units can recur. Use 12V-2×6 cables and seat them firmly.
  • PCIe Gen 5 only matters for multi-card. Single-card inference does not saturate Gen 4 x16. The benefit appears only when you do tensor-parallel inference or NCCL all-reduce across cards.
  • FP4 calibration is non-trivial. Do not just flip --quantization fp4 in production without an eval pass. Some models lose 2-3 MMLU points.
  • Driver 555+ required. Older driver branches do not expose 5th-gen tensor instructions correctly. Pin your container base image accordingly.
  • NVENC 9th gen is strong, but stable FFmpeg releases may not yet support it. If you do video transcoding alongside inference, check that your FFmpeg build includes 9th-gen NVENC support.
  • Resale risk. The 5090’s high price means depreciation is steeper than the 4090, which is already at its second-hand floor. Consider this when sizing capex.
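The driver requirement above is worth checking before you pull a Blackwell image; a minimal preflight (assumes only that nvidia-smi is on PATH when a GPU is present, and uses the 555+ threshold from the note above):

```shell
# Read the installed driver branch and compare against the 555+ requirement.
# Degrades gracefully on hosts without an NVIDIA driver.
drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1)
if [ -z "$drv" ]; then
  status="no-driver"
elif [ "${drv%%.*}" -ge 555 ]; then
  status="ok ($drv)"
else
  status="too-old ($drv)"
fi
echo "driver check: $status"
```

Wire this into your provisioning scripts so a node with a stale driver branch fails fast instead of producing wrong-instruction errors at serve time.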

Which to pick

  • Pick the 4090 24GB if your model fits in 24GB at FP8 or AWQ; you are price-sensitive; you serve fewer than 100 concurrent users; or you want the lowest £/delivered-token. See the 4090 or 5090 decision guide.
  • Pick the 5090 32GB if you need 70B+ at FP8 on a single card; you need 32k+ context on Qwen 72B; you want FP4 for the largest models; or you are building a long-tenure rack where the 1.4x throughput compounds.
  • Pick neither if you need far more VRAM on a single card: step up to the RTX 6000 Pro 96GB or an H100 80GB.

For a 200-MAU SaaS RAG on Llama 8B FP8, the 4090 is the clear choice. For a 70B-FP8 internal endpoint where capex is amortised over 24 months, the 5090 wins. For a 12-engineer coding team on Qwen 32B AWQ, both work, but the 4090 saves £800 per node.

Provision a 4090 today, evaluate the 5090 in three months

GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB on properly cooled, FP8-ready images. Start serving today; migrate to a 5090 when you actually need the VRAM.

Order the RTX 4090 24GB

See also: RTX 4090 spec breakdown, 2026 tier positioning, FP8 tensor cores on Ada, 4090 or 5090 decision, when to upgrade, Llama 70B INT4 deployment, RTX 4090 vs RTX 3090.
