RTX 4090 24GB vs RTX 5090 32GB: Ada vs Blackwell Throughput Uplift

How Blackwell's 5th-gen tensor cores, 1.79 TB/s GDDR7 bandwidth, FP4 path and 32GB VRAM compare against the Ada-based RTX 4090 24GB across LLM, diffusion and fine-tuning workloads — with per-workload winners, per-pound and per-watt tables.

The RTX 5090 is Blackwell’s consumer flagship and the natural successor to the RTX 4090. If you are choosing between the two for AI inference on UK GPU hosting, you are weighing 33% more VRAM, 78% more memory bandwidth, native FP4, 5th-generation tensor cores and PCIe Gen 5 against a more proven, lower-power Ada platform with two extra years of toolchain maturity. The RTX 4090 24GB remains the value pick for many production workloads — this post walks through exactly which workloads pay the 5090 premium back, which do not, and where 32GB unlocks behaviour the 4090 simply cannot match without sharding.

Spec sheet side by side

Blackwell is not a clean rebuild of Ada in the way Ada was of Ampere. NVIDIA reused the basic SM block diagram and concentrated its silicon budget on the tensor cores, the memory subsystem and a more aggressive Transformer Engine. The result is a card that scales primarily with bandwidth and tensor-core count rather than CUDA-core count.

Spec | RTX 4090 (Ada AD102) | RTX 5090 (Blackwell GB202) | Delta
Process node | TSMC 4N | TSMC 4NP | Refined node
SM count | 128 | 170 | +33%
CUDA cores | 16,384 | 21,760 | +33%
Tensor cores | 512 (4th gen) | 680 (5th gen) | +33%, FP4 added
RT cores | 128 (3rd gen) | 170 (4th gen) | +33%
Boost clock | 2.52 GHz | 2.41 GHz | -4% (more SMs)
VRAM | 24 GB GDDR6X (21 Gbps) | 32 GB GDDR7 (28 Gbps) | +33% capacity
Memory bandwidth | 1008 GB/s | 1792 GB/s | +78%
Memory bus | 384-bit | 512-bit | +33%
L2 cache | 72 MB | ~96 MB | +33%
FP16 dense TFLOPS | 165 | ~209 | +27%
FP8 dense TFLOPS | 660 | ~838 | +27%
FP4 dense TFLOPS | None | ~1676 | New format
TDP | 450W | 575W | +28%
PCIe | Gen 4 x16 | Gen 5 x16 | 2x effective
NVENC / NVDEC | 2x 8th + 1x 5th | 3x 9th + 2x 6th | More streams

Bandwidth is the headline. Going from 21 Gbps GDDR6X on a 384-bit bus to 28 Gbps GDDR7 on a 512-bit bus pushes the 5090 to 1.79 TB/s — close to A100 PCIe territory and almost double the 4090. For decode-bound LLM inference where the bottleneck is reading weights from VRAM into the tensor cores, that uplift translates directly into single-stream tokens per second. Compare against the RTX 4090 spec breakdown for full context.
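The headline bandwidth figures follow directly from bus width and per-pin data rate; a quick sanity check of the spec-table numbers:

```shell
# Peak bandwidth (GB/s) = bus width in bits / 8 * per-pin data rate in Gbps.
# Figures from the spec table above; awk handles the arithmetic.
bw_4090=$(awk 'BEGIN { printf "%.0f", 384 / 8 * 21 }')   # GDDR6X, 384-bit
bw_5090=$(awk 'BEGIN { printf "%.0f", 512 / 8 * 28 }')   # GDDR7, 512-bit
echo "RTX 4090: ${bw_4090} GB/s, RTX 5090: ${bw_5090} GB/s"
```

The 512-bit bus contributes +33% and the faster GDDR7 pins the other +33%, which compound to the +78% delta.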

5th-gen vs 4th-gen tensor cores and FP4

4th-gen tensor cores on Ada introduced FP8 (E4M3 and E5M2). 5th-gen on Blackwell keeps both and adds FP4 (E2M1 and a microscaling MX-FP4 variant), doubling the theoretical throughput once again for any operator that tolerates 4-bit precision. The catch is that FP4 inference quality is sensitive to calibration — Llama 3 8B in MX-FP4 holds within 0.5 points of FP8 on MMLU, but Qwen 2.5 32B drops 1.5-2.0 points. For coding workloads where exactness matters (Qwen Coder, DeepSeek Coder) FP8 remains the safer default even on Blackwell. The Transformer Engine in CUDA 12.5+ on Blackwell silently mixes FP8 and FP4 per layer, but you should validate against your eval suite rather than trust the default.
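For intuition on why FP4 is calibration-sensitive: an E2M1 value has one sign bit, two exponent bits (bias 1) and one mantissa bit, so each sign covers only eight magnitudes. A quick enumeration (a sketch of the raw grid; the MX-FP4 variant adds a shared per-block scale on top of it):

```shell
# Enumerate the 8 non-negative E2M1 magnitudes: exponent e in 0..3 (bias 1,
# e=0 is subnormal), mantissa bit m in 0..1.
fp4_vals=$(awk 'BEGIN {
  for (e = 0; e <= 3; e++)
    for (m = 0; m <= 1; m++) {
      v = (e == 0) ? m * 0.5 : (1 + m * 0.5) * 2 ^ (e - 1)
      printf "%s%g", (e || m) ? " " : "", v
    }
}')
echo "$fp4_vals"   # 0 0.5 1 1.5 2 3 4 6
```

With only {0, 0.5, 1, 1.5, 2, 3, 4, 6} available per block scale, outlier weights dominate the quantisation error, which is exactly what per-model calibration has to manage.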

Operationally, FP8 on Ada and FP8 on Blackwell produce equivalent numerics for the same model — the 5090 is faster mostly because it has more tensor cores and far more bandwidth, not because Blackwell does FP8 differently. The FP4 path is the genuinely new capability, useful chiefly for the largest models where 8-bit weights cannot fit. See FP8 tensor cores on Ada for the architectural counterpart.

Bandwidth, GDDR7 and the cache hierarchy

Single-stream LLM decode is bandwidth-bound: at batch 1, almost every token requires reading the model weights from VRAM. On Llama 3 8B FP8 the 4090’s 1008 GB/s sustains ~198 t/s; the 5090’s 1792 GB/s sustains ~280 t/s — a 1.41x uplift, short of the raw 1.78x bandwidth ratio because the larger L2 absorbs part of the weight traffic. On batched inference where compute matters more, the gap closes further: at batch 32, aggregate throughput is roughly 1100 t/s on the 4090 vs 1500 t/s on the 5090 (1.36x), and on prefill-heavy RAG it can drop to 1.25x.
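Those ratios can be reproduced in a couple of lines (figures as quoted in the paragraph above):

```shell
# Raw bandwidth ratio vs observed batch-1 decode uplift on Llama 3 8B FP8.
bw_ratio=$(awk 'BEGIN { printf "%.2f", 1792 / 1008 }')
tok_ratio=$(awk 'BEGIN { printf "%.2f", 280 / 198 }')
echo "bandwidth ratio ${bw_ratio}x, observed decode uplift ${tok_ratio}x"
```

Decode lands at roughly 80% of the raw bandwidth ratio; the remainder is soaked up by L2 hits and non-weight overheads such as KV-cache reads.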

Throughput uplift across nine workloads

Workload | RTX 4090 | RTX 5090 | Uplift
Llama 3.1 8B FP8 decode b1 | 198 t/s | 280 t/s | 1.41x
Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 1500 t/s | 1.36x
Llama 3.1 70B AWQ INT4 decode b1 | 22-24 t/s | 34-37 t/s | 1.55x
Llama 3.1 70B FP8 | n/a (does not fit) | 27 t/s | n/a
Qwen 2.5 32B AWQ decode b1 | 65 t/s | 92 t/s | 1.42x
SDXL 1024×1024 30-step | 2.0s | 1.4s | 1.43x
FLUX.1-dev FP8 30-step | 4.1s | 2.7s | 1.52x
Whisper large-v3-turbo INT8 | 80x RT | 130x RT | 1.63x
QLoRA Llama 8B (steps/s) | 2.6 | 3.7 | 1.42x

Across these nine workloads the 5090 is consistently 1.35-1.65x the 4090. The big exception is anything that needs 70B at FP8: the 4090 simply cannot fit it, while the 5090 can with 8k context. See the Llama 70B INT4 benchmark for the AWQ baseline and the Llama 3 8B benchmark for the 8B numbers.

When 32GB unlocks new workloads

Workload | RTX 4090 24GB | RTX 5090 32GB
Llama 3.1 70B AWQ INT4 + 16k FP8 KV | Tight, --max-num-seqs 4 | Comfortable, 32k context
Llama 3.1 70B FP8 (35 GB weights) | OOM | Fits with 8k context
Mixtral 8x22B AWQ (74 GB) | OOM | OOM
Qwen 2.5 72B AWQ INT4 | Tight, 4k context | Fits, 32k context
SDXL training LoRA + Refiner cached | Tight | Comfortable
FLUX.1-dev FP16 (22 GB peak) | Risky | Comfortable
Llama Vision 11B + KV at 32k | Tight | Comfortable

The four workloads where 32GB genuinely matters are: Llama 70B at FP8 (instead of AWQ), Qwen 72B at long context, FLUX in full FP16 with caching, and any large-context vision model. For everything else, the 24GB on the 4090 is sufficient.

Power, economics and tokens-per-watt

Metric | RTX 4090 | RTX 5090
TDP | 450W | 575W
Sustained LLM batch 32 power | 360W | 460W
Aggregate t/s on Llama 3 8B FP8 b32 | 1100 | 1500
Tokens/Joule | 3.06 | 3.26
UK price (typical 2026) | £1,300 | £2,100
£/aggregate t/s (b32) | £1.18 | £1.40
£/decode t/s (b1) | £6.57 | £7.50
Annual electricity @ 24/7, £0.18/kWh | £568 | £725
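The derived rows in the table follow from the raw inputs (prices, sustained power and throughput as quoted above; £0.18/kWh assumed):

```shell
# £ per aggregate t/s at batch 32 = UK price / aggregate throughput.
ppt_4090=$(awk 'BEGIN { printf "%.2f", 1300 / 1100 }')
ppt_5090=$(awk 'BEGIN { printf "%.2f", 2100 / 1500 }')
# Annual electricity at 24/7 sustained draw: W * hours/year / 1000 * £/kWh.
elec_4090=$(awk 'BEGIN { printf "%.0f", 360 * 24 * 365 / 1000 * 0.18 }')
elec_5090=$(awk 'BEGIN { printf "%.0f", 460 * 24 * 365 / 1000 * 0.18 }')
echo "4090: £${ppt_4090} per t/s, £${elec_4090}/yr; 5090: £${ppt_5090} per t/s, £${elec_5090}/yr"
```

Swap in your own electricity tariff and street prices; the ranking between the two cards is fairly insensitive to both.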

The per-pound numbers favour the 4090. Per-watt slightly favours the 5090. For a fleet of ten cards serving a multi-tenant SaaS, the 4090 is currently 18-25% cheaper per delivered token. The 5090 wins decisively only when you need the VRAM. See the tokens-per-watt and vs OpenAI API cost analyses.

Per-workload winner table

Workload | Winner | Why
200-MAU SaaS RAG on Llama 8B | 4090 | Better £/perf, plenty of headroom
12-engineer Qwen Coder 32B AWQ team | 4090 | 65 t/s exceeds typing speed already
Single-tenant Llama 70B at FP8 | 5090 | 4090 cannot fit FP8
Llama 70B AWQ INT4 with 32k context | 5090 | 4090 caps at 16k
SDXL studio, 500 imgs/day | 4090 | 2.0s suffices, lower power
FLUX.1-dev production at scale | 5090 | 1.5x faster, 32GB headroom
Voice agent (Whisper + TTS) | 4090 | 80x RT is overkill already
Multi-tenant 70B FP8 endpoint | 5090 | VRAM unlocks single-card serve
Tokens-per-watt-optimised hosting | 5090 | Marginally better t/J
Capex-constrained startup | 4090 | 40% cheaper, 70% the throughput

vLLM serving on each card

The configurations look almost identical because vLLM abstracts most of the hardware differences; the 5090 commands simply raise --max-num-seqs and extend the context length.

# RTX 4090 — Llama 3.1 8B FP8, 16k context, 32-way batching
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

# RTX 5090 — same model, 64k context, 64-way batching
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 65536 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92

# RTX 5090 only — Llama 3.1 70B at FP8 (does not fit on 4090)
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 8192 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94
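The --max-model-len and --max-num-seqs choices are ultimately bounded by KV-cache size. A quick estimate, using Llama 3.1 8B's published config (32 transformer layers, 8 KV heads under GQA, head dim 128) and 1 byte per element for fp8_e4m3:

```shell
# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
kv_per_tok=$(( 2 * 32 * 8 * 128 * 1 ))
# Cost of one fully extended 16k-token sequence, in GiB.
kv_16k_gib=$(awk -v b="$kv_per_tok" 'BEGIN { printf "%.1f", b * 16384 / 1073741824 }')
echo "${kv_per_tok} B/token -> ${kv_16k_gib} GiB for one 16k sequence"
```

At 64 KiB per token, a single full 16k sequence costs 1 GiB of KV cache. vLLM allocates the pool dynamically across sequences, so raising --max-model-len mostly trades against achievable concurrency rather than reserving the worst case up front.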

Production gotchas with the 5090

  • 575W power draw is a chassis problem. Not every 4U server with 4090 cooling can keep a 5090 under thermal limits at sustained load. Expect to see throttling on any chassis without dedicated 5090 airflow design.
  • 12V-2×6 connectors only. The older 12VHPWR connector is technically compatible, but the seating issues that plagued early 4090 units can recur. Use 12V-2×6 cables and seat them firmly.
  • PCIe Gen 5 only matters for multi-card. Single-card inference does not saturate Gen 4 x16. The benefit appears only when you do tensor-parallel inference or NCCL all-reduce across cards.
  • FP4 calibration is non-trivial. Do not just flip --quantization fp4 in production without an eval pass. Some models lose 2-3 MMLU points.
  • Driver 555+ required. Older driver branches do not expose 5th-gen tensor instructions correctly. Pin your container base image accordingly.
  • NVENC 9th gen is strong, but stable FFmpeg releases may not yet support it. If you do video transcoding alongside inference, check that your FFmpeg build includes 9th-gen NVENC support.
  • Resale risk. The 5090’s high price means depreciation is steeper than the 4090, which is already at its second-hand floor. Consider this when sizing capex.
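The driver requirement above is worth checking before you pull a Blackwell image; a minimal preflight (assumes only that nvidia-smi is on PATH when a GPU is present, and uses the 555+ threshold from the note above):

```shell
# Read the installed driver branch and compare against the 555+ requirement.
# Degrades gracefully on hosts without an NVIDIA driver.
drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1)
if [ -z "$drv" ]; then
  status="no-driver"
elif [ "${drv%%.*}" -ge 555 ]; then
  status="ok ($drv)"
else
  status="too-old ($drv)"
fi
echo "driver check: $status"
```

Wire this into your provisioning scripts so a node with a stale driver branch fails fast instead of producing wrong-instruction errors at serve time.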

Which to pick

  • Pick the 4090 24GB if your model fits in 24GB at FP8 or AWQ; you are price-sensitive; you serve fewer than 100 concurrent users; or you want the lowest £/delivered-token. See the 4090 or 5090 decision guide.
  • Pick the 5090 32GB if you need 70B+ at FP8 on a single card; you need 32k+ context on Qwen 72B; you want FP4 for the largest models; or you are building a long-tenure rack where the 1.4x throughput compounds.
  • Pick neither if you need far more VRAM on a single card: step up to the RTX 6000 Pro 96GB or an H100 80GB.

For a 200-MAU SaaS RAG on Llama 8B FP8, the 4090 is the clear choice. For a 70B-FP8 internal endpoint where capex is amortised over 24 months, the 5090 wins. For a 12-engineer coding team on Qwen 32B AWQ, both work, but the 4090 saves £800 per node.

Provision a 4090 today, evaluate the 5090 in three months

GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB on properly cooled, FP8-ready images. Start serving today; migrate to a 5090 when you actually need the VRAM.

Order the RTX 4090 24GB

See also: RTX 4090 spec breakdown, 2026 tier positioning, FP8 tensor cores on Ada, 4090 or 5090 decision, when to upgrade, Llama 70B INT4 deployment, RTX 4090 vs RTX 3090.
