The RTX 5080 16GB is Blackwell’s mid-range. On paper it has 5th-generation tensor cores, native FP4, GDDR7 at 30 Gbps and PCIe Gen 5 — every architectural advantage the 4090 lacks. But it ships with 16GB of VRAM, two-thirds of the RTX 4090 24GB’s buffer. For AI workloads on UK GPU hosting, that VRAM gap rules out an entire tier of models — Llama 3.1 70B INT4, Qwen 2.5 32B AWQ at long context, FLUX.1-dev with cached encoders. This deep-dive explains exactly which workloads the newer 5080 wins, where 16GB stops you cold, and how to think about the tradeoff for a typical 2026 production deployment.
Contents
- Spec sheet side by side
- The 16GB wall — what does not fit
- 5th-gen tensor cores and what they buy you
- Per-workload throughput comparison
- Power, price and £/token
- Per-workload winner table
- vLLM serving examples
- Production gotchas
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | RTX 5080 (Blackwell GB203) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC 4NP | Refined |
| SM count | 128 | 84 | -34% |
| CUDA cores | 16,384 | 10,752 | -34% |
| Tensor cores | 512 (4th gen, FP8) | 336 (5th gen, FP8 + FP4) | -34%, +FP4 |
| Boost clock | 2.52 GHz | 2.62 GHz | +4% |
| VRAM | 24 GB GDDR6X (21 Gbps) | 16 GB GDDR7 (30 Gbps) | -33% capacity |
| Memory bandwidth | 1008 GB/s | 960 GB/s | -5% |
| Memory bus | 384-bit | 256-bit | Narrower |
| L2 cache | 72 MB | ~64 MB | -11% |
| FP16 dense TFLOPS | 165 | ~135 | -18% |
| FP8 TFLOPS (sparse) | 660 | ~540 | -18% |
| FP4 TFLOPS (sparse) | None | ~1080 | New |
| TDP | 450W | 360W | -20% |
| PCIe | Gen 4 x16 | Gen 5 x16 | 2x effective |
The 5080 is a smaller die and a smaller card. Bandwidth is essentially tied. The 4090 has 22% more raw FP8 throughput — but the 5080 supports FP4 at twice that. For workloads that fit in 16GB, the 5080 is competitive. For workloads that need 24GB, no Blackwell tensor-core advantage matters. See the RTX 4090 spec breakdown for the Ada side.
The 16GB wall — what does not fit
16GB is enough for any 7-9B model at FP8 (BF16 is already marginal at 8B, where the weights alone are ~16 GB), and any 14B model at AWQ INT4 with healthy context. Beyond that, you hit a wall fast. Llama 3.1 70B AWQ INT4 weighs 17 GB on disk and needs another 4-6 GB for KV cache and activations — it simply does not load on the 5080. Qwen 2.5 32B AWQ at 8k context fits with about 1 GB to spare; raise context to 16k and you are over the limit. FLUX.1-dev in FP16 peaks at 22 GB during the joint attention pass; you must run FP8 (or accept CPU offload, which murders latency).
| Model / configuration | RTX 4090 24GB | RTX 5080 16GB |
|---|---|---|
| Llama 3.1 8B FP8 + 16k FP8 KV | Comfortable | Comfortable |
| Llama 3.1 8B FP8 + 64k FP8 KV | Tight | OOM |
| Qwen 2.5 14B AWQ + 16k context | Comfortable | Tight |
| Qwen 2.5 32B AWQ + 8k context | Comfortable | Tight |
| Qwen 2.5 32B AWQ + 16k context | Tight | OOM |
| Mixtral 8x7B AWQ | Comfortable | OOM (24 GB) |
| Llama 3.1 70B AWQ INT4 | Tight (17+5 GB) | OOM |
| FLUX.1-dev FP16 30-step | Comfortable | OOM (22 GB peak) |
| FLUX.1-dev FP8 30-step | Comfortable | Comfortable |
| SDXL + Refiner cached | Comfortable | Tight |
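Most of these fit calls can be sanity-checked with a back-of-envelope sum: quantized weight size, plus KV cache, plus a fixed allowance for activations and CUDA context. The sketch below is a heuristic, not a measurement: the 1.5 GiB overhead figure and the ~8.5 GB FP8 weight size are assumptions, and the Llama 3.1 8B shape (32 layers, 8 KV heads, head dim 128) is taken from the published config.

```python
# Back-of-envelope VRAM fit: weights + KV cache + fixed overhead.
GiB = 1024**3

def kv_cache_gib(layers, kv_heads, head_dim, context, batch=1, kv_bytes=1):
    """K and V cache size in GiB; kv_bytes=1 for FP8, 2 for FP16."""
    return 2 * layers * kv_heads * head_dim * kv_bytes * context * batch / GiB

def headroom_gib(vram_gib, weights_gib, **kv_args):
    """VRAM left after weights, KV cache and ~1.5 GiB of fixed overhead."""
    return vram_gib - weights_gib - 1.5 - kv_cache_gib(**kv_args)

# Llama 3.1 8B FP8 (~8.5 GB weights; 32 layers, 8 KV heads, head_dim 128)
for vram in (24, 16):
    h = headroom_gib(vram, 8.5, layers=32, kv_heads=8, head_dim=128,
                     context=16_384)
    print(f"{vram} GB card @ 16k context: {h:.1f} GiB headroom")
```

At 16k context both cards keep several GiB spare, which matches the first row of the table; push the context or the batch up and the 16GB column runs out first.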
5th-gen tensor cores and what they buy you
5th-gen tensor cores add FP4 (E2M1 and MX-FP4 microscaling) and bump the Transformer Engine to handle automatic precision selection per layer. For models that fit in 16GB, the 5080 can use FP4 to roughly halve effective weight memory traffic, recovering throughput where the 4090’s FP8 path is bandwidth-bound. On Llama 3.1 8B in MX-FP4, the 5080 hits ~210 t/s decode batch 1 versus the 4090’s 198 t/s in FP8 — essentially tied, despite the 5080’s smaller die. On Qwen 2.5 14B AWQ (4-bit weights but FP16 KV), the picture flips back: the 5080 is ~125 t/s vs the 4090’s 135, because the AWQ kernel does not benefit from FP4. Quality-wise, FP4 holds for Llama and Mistral families but loses 1-2 points on coding-heavy benchmarks (HumanEval) for Qwen Coder, so production deployments still ship FP8 by default. See FP8 on Ada for the architectural counterpart.
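If the same deployment image runs on mixed 4090/5080 fleets, the FP4-capable path has to be picked at runtime. A minimal sketch, assuming the usual compute-capability values (Ada reports 8.9, consumer Blackwell 12.0); the precision choices it prints are illustrative defaults, not recommendations from the benchmarks above.

```python
import torch

# Choose a serving precision from the card's compute capability.
# RTX 4090 (Ada) reports (8, 9); RTX 5080/5090 (consumer Blackwell) report (12, 0).
assert torch.cuda.is_available(), "no CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)

if (major, minor) >= (12, 0):
    plan = "FP4 weights (NVFP4/MX-FP4 checkpoint) + FP8 KV cache"
elif (major, minor) >= (8, 9):
    plan = "FP8 weights + FP8 KV cache"
else:
    plan = "AWQ/GPTQ INT4 weights + FP16 KV cache"

print(f"{name} (sm_{major}{minor}): serve with {plan}")
```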
Per-workload throughput comparison
| Workload | RTX 4090 | RTX 5080 | Winner |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | 185 t/s | 4090 +7% |
| Llama 3.1 8B FP4 (MX) decode b1 | n/a | 210 t/s | 5080 (FP4 only) |
| Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 980 t/s | 4090 +12% |
| Qwen 2.5 14B AWQ decode b1 | 135 t/s | 125 t/s | 4090 +8% |
| Qwen 2.5 32B AWQ decode b1 | 65 t/s | OOM (16k ctx) | 4090 only |
| Llama 70B AWQ INT4 decode b1 | 22-24 t/s | OOM | 4090 only |
| SDXL 1024×1024 30-step | 2.0s | 2.2s | 4090 +10% |
| FLUX.1-dev FP8 30-step | 4.1s | 4.4s | 4090 +7% |
| FLUX.1-dev FP16 30-step | 6.2s | OOM | 4090 only |
| Whisper large-v3-turbo INT8 | 80x RT | 72x RT | 4090 +11% |
For workloads both cards can run, the 4090 is consistently 7-12% faster — the larger die and slightly higher bandwidth more than compensate for the 5080’s tensor-core generation. The interesting cases are when 16GB is the limit.
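The batch-1 decode figures are easy to reproduce against your own endpoint. A minimal sketch using the OpenAI-compatible API that vLLM exposes; it counts one token per streamed chunk, which is approximate, and the model name is a placeholder you should replace with whatever the server is actually running.

```python
import time
from openai import OpenAI

# Measure single-stream TTFT and decode rate against a local vLLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

t0 = time.perf_counter()
first = None
tokens = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder: match your --model
    messages=[{"role": "user", "content": "Explain KV caching in 300 words."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        tokens += 1   # roughly one token per streamed chunk

end = time.perf_counter()
print(f"TTFT: {first - t0:.2f}s, decode: {tokens / (end - first):.1f} t/s")
```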
Power, price and £/token
| Metric | RTX 4090 | RTX 5080 |
|---|---|---|
| TDP | 450W | 360W |
| Sustained LLM b32 | 360W | 290W |
| Tokens/Joule (Llama 8B FP8 b32) | 3.05 | 3.38 |
| UK price (typical 2026) | £1,300 | £1,050 |
| £/aggregate t/s (b32) | £1.18 | £1.07 |
| £/GB VRAM | £54 | £66 |
| Annual electricity @ 24/7 £0.18/kWh | £568 | £457 |
£/token at batch 32 marginally favours the 5080 — but only for models that fit. The £/GB-VRAM number is brutal for the 5080. See the tokens-per-watt analysis and monthly hosting cost.
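Every derived number in that table falls out of four inputs: price, sustained draw, aggregate throughput and VRAM. A quick worked check (the inputs are the table’s own figures; expect a hundredth or so of rounding drift on tokens/Joule):

```python
# Reproduce the derived rows of the cost table from its own inputs.
cards = {
    "RTX 4090": {"price": 1300, "watts": 360, "agg_tps": 1100, "vram": 24},
    "RTX 5080": {"price": 1050, "watts": 290, "agg_tps": 980,  "vram": 16},
}
rate = 0.18  # £/kWh

for name, c in cards.items():
    tokens_per_joule = c["agg_tps"] / c["watts"]     # (t/s) / (J/s) = t/J
    price_per_tps = c["price"] / c["agg_tps"]        # £ per aggregate t/s
    price_per_gb = c["price"] / c["vram"]            # £ per GB of VRAM
    annual_kwh = c["watts"] / 1000 * 24 * 365        # 24/7 draw for a year
    print(f"{name}: {tokens_per_joule:.2f} tok/J, £{price_per_tps:.2f} per t/s, "
          f"£{price_per_gb:.0f}/GB, £{annual_kwh * rate:.0f}/yr electricity")
```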
Per-workload winner table
| Workload | Winner | Reason |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B | 4090 | VRAM headroom for batch growth |
| Solo dev workstation, Llama 8B chat | 5080 | Cheaper, lower power, FP4 path |
| 12-engineer Qwen Coder 32B AWQ | 4090 | 5080 OOMs at long context |
| Llama 70B INT4 single-tenant | 4090 | 5080 cannot fit |
| FLUX.1-dev studio | 4090 | FP16 path needs 24GB |
| SDXL freelance studio | 5080 | Both fit, 5080 cheaper |
| Voice agent (Whisper + 8B LLM) | 5080 | Both fit, 5080 cheaper to run |
| Mixtral 8x7B endpoint | 4090 | 5080 cannot fit (24 GB weights) |
| Multi-tenant 8B FP8 endpoint | 4090 | More concurrent KV slots |
| Capex-bounded MVP under £1,200 | 5080 | Cheapest Blackwell with FP4 |
vLLM serving examples
```bash
# RTX 4090 — Qwen 2.5 32B AWQ at 16k context, fits comfortably
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94
```
```bash
# RTX 5080 — same model would OOM at 16k.
# Drop to 8k context and pray, or switch to Qwen 14B AWQ:
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq_marlin --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 8 \
  --gpu-memory-utilization 0.92
```
```bash
# RTX 5080 — Llama 8B with 4-bit weights at 32k context.
# The checkpoint below is W4A16 INT4; vLLM detects its quantization from the
# model config, so no --quantization flag is needed. Exercising the 5080's
# headline FP4 tensor cores needs an NVFP4/MX-FP4 quantized checkpoint instead.
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 --max-num-seqs 16
```
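All three servers expose the same OpenAI-compatible API, so the smoke test is identical whichever card sits behind it. A minimal sketch with the openai Python client; vLLM accepts any string as the API key unless one was configured.

```python
from openai import OpenAI

# Smoke-test whichever vLLM container is listening on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

served = client.models.list().data[0].id   # whatever --model the server loaded
resp = client.chat.completions.create(
    model=served,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(served, "->", resp.choices[0].message.content)
```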
Production gotchas
- 16GB pretends to be enough until it isn’t. Many tutorials show “Llama 70B fits in 16GB!” — they mean GGUF Q3, which loses meaningful quality. AWQ INT4 needs 21+ GB.
- FP4 quality drops on coding models. Validate Qwen Coder 14B FP4 against your eval set before shipping; 1-2 HumanEval points commonly disappear.
- 5080 has narrower bus. The 256-bit memory bus means certain bandwidth-sensitive kernels (BGE-large embeddings at large batch) run slower than the 4090’s 384-bit despite both showing ~1 TB/s peak.
- Multi-tenant batching is more constrained on the 5080. Less KV-cache headroom means a lower achievable `--max-num-seqs` for the same model. Plan for ~30% fewer concurrent sessions (see the sketch after this list).
- Driver 555+ required. Same as the 5090.
- Future-proofing trap. “Newer architecture” is comforting but useless when your model needs 21 GB and you have 16. Buy for the model, not the press release.
- Resale market thinner. 5080s are newer; secondary supply is shallow. The 4090 has a deep used market for upgrades or replacements.
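The concurrency hit is straightforward to estimate for your own model: whatever VRAM is left after weights and fixed overhead is the KV budget, divided by per-sequence KV size. A rough sketch, assuming FP8 KV, the Llama 3.1 8B shape (32 layers, 8 KV heads, head dim 128), ~8.5 GB of FP8 weights and ~1.5 GiB of fixed overhead; the exact penalty depends heavily on how much of the card the weights consume, and vLLM’s own scheduler reserve will shave a little more off both cards.

```python
# Estimate how many full-context sequences fit in each card's KV budget.
GiB = 1024**3

def max_seqs(vram_gib, weights_gib, context, layers=32, kv_heads=8,
             head_dim=128, kv_bytes=1, overhead_gib=1.5):
    kv_per_seq = 2 * layers * kv_heads * head_dim * kv_bytes * context  # K and V
    budget = (vram_gib - weights_gib - overhead_gib) * GiB
    return int(budget // kv_per_seq)

for vram in (24, 16):
    print(f"{vram} GB card: ~{max_seqs(vram, 8.5, context=16_384)} "
          f"concurrent 16k sequences")
```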
Verdict
- Pick the RTX 4090 24GB if you need to serve Llama 70B INT4, Qwen 32B at 16k+, Mixtral 8x7B, FLUX.1-dev FP16, or any multi-tenant workload where KV headroom matters. See the 4090 or 5080 decision guide.
- Pick the RTX 5080 16GB if your model is 8-14B, you want the lowest electricity bill, you specifically want FP4 throughput, or you are price-bound at £1,050.
- Pick neither if you need 32GB+ on a single card — go to RTX 5090 32GB or RTX 6000 Pro 96GB.
For a 200-MAU SaaS on Llama 8B, both work, but the 4090 gives you the option to upgrade the model to Mistral Small 3 24B without a hardware swap. For a 12-engineer coding team running Qwen Coder 32B, the 5080 is a non-starter.
VRAM is the spec that matters most
GigaGPU’s UK dedicated hosting puts you on a 24GB RTX 4090 with the headroom to grow your model and your batch size without re-provisioning.
Order the RTX 4090 24GB
See also: vs RTX 5090 32GB, vs RTX 5060 Ti 16GB, RTX 4090 spec breakdown, 2026 tier positioning, Llama 70B INT4 deployment, FP8 tensor cores on Ada, 4090 or 5080 decision.