The RTX 5080 16GB is Blackwell’s mid-range. On paper it has 5th-generation tensor cores, native FP4, GDDR7 at 30 Gbps and PCIe Gen 5 — every architectural advantage the 4090 lacks. But it ships with 16GB of VRAM, two-thirds of the RTX 4090 24GB’s buffer. For AI workloads on UK GPU hosting, that VRAM gap rules out an entire tier of models — Llama 3.1 70B INT4, Qwen 2.5 32B AWQ at long context, FLUX.1-dev with cached encoders. This deep-dive explains exactly which workloads the newer 5080 wins, where 16GB stops you cold, and how to think about the tradeoff for a typical 2026 production deployment.
Contents
- Spec sheet side by side
- The 16GB wall — what does not fit
- 5th-gen tensor cores and what they buy you
- Per-workload throughput comparison
- Power, price and £/token
- Per-workload winner table
- vLLM serving examples
- Production gotchas
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | RTX 5080 (Blackwell GB203) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC 4NP | Refined |
| SM count | 128 | 84 | -34% |
| CUDA cores | 16,384 | 10,752 | -34% |
| Tensor cores | 512 (4th gen, FP8) | 336 (5th gen, FP8 + FP4) | -34%, +FP4 |
| Boost clock | 2.52 GHz | 2.62 GHz | +4% |
| VRAM | 24 GB GDDR6X (21 Gbps) | 16 GB GDDR7 (30 Gbps) | -33% capacity |
| Memory bandwidth | 1008 GB/s | 960 GB/s | -5% |
| Memory bus | 384-bit | 256-bit | Narrower |
| L2 cache | 72 MB | ~64 MB | -11% |
| FP16 dense TFLOPS | 165 | ~135 | -18% |
| FP8 TFLOPS (sparse) | 660 | ~540 | -18% |
| FP4 TFLOPS (sparse) | None | ~1080 | New |
| TDP | 450W | 360W | -20% |
| PCIe | Gen 4 x16 | Gen 5 x16 | 2x effective |
The 5080 is a smaller die and a smaller card. Bandwidth is essentially tied. The 4090 has 22% more raw FP8 throughput — but the 5080 supports FP4 at twice that. For workloads that fit in 16GB, the 5080 is competitive. For workloads that need 24GB, no Blackwell tensor-core advantage matters. See the RTX 4090 spec breakdown for the Ada side.
The 16GB wall — what does not fit
16GB is enough for any 7-9B model at FP8 (BF16 is already marginal at 8B, where the weights alone are ~16 GB), and any 14B model at AWQ INT4 with healthy context. Beyond that, you hit a wall fast. Llama 3.1 70B AWQ INT4 weighs 17 GB on disk and needs another 4-6 GB for KV cache and activations — it simply does not load on the 5080. Qwen 2.5 32B AWQ at 8k context fits with about 1 GB to spare; raise context to 16k and you are over the limit. FLUX.1-dev in FP16 peaks at 22 GB during the joint attention pass; you must run FP8 (or accept CPU offload, which murders latency).
| Model / configuration | RTX 4090 24GB | RTX 5080 16GB |
|---|---|---|
| Llama 3.1 8B FP8 + 16k FP8 KV | Comfortable | Comfortable |
| Llama 3.1 8B FP8 + 64k FP8 KV | Tight | OOM |
| Qwen 2.5 14B AWQ + 16k context | Comfortable | Tight |
| Qwen 2.5 32B AWQ + 8k context | Comfortable | Tight |
| Qwen 2.5 32B AWQ + 16k context | Tight | OOM |
| Mixtral 8x7B AWQ | Comfortable | OOM (24 GB) |
| Llama 3.1 70B AWQ INT4 | Tight (17+5 GB) | OOM |
| FLUX.1-dev FP16 30-step | Comfortable | OOM (22 GB peak) |
| FLUX.1-dev FP8 30-step | Comfortable | Comfortable |
| SDXL + Refiner cached | Comfortable | Tight |
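Most of these fit calls can be sanity-checked with a back-of-envelope sum: quantized weight size, plus KV cache, plus a fixed allowance for activations and CUDA context. The sketch below is a heuristic, not a measurement: the 1.5 GiB overhead figure and the ~8.5 GB FP8 weight size are assumptions, and the Llama 3.1 8B shape (32 layers, 8 KV heads, head dim 128) is taken from the published config.

```python
# Back-of-envelope VRAM fit: weights + KV cache + fixed overhead.
GiB = 1024**3

def kv_cache_gib(layers, kv_heads, head_dim, context, batch=1, kv_bytes=1):
    """K and V cache size in GiB; kv_bytes=1 for FP8, 2 for FP16."""
    return 2 * layers * kv_heads * head_dim * kv_bytes * context * batch / GiB

def headroom_gib(vram_gib, weights_gib, **kv_args):
    """VRAM left after weights, KV cache and ~1.5 GiB of fixed overhead."""
    return vram_gib - weights_gib - 1.5 - kv_cache_gib(**kv_args)

# Llama 3.1 8B FP8 (~8.5 GB weights; 32 layers, 8 KV heads, head_dim 128)
for vram in (24, 16):
    h = headroom_gib(vram, 8.5, layers=32, kv_heads=8, head_dim=128,
                     context=16_384)
    print(f"{vram} GB card @ 16k context: {h:.1f} GiB headroom")
```

At 16k context both cards keep several GiB spare, which matches the first row of the table; push the context or the batch up and the 16GB column runs out first.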
5th-gen tensor cores and what they buy you
5th-gen tensor cores add FP4 (E2M1 and MX-FP4 microscaling) and bump the Transformer Engine to handle automatic precision selection per layer. For models that fit in 16GB, the 5080 can use FP4 to roughly halve effective weight memory traffic, recovering throughput where the 4090’s FP8 path is bandwidth-bound. On Llama 3.1 8B in MX-FP4, the 5080 hits ~210 t/s decode batch 1 versus the 4090’s 198 t/s in FP8 — essentially tied, despite the 5080’s smaller die. On Qwen 2.5 14B AWQ (4-bit weights but FP16 KV), the picture flips back: the 5080 is ~125 t/s vs the 4090’s 135, because the AWQ kernel does not benefit from FP4. Quality-wise, FP4 holds for Llama and Mistral families but loses 1-2 points on coding-heavy benchmarks (HumanEval) for Qwen Coder, so production deployments still ship FP8 by default. See FP8 on Ada for the architectural counterpart.
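If the same deployment image runs on mixed 4090/5080 fleets, the FP4-capable path has to be picked at runtime. A minimal sketch, assuming the usual compute-capability values (Ada reports 8.9, consumer Blackwell 12.0); the precision choices it prints are illustrative defaults, not recommendations from the benchmarks above.

```python
import torch

# Choose a serving precision from the card's compute capability.
# RTX 4090 (Ada) reports (8, 9); RTX 5080/5090 (consumer Blackwell) report (12, 0).
assert torch.cuda.is_available(), "no CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)

if (major, minor) >= (12, 0):
    plan = "FP4 weights (NVFP4/MX-FP4 checkpoint) + FP8 KV cache"
elif (major, minor) >= (8, 9):
    plan = "FP8 weights + FP8 KV cache"
else:
    plan = "AWQ/GPTQ INT4 weights + FP16 KV cache"

print(f"{name} (sm_{major}{minor}): serve with {plan}")
```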
Per-workload throughput comparison
| Workload | RTX 4090 | RTX 5080 | Winner |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | 185 t/s | 4090 +7% |
| Llama 3.1 8B FP4 (MX) decode b1 | n/a | 210 t/s | 5080 (FP4 only) |
| Llama 3.1 8B FP8 batch 32 agg | 1100 t/s | 980 t/s | 4090 +12% |
| Qwen 2.5 14B AWQ decode b1 | 135 t/s | 125 t/s | 4090 +8% |
| Qwen 2.5 32B AWQ decode b1 | 65 t/s | OOM (16k ctx) | 4090 only |
| Llama 70B AWQ INT4 decode b1 | 22-24 t/s | OOM | 4090 only |
| SDXL 1024×1024 30-step | 2.0s | 2.2s | 4090 +10% |
| FLUX.1-dev FP8 30-step | 4.1s | 4.4s | 4090 +7% |
| FLUX.1-dev FP16 30-step | 6.2s | OOM | 4090 only |
| Whisper large-v3-turbo INT8 | 80x RT | 72x RT | 4090 +11% |
For workloads both cards can run, the 4090 is consistently 7-12% faster — the larger die and slightly higher bandwidth more than compensate for the 5080’s tensor-core generation. The interesting cases are when 16GB is the limit.
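The batch-1 decode figures are easy to reproduce against your own endpoint. A minimal sketch using the OpenAI-compatible API that vLLM exposes; it counts one token per streamed chunk, which is approximate, and the model name is a placeholder you should replace with whatever the server is actually running.

```python
import time
from openai import OpenAI

# Measure single-stream TTFT and decode rate against a local vLLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

t0 = time.perf_counter()
first = None
tokens = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder: match your --model
    messages=[{"role": "user", "content": "Explain KV caching in 300 words."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        tokens += 1   # roughly one token per streamed chunk

end = time.perf_counter()
print(f"TTFT: {first - t0:.2f}s, decode: {tokens / (end - first):.1f} t/s")
```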
Power, price and £/token
| Metric | RTX 4090 | RTX 5080 |
|---|---|---|
| TDP | 450W | 360W |
| Sustained LLM b32 | 360W | 290W |
| Tokens/Joule (Llama 8B FP8 b32) | 3.05 | 3.38 |
| UK price (typical 2026) | £1,300 | £1,050 |
| £/aggregate t/s (b32) | £1.18 | £1.07 |
| £/GB VRAM | £54 | £66 |
| Annual electricity @ 24/7 £0.18/kWh | £568 | £457 |
£/token at batch 32 marginally favours the 5080 — but only for models that fit. The £/GB-VRAM number is brutal for the 5080. See the tokens-per-watt analysis and monthly hosting cost.
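Every derived number in that table falls out of four inputs: price, sustained draw, aggregate throughput and VRAM. A quick worked check (the inputs are the table’s own figures; expect a hundredth or so of rounding drift on tokens/Joule):

```python
# Reproduce the derived rows of the cost table from its own inputs.
cards = {
    "RTX 4090": {"price": 1300, "watts": 360, "agg_tps": 1100, "vram": 24},
    "RTX 5080": {"price": 1050, "watts": 290, "agg_tps": 980,  "vram": 16},
}
rate = 0.18  # £/kWh

for name, c in cards.items():
    tokens_per_joule = c["agg_tps"] / c["watts"]     # (t/s) / (J/s) = t/J
    price_per_tps = c["price"] / c["agg_tps"]        # £ per aggregate t/s
    price_per_gb = c["price"] / c["vram"]            # £ per GB of VRAM
    annual_kwh = c["watts"] / 1000 * 24 * 365        # 24/7 draw for a year
    print(f"{name}: {tokens_per_joule:.2f} tok/J, £{price_per_tps:.2f} per t/s, "
          f"£{price_per_gb:.0f}/GB, £{annual_kwh * rate:.0f}/yr electricity")
```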
Per-workload winner table
| Workload | Winner | Reason |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B | 4090 | VRAM headroom for batch growth |
| Solo dev workstation, Llama 8B chat | 5080 | Cheaper, lower power, FP4 path |
| 12-engineer Qwen Coder 32B AWQ | 4090 | 5080 OOMs at long context |
| Llama 70B INT4 single-tenant | 4090 | 5080 cannot fit |
| FLUX.1-dev studio | 4090 | FP16 path needs 24GB |
| SDXL freelance studio | 5080 | Both fit, 5080 cheaper |
| Voice agent (Whisper + 8B LLM) | 5080 | Both fit, 5080 cheaper to run |
| Mixtral 8x7B endpoint | 4090 | 5080 cannot fit (24 GB weights) |
| Multi-tenant 8B FP8 endpoint | 4090 | More concurrent KV slots |
| Capex-bounded MVP under £1,200 | 5080 | Cheapest Blackwell with FP4 |
vLLM serving examples
```bash
# RTX 4090 — Qwen 2.5 32B AWQ at 16k context, fits comfortably
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94
```
```bash
# RTX 5080 — same model would OOM at 16k.
# Drop to 8k context and pray, or switch to Qwen 14B AWQ:
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq_marlin --kv-cache-dtype fp8_e4m3 \
  --max-model-len 16384 --max-num-seqs 8 \
  --gpu-memory-utilization 0.92
```
```bash
# RTX 5080 — Llama 8B with 4-bit weights at 32k context.
# The checkpoint below is W4A16 INT4; vLLM detects its quantization from the
# model config, so no --quantization flag is needed. Exercising the 5080's
# headline FP4 tensor cores needs an NVFP4/MX-FP4 quantized checkpoint instead.
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 --max-num-seqs 16
```
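All three servers expose the same OpenAI-compatible API, so the smoke test is identical whichever card sits behind it. A minimal sketch with the openai Python client; vLLM accepts any string as the API key unless one was configured.

```python
from openai import OpenAI

# Smoke-test whichever vLLM container is listening on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

served = client.models.list().data[0].id   # whatever --model the server loaded
resp = client.chat.completions.create(
    model=served,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(served, "->", resp.choices[0].message.content)
```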
Production gotchas
- 16GB pretends to be enough until it isn’t. Many tutorials show “Llama 70B fits in 16GB!” — they mean GGUF Q3, which loses meaningful quality. AWQ INT4 needs 21+ GB.
- FP4 quality drops on coding models. Validate Qwen Coder 14B FP4 against your eval set before shipping; 1-2 HumanEval points commonly disappear.
- 5080 has narrower bus. The 256-bit memory bus means certain bandwidth-sensitive kernels (BGE-large embeddings at large batch) run slower than the 4090’s 384-bit despite both showing ~1 TB/s peak.
- Multi-tenant batching is more constrained on the 5080. Less KV-cache headroom means a lower achievable `--max-num-seqs` for the same model. Plan for ~30% fewer concurrent sessions (see the sketch after this list).
- Driver 555+ required. Same as the 5090.
- Future-proofing trap. “Newer architecture” is comforting but useless when your model needs 21 GB and you have 16. Buy for the model, not the press release.
- Resale market thinner. 5080s are newer; secondary supply is shallow. The 4090 has a deep used market for upgrades or replacements.
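The concurrency hit is straightforward to estimate for your own model: whatever VRAM is left after weights and fixed overhead is the KV budget, divided by per-sequence KV size. A rough sketch, assuming FP8 KV, the Llama 3.1 8B shape (32 layers, 8 KV heads, head dim 128), ~8.5 GB of FP8 weights and ~1.5 GiB of fixed overhead; the exact penalty depends heavily on how much of the card the weights consume, and vLLM’s own scheduler reserve will shave a little more off both cards.

```python
# Estimate how many full-context sequences fit in each card's KV budget.
GiB = 1024**3

def max_seqs(vram_gib, weights_gib, context, layers=32, kv_heads=8,
             head_dim=128, kv_bytes=1, overhead_gib=1.5):
    kv_per_seq = 2 * layers * kv_heads * head_dim * kv_bytes * context  # K and V
    budget = (vram_gib - weights_gib - overhead_gib) * GiB
    return int(budget // kv_per_seq)

for vram in (24, 16):
    print(f"{vram} GB card: ~{max_seqs(vram, 8.5, context=16_384)} "
          f"concurrent 16k sequences")
```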
Verdict
- Pick the RTX 4090 24GB if you need to serve Llama 70B INT4, Qwen 32B at 16k+, Mixtral 8x7B, FLUX.1-dev FP16, or any multi-tenant workload where KV headroom matters. See the 4090 or 5080 decision guide.
- Pick the RTX 5080 16GB if your model is 8-14B, you want the lowest electricity bill, you specifically want FP4 throughput, or you are price-bound at £1,050.
- Pick neither if you need 32GB+ on a single card — go to RTX 5090 32GB or RTX 6000 Pro 96GB.
For a 200-MAU SaaS on Llama 8B, both work, but the 4090 gives you the option to upgrade the model to Mistral Small 3 24B without a hardware swap. For a 12-engineer coding team running Qwen Coder 32B, the 5080 is a non-starter.
VRAM is the spec that matters most
GigaGPU’s UK dedicated hosting puts you on a 24GB RTX 4090 with the headroom to grow your model and your batch size without re-provisioning.
Order the RTX 4090 24GB
See also: vs RTX 5090 32GB, vs RTX 5060 Ti 16GB, RTX 4090 spec breakdown, 2026 tier positioning, Llama 70B INT4 deployment, FP8 tensor cores on Ada, 4090 or 5080 decision.