The RTX 5080 16GB is the newer card, with Blackwell GB203 silicon, GDDR7 and 5th-gen FP8 tensor cores, but it ships with one-third less VRAM than the RTX 4090 24GB. For Llama 70B AWQ INT4 that gap is decisive – the 4090 fits the model, the 5080 does not. For Llama 8B FP8 chat APIs they trade blows, with the 5080 winning on watts-per-token and the 4090 winning on raw throughput per card. This guide walks through the decision with concrete numbers and a 10-workload winner table; both cards are available via UK dedicated 4090 hosting and the broader GigaGPU range.
Contents
- Spec sheet
- Throughput head-to-head
- Model fit and the 8GB gap
- Cost-per-token and watts-per-token
- Per-workload winner (10 workloads)
- Three production scenarios
- Production gotchas
- Verdict and when each card wins
Spec sheet
| Spec | RTX 4090 24GB | RTX 5080 16GB |
|---|---|---|
| Architecture | Ada AD102 | Blackwell GB203 |
| CUDA cores | 16,384 | 10,752 |
| Tensor cores | 512 (4th gen) | 336 (5th gen) |
| VRAM | 24GB GDDR6X | 16GB GDDR7 |
| Bandwidth | 1,008 GB/s | 960 GB/s |
| TDP | 450W | 360W |
| FP8 generation | 4th gen | 5th gen |
| FP4 native | No | Yes |
| FP16 TFLOPS dense | 165 | ~150 |
| PCIe | Gen4 x16 | Gen5 x16 |
| Launch year | 2022 | 2025 |
| Approx UK dedicated £/mo | £550 | £475 |
Throughput head-to-head
The 5080 has fewer CUDA cores but newer tensor cores, slightly less memory bandwidth, and a meaningfully lower TDP. Net result: small workloads are very close in absolute t/s, large workloads tilt toward the 4090 because of VRAM, and FP8 small models go to the 5080 by a clear margin on per-watt efficiency. The 4090 retains a 5-15% raw throughput edge on most chat workloads thanks to its higher CUDA core count.
| Workload | RTX 4090 | RTX 5080 | Winner (raw) |
|---|---|---|---|
| Llama 3.1 8B FP8, batch 1 | 198 t/s | 185 t/s | 4090 (just) |
| Llama 3.1 8B FP8, concurrency 8 | ~1,100 t/s aggregate | ~960 t/s aggregate | 4090 |
| Llama 3.1 8B FP8, concurrency 32 | ~1,800 t/s aggregate | ~1,500 t/s aggregate | 4090 |
| Llama 3.1 70B AWQ INT4 | 22 t/s | OOM | 4090 only |
| Qwen 2.5 14B FP8 | 120 t/s | 108 t/s | 4090 |
| Mistral 7B FP8, batch 1 | 220 t/s | 215 t/s | Tied |
| SDXL 1024×1024, 30 steps | 3.4 s/image | 3.6 s/image | 4090 |
| Flux.1 Dev 1024×1024 | 14 s/image | 15 s/image | 4090 (5080 needs offload) |
| Whisper Large v3, 1 hr audio | 22 s | 24 s | 4090 |
| Tokens/s per watt (8B FP8) | 0.44 | 0.51 | 5080 |
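These figures depend heavily on the serving stack and batch shape, so treat them as directional. If you want to reproduce the 8B FP8 rows on your own hardware, a minimal vLLM probe looks like the sketch below; the model ID, prompt, and sampling settings are assumptions, not the exact configuration behind the table.

```python
# Minimal single-GPU throughput probe with vLLM (assumed harness, not the
# exact benchmark configuration behind the table above).
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model ID
    quantization="fp8",                        # FP8 weights; needs Ada or Blackwell
    max_model_len=4096,
)

prompts = ["Summarise the history of the Transformer architecture."] * 8
params = SamplingParams(temperature=0.7, max_tokens=512)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all 8 concurrent requests.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} t/s aggregate")
```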
Model fit and the 8GB gap
The 8GB VRAM gap is what actually drives this decision. The 4090 holds Llama 70B AWQ INT4 (~17GB weights) with room left for an FP8 KV cache. The 5080 cannot run any 70B-class model in any quantisation that preserves quality, and Mixtral 8x7B AWQ INT4 (~25GB) is equally impossible on it. Qwen 14B FP8 fits on the 5080, but with no headroom for KV cache.
| Model | 4090 24GB | 5080 16GB |
|---|---|---|
| Llama 3.1 8B FP8 (4k context) | Fits, 16GB free for KV | Fits, 8GB free for KV |
| Llama 3.1 8B FP8 (32k context) | Comfortable | Tight |
| Llama 3.1 8B FP16 | Fits | Tight (~16GB total) |
| Qwen 2.5 14B FP8 | Fits with KV | Tight, low concurrency only |
| Llama 3.1 70B AWQ INT4 | Fits | OOM |
| Mixtral 8x7B AWQ | Tight (~25GB) | OOM |
| SDXL + refiner | Fits | Fits with offload |
| Flux.1 Dev BF16 | Fits with offload | OOM without aggressive offload |
| Stable Video Diffusion | Fits | Tight, offload needed |
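You can sanity-check fit before provisioning with a back-of-envelope estimate of weights plus KV cache. The sketch below uses Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128); the 1GB runtime overhead figure is an assumption.

```python
# Back-of-envelope VRAM estimate: weights + KV cache + assumed runtime overhead.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, batch: int, bytes_per_elem: float = 1.0) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * batch."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1e9

def weights_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8  # billions of params -> GB

OVERHEAD_GB = 1.0  # assumed CUDA context + activation headroom

for context, batch in [(4_096, 1), (32_768, 1), (8_192, 8)]:
    total = (weights_gb(8, 8)                                # 8B model, FP8 weights
             + kv_cache_gb(32, 8, 128, context, batch, 1.0)  # FP8 KV cache
             + OVERHEAD_GB)
    verdict = "fits" if total <= 16 else "OOM"
    print(f"ctx={context:>6} batch={batch}: {total:5.1f} GB -> 5080 16GB: {verdict}")
```

The same arithmetic explains the table: an 8B FP8 model leaves the 5080 roughly 8GB for KV, which is comfortable at 4k context and batch 1 but gets eaten quickly by long contexts or high concurrency.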
Cost-per-token and watts-per-token
Assume £550/month for a 4090 and £475/month for a 5080. The 4090 produces marginally more throughput on FP8 chat, so cost-per-token is close on small models. On 70B work the 5080 cannot compete because it cannot run the model. On per-watt efficiency the 5080 wins by roughly 15% across small-model workloads.
| Workload | 4090 £/Mtok | 5080 £/Mtok | 4090 kWh/Mtok | 5080 kWh/Mtok | Winner (£/tok) |
|---|---|---|---|---|---|
| Llama 8B FP8 chat 24/7 conc 8 | £0.039 | £0.034 | 0.061 | 0.054 | 5080 |
| Qwen 14B FP8 chat 24/7 conc 8 | £0.063 | £0.063 | 0.10 | 0.097 | Tied |
| Llama 70B AWQ INT4 conc 4 | £0.34 | n/a | 0.66 | n/a | 4090 only |
| Mistral 7B FP8 24/7 | £0.034 | £0.030 | 0.054 | 0.046 | 5080 |
| SDXL £/image queue | £0.0009 | £0.0009 | 0.0014 | 0.0013 | Tied |
| Flux.1 Dev £/image | £0.0036 | £0.0040 | 0.0056 | 0.0058 | 4090 |
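To run these numbers with your own inputs, the arithmetic is a few lines. The price, sustained throughput, and average board power below are assumptions, and your utilisation will move the results substantially.

```python
# Cost-per-million-tokens and energy-per-million-tokens from assumed monthly
# rental price, sustained aggregate throughput, and average board power.

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes 24/7 utilisation

def pounds_per_mtok(monthly_gbp: float, tokens_per_sec: float) -> float:
    tokens_per_month = tokens_per_sec * SECONDS_PER_MONTH
    return monthly_gbp / (tokens_per_month / 1e6)

def kwh_per_mtok(avg_watts: float, tokens_per_sec: float) -> float:
    seconds_per_mtok = 1e6 / tokens_per_sec
    return avg_watts * seconds_per_mtok / 3.6e6  # joules -> kWh

# Assumed sustained figures for Llama 8B FP8 chat at concurrency 8:
for name, price, tps, watts in [("4090", 550, 1100, 450), ("5080", 475, 960, 360)]:
    print(f"{name}: £{pounds_per_mtok(price, tps):.3f}/Mtok, "
          f"{kwh_per_mtok(watts, tps):.3f} kWh/Mtok")
```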
Per-workload winner (10 workloads)
| Workload | 4090 wins | 5080 wins | Why |
|---|---|---|---|
| Llama 8B FP8 chat, cost-bound | No | Yes | 5080 cheaper £/token |
| Llama 70B AWQ INT4 single card | Yes | No | 5080 OOM |
| Mixtral 8x7B AWQ | Yes | No | 5080 OOM |
| Qwen 14B FP8 high concurrency | Yes | No | 5080 KV-bound |
| SDXL image gen high volume | Yes | No | 5080 slower per image |
| Flux.1 Dev image gen | Yes | No | 5080 needs aggressive offload |
| Watts-bound deployment (datacentre) | No | Yes | 5080 360W vs 4090 450W |
| FP4 quantised inference | No | Yes | 4090 no native FP4 |
| Mixed inference (LLM + image + audio) | Yes | No | 4090 VRAM headroom |
| Future-proof 2-3 year deployment | No | Yes | Newer architecture, longer support |
Three production scenarios
Scenario A: 8B chatbot for a SaaS product
Steady traffic, 20-50 concurrent users during UK business hours, no roadmap to 70B. The 5080 is roughly 12% cheaper per token and 13% more efficient per watt. Pick the 5080 unless you need future headroom for larger models. Cross-reference SaaS RAG sizing and concurrent users.
Scenario B: RAG service that needs Llama 70B
Open-weight Llama 70B AWQ INT4 is the quality target for substantive document QA. Only the 4090 can run the model on a single card. Pick 4090, no contest. See 70B INT4 deployment.
Scenario C: Mixed workload (8B inference + occasional Mixtral)
Primary workload is 8B FP8 chat, but you occasionally need Mixtral 8x7B for analytical work. Mixtral does not fit on the 5080. Pick the 4090 unless you can route Mixtral to a separate host (sketched below).
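If you do split the fleet, routing by model name is straightforward against OpenAI-compatible servers such as vLLM's. A minimal sketch, with placeholder hostnames and model IDs:

```python
# Minimal model router: send Mixtral traffic to a 4090 host and everything
# else to a 5080 host, via OpenAI-compatible endpoints (e.g. vLLM's server).
# Hostnames and model IDs are placeholders, not real endpoints.
from openai import OpenAI

BACKENDS = {
    "mistralai/Mixtral-8x7B-Instruct-v0.1": "http://gpu-4090.internal:8000/v1",
    "meta-llama/Llama-3.1-8B-Instruct": "http://gpu-5080.internal:8000/v1",
}

def complete(model: str, prompt: str) -> str:
    client = OpenAI(base_url=BACKENDS[model], api_key="unused")  # vLLM ignores the key
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("meta-llama/Llama-3.1-8B-Instruct", "Ping?"))
```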
Production gotchas
- 5080’s 16GB is a hard ceiling on the model menu. Any roadmap that touches 14B+ FP8 with reasonable context, 70B AWQ, or Mixtral hits a wall. Plan for the largest model in your 18-month roadmap.
- Flux.1 Dev needs aggressive CPU offload on 5080. Per-image latency rises 30-50% over a 4090 once offload kicks in. For high-volume image queues this matters (see the offload sketch after this list).
- 5080 needs CUDA 12.8+ and recent inference stacks. Same Blackwell tooling caveats as 5090. Pin container versions carefully.
- Per-watt advantage erodes under sustained load. The 5080’s 360W TDP is real, but with continuous batching at high concurrency both cards run near peak. The efficiency gap shrinks from 15% to 5-8% in production.
- FP4 still maturing in 2026. The 5080’s FP4 silicon is real but model coverage is uneven. Do not assume FP4 throughput on day one.
- 4090 mature and well-supported. Every inference framework has battle-tested 4090 paths. New Blackwell features sometimes have rough edges in early releases.
- Resale and lease economics. 4090 pricing has stabilised on the secondary market; 5080 supply remains constrained. Affects buy-vs-rent calculus.
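On the Flux offload point: with diffusers, CPU offload is a one-line switch, and it is that switch that costs the 30-50% per-image latency on the 5080. A minimal sketch, assuming the published black-forest-labs/FLUX.1-dev weights (step count and guidance are assumptions, not the table's exact settings):

```python
# Flux.1 Dev on a 16GB card via CPU offload (diffusers). Offload keeps only the
# active submodule on the GPU, which is what adds the per-image latency penalty.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # required on a 16GB 5080; skip on a 24GB 4090

image = pipe(
    "a lighthouse at dusk, volumetric light",
    height=1024, width=1024,
    num_inference_steps=30,  # assumed; the table's exact settings aren't specified
    guidance_scale=3.5,
).images[0]
image.save("flux_dev_1024.png")
```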
Verdict and when each card wins
The 5080 is a solid card for sub-16GB workloads where watts-per-token and £/token both matter and the model menu is narrow (8B FP8, Mistral 7B, Qwen up to 14B at modest context). The moment your roadmap touches a 70B-class open model, Mixtral 8x7B, image generation at scale, or any workload that wants 24GB of working memory, the extra 8GB on the 4090 is worth more than the architectural step from Ada to Blackwell. For most teams running open-weight inference in 2026 with any meaningful model variety, the 4090’s VRAM headroom still wins. Order via GigaGPU dedicated hosting.
24GB beats 16GB on real models
Llama 70B AWQ INT4, Mixtral 8x7B AWQ, Qwen 14B FP8 with full KV – all fit on the 4090’s 24GB where the 5080’s 16GB cannot hold them. UK dedicated hosting.
Order the RTX 4090 24GB.

See also: 4090 vs 5080 spec deep-dive, Llama 70B INT4 benchmark, spec breakdown, 70B INT4 VRAM, 5090 decision, 5060 Ti decision, 3090 decision, FP8 tensor cores, tier positioning 2026, tokens per watt, power draw efficiency, Llama 8B benchmark, FP8 deployment, SaaS RAG, concurrent users, 70B INT4 deployment.