Overview: The Core Trade-off
The RTX 5080 16GB and the RTX 3090 24GB sit on opposite sides of a trade-off that defines a lot of 2026 GPU shopping decisions: newer compute or more VRAM? The 5080 is a Blackwell card built on the GB203 die, with 5th-gen tensor cores that include native FP4 support, GDDR7 memory, PCIe Gen 5 and NVENC with AV1 encode. It is roughly 60% faster than the 3090 in raw FP32, and several multiples faster in low-precision tensor throughput thanks to FP8 and FP4 paths the older card simply does not have.
The RTX 3090, by contrast, is an Ampere card on GA102. It launched in September 2020 at $1,499 as Nvidia’s first 24GB prosumer GeForce, and despite being two generations old it remains the cheapest sane way to put 24GB of CUDA-compatible VRAM in a workstation. It has no FP8 or FP4 tensor paths, its NVENC is two generations behind, and it is famously hot-running. But that 24GB remains a hard, decisive advantage in workloads where 16GB simply does not fit.
This post puts the two cards side by side with hard numbers across LLM inference, image generation, fine-tuning and price-per-VRAM. The verdict is workload-dependent, and if you are weighing similar trade-offs further up the stack, our RTX 4090 vs RTX 5090 deep-dive covers the same generational question one tier above.
Spec Sheet, Side by Side
Before we get into benchmarks, the headline silicon difference: Blackwell’s much larger L2 cache and faster GDDR7 give the 5080 better effective memory utilisation per gigabyte, while the 3090 simply has more raw capacity. Both cards are built around the same broad CUDA core count and similar TDPs, so the differences come from architecture, not transistor budget.
| Specification | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| Architecture | Blackwell (GB203) | Ampere (GA102) |
| Process | TSMC 4NP | Samsung 8N |
| Launch | January 2025 | September 2020 |
| CUDA cores | 10,752 | 10,496 |
| Tensor cores | 336 (5th gen, FP4 native) | 328 (3rd gen, FP16/INT8) |
| RT cores | 84 (4th gen) | 82 (2nd gen) |
| Base / Boost clock | 2.30 / 2.62 GHz | 1.40 / 1.70 GHz |
| VRAM | 16 GB GDDR7 | 24 GB GDDR6X |
| Memory bus | 256-bit | 384-bit |
| Memory bandwidth | 960 GB/s | 936 GB/s |
| L2 cache | 64 MB | 6 MB |
| TDP | 360 W | 350 W |
| PCIe | Gen 5 x16 | Gen 4 x16 |
| NVENC | 9th gen (AV1 enc/dec) | 7th gen (no AV1 encode) |
| Display | DisplayPort 2.1b UHBR20 | DisplayPort 1.4a |
| Launch price (USD) | ~$999 | $1,499 |
The L2 cache jump from 6MB to 64MB is the quiet headline. It dramatically reduces how often Blackwell has to round-trip to GDDR7 during attention and matmul, which is part of why the 5080 punches above its theoretical bandwidth advantage. GDDR7 itself is only ~3% faster than the 3090’s GDDR6X on paper, but the cache hierarchy makes the effective gap much wider in practice.
Raw Compute Comparison
Below are dense (non-sparse) tensor throughput figures published by Nvidia, with FP32 measured on standard CUDA paths. Sparse mode roughly doubles the tensor numbers if your model supports 2:4 structured sparsity, which most production LLMs do not.
| Precision | RTX 5080 16GB | RTX 3090 24GB | Speedup (5080) |
|---|---|---|---|
| FP32 (CUDA) | ~56 TFLOPS | ~35 TFLOPS | 1.6x |
| FP16 / BF16 tensor | ~225 TFLOPS | ~142 TFLOPS | 1.6x |
| INT8 tensor | ~450 TOPS | ~284 TOPS | 1.6x |
| FP8 tensor | ~450 TFLOPS | Not supported | n/a |
| FP4 tensor | ~900 TFLOPS | Not supported | n/a |
If your stack lives in FP16 or BF16, the 5080 is around 1.6x faster than a 3090 on raw matmul, which mirrors the FP32 ratio almost exactly. The bigger story is FP8 and FP4: the 3090 simply cannot run them as native tensor ops, so any framework that targets FP8 (Marlin, vLLM’s FP8 KV cache, TensorRT-LLM’s FP8 paths) gets a free 2x or more on the 5080 with no equivalent on the 3090. Our FP8 Llama deployment guide walks through what FP8 actually does in production and why it matters more than the headline TFLOP number suggests.
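To make that concrete, here is a minimal sketch of what the FP8 path looks like in practice, assuming vLLM 0.6+ with its built-in dynamic FP8 quantisation. The model choice and settings are illustrative, not the exact benchmark configuration used later in this post:

```python
# Minimal vLLM FP8 serving sketch (assumes vLLM 0.6+ and an FP8-capable
# GPU such as the RTX 5080; the 3090 has no native FP8 tensor path).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    quantization="fp8",            # dynamic FP8 weight quantisation
    kv_cache_dtype="fp8",          # FP8 KV cache halves the cache footprint too
    max_model_len=2048,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarise the FP8 vs FP16 trade-off."], params)
print(outputs[0].outputs[0].text)
```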
That said, raw TFLOPs only matter if the workload is compute-bound. LLM token generation is memory-bound, image generation is mixed, and full fine-tuning is activation-memory-bound. The compute table is the start of the analysis, not the end.
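A quick roofline-style calculation shows why. At batch size 1, each weight fetched from VRAM is used in a single multiply-add, so the arithmetic intensity of decode is around 1 FLOP per byte, far below what either card needs to keep its tensor cores busy. The figures below are derived from the spec and compute tables above:

```python
# Roofline napkin math: batch-1 decode barely reuses each weight byte,
# so both cards are starved for bandwidth rather than compute.
def arithmetic_intensity(batch: int, bytes_per_weight: float = 2.0) -> float:
    # 2 FLOPs (multiply + add) per weight, per sequence in the batch.
    return 2.0 * batch / bytes_per_weight

def machine_balance(tflops: float, bandwidth_gb_s: float) -> float:
    # FLOPs the card can execute per byte it can fetch from VRAM.
    return tflops * 1e12 / (bandwidth_gb_s * 1e9)

print(arithmetic_intensity(1))        # ~1 FLOP/byte at batch 1
print(machine_balance(225, 960))      # RTX 5080: ~234 FLOPs/byte to saturate
print(machine_balance(142, 936))      # RTX 3090: ~152 FLOPs/byte to saturate
```

Until batching or long prefills push intensity toward those balance points, the extra TFLOPs sit idle, which is why the 3090’s near-identical memory bandwidth keeps it surprisingly competitive at batch 1.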
LLM Inference, Per Model
This is where the VRAM gap bites hardest. A dense LLM’s weights have to fit, plus KV cache that grows linearly with batch size and context length, plus framework overhead. The figures below assume vLLM 0.6+ with paged attention, batch size 1 unless stated, 2k context window, and the highest precision the card can run natively.
| Model | Quantisation | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|---|
| Llama 3.1 8B | FP16 | ~85 tok/s (fits) | ~70 tok/s (fits) |
| Llama 3.1 8B | FP8 (W8A8) | ~140 tok/s | Not native |
| Mistral 7B | FP16 | ~92 tok/s | ~74 tok/s |
| Llama 2 13B | FP16 | OOM (needs ~26GB) | ~38 tok/s |
| Llama 2 13B | AWQ INT4 | ~95 tok/s | ~70 tok/s |
| Qwen 2.5 32B | AWQ INT4 | OOM (16GB too tight) | ~28 tok/s (limited KV cache) |
| Llama 3.1 70B | 2-bit GGUF (~17GB weights) | OOM with KV cache | ~14 tok/s (single user) |
The pattern is clean. For 7B-class models the 5080 is roughly 20-25% faster in FP16 and almost 2x faster if you can drop to FP8. For 13B-class models in INT4 both fit, but the 5080 is faster. From 32B upward, the 16GB ceiling becomes the deciding factor and the 3090 becomes the only option in this price bracket. Llama 3.1 70B is the extreme case: even at an aggressive 2-bit quant the weights alone are around 17GB, so the 5080 has overflowed before any KV cache is allocated, and a standard AWQ INT4 build (~35GB, consistent with the QLoRA figures later in this post) does not fit on either card.
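The OOM rows fall straight out of weight-footprint arithmetic. A hedged rule of thumb, where the effective bits per weight include a rough allowance for quantisation scales and exact sizes vary by format:

```python
# Approximate weight footprint: parameter count times effective bits per weight.
def weight_gb(params_billion: float, bits: float) -> float:
    return params_billion * bits / 8

print(weight_gb(13, 16))    # 13B FP16        -> ~26 GB: OOM on the 5080
print(weight_gb(32, 4.5))   # 32B AWQ INT4    -> ~18 GB: needs the 3090
print(weight_gb(70, 2.0))   # 70B 2-bit GGUF  -> ~17.5 GB: 3090 only, barely
print(weight_gb(70, 4.5))   # 70B AWQ INT4    -> ~39 GB: OOM on both cards
```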
If 70B-class hosting is your goal, see our full Llama 3 VRAM requirements breakdown and the AWQ quantisation guide, both of which apply equally to 24GB Ampere and Ada cards. For deployment patterns that maximise tokens per second per pound, our vLLM production setup guide covers the framework choices that make the most of either card.
Image Generation, Per Model
Diffusion is mixed compute and memory pressure. Stable Diffusion XL fits comfortably in either card; Flux.1 dev FP16 does not fit in 16GB without aggressive offloading and quality-degrading quantisation. Times below are end-to-end with default schedulers, single image, 1024×1024, no batching, ComfyUI as the backend.
| Workload | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| SDXL FP16, 30 steps, 1024² | ~6.0 s | ~8.2 s |
| SDXL Turbo FP16, 4 steps | ~0.9 s | ~1.3 s |
| Flux.1 dev FP16 (~24GB needed) | OOM without offload | ~22 s native |
| Flux.1 dev Q5 GGUF / FP8 | ~12 s | ~17 s |
| Flux.1 schnell FP8, 4 steps | ~0.5 s | ~1.4 s (FP16 fallback) |
| SD3.5 Large FP16 | OOM | ~14 s |
The 5080 wins SDXL by roughly 25-30% per generation, which compounds quickly across thousands of images. It also wins Flux.1 schnell decisively because schnell only needs four steps and benefits from FP8 inference paths the 3090 cannot use. But Flux.1 dev at full FP16 is a 24GB model in practice, and if you want native fidelity rather than a Q5 GGUF approximation you need the 3090. The same applies to SD3.5 Large in FP16.
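The benchmarks above were run through ComfyUI, but the fit constraints are easy to reproduce from Python with diffusers. A minimal sketch, assuming diffusers 0.30+ and the public FLUX.1-schnell weights; on a 16GB card the CPU-offload line is what makes it run at all:

```python
# Flux.1 schnell, 4-step generation with diffusers (illustrative config).
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
# On 16GB cards, stream submodules in and out of VRAM instead of .to("cuda");
# a 24GB card can keep the whole pipeline resident.
pipe.enable_model_cpu_offload()

image = pipe(
    "a rack of GPUs in a datacentre, studio lighting",
    num_inference_steps=4,   # schnell is distilled for 4-step sampling
    guidance_scale=0.0,      # schnell is trained without CFG
    height=1024, width=1024,
).images[0]
image.save("flux_schnell.png")
```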
This mirrors what we see across Ada and Blackwell more broadly. If you are hesitating between newer 16GB cards and older 24GB cards for diffusion specifically, our RTX 4090 spec breakdown and the RTX 5060 Ti FP8 guide give useful upper and lower bounds on the same architectural trade-off.
Training and Fine-tuning
Training is where activation memory dominates and the 16GB limit really starts to hurt. LoRA is fine on either card because it only updates rank-decomposed adapter weights. Full fine-tuning needs the full activations, gradients (1x weight memory) and Adam optimiser states (2x weight memory) all in VRAM at once, as the arithmetic sketch below shows.
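A minimal sketch of that rule of thumb, with two caveats: activation memory comes on top and depends on batch size, sequence length and checkpointing, and this optimistically keeps the Adam states at the same precision as the weights (FP32 states push it higher):

```python
# Full fine-tune state memory: weights + gradients (1x) + Adam moments (2x).
# Activations are excluded; they add more on top of this floor.
def full_ft_state_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    weights_gb = params_billion * bytes_per_param
    return weights_gb * (1 + 1 + 2)

print(full_ft_state_gb(7))   # ~56 GB for a 7B model: beyond either card alone
```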
| Workload | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| LoRA fine-tune 7B (rank 16, ~12GB) | Fits, ~1.6x faster wall-clock | Fits |
| QLoRA 13B/14B (4-bit base) | Fits comfortably | Fits comfortably |
| QLoRA 32B (4-bit base, ~18GB) | OOM | Fits, ~9 sec/step |
| QLoRA 70B (4-bit base, ~38GB) | OOM | OOM (needs 48GB+) |
| Full fine-tune 7B FP16 | OOM with optimiser | OOM without offload (~56GB of states) |
| Full fine-tune 7B BF16 + DeepSpeed Zero-2 | OOM | Fits with offload |
For LoRA on 7B-class models, take the 5080. The training step is dominated by matmul, the 5080 is roughly 1.6x faster, and the fixed overhead easily fits in 16GB. For QLoRA on 14B and below, either card works; the 5080 will finish faster but the 3090 will not OOM. From 32B upward, the 3090 becomes the only single-GPU option in this bracket. The 70B class needs an A6000 48GB or a multi-GPU setup, which our dedicated GPU hosting page can spec out.
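For reference, the QLoRA rows correspond to a setup along these lines, assuming transformers, peft and bitsandbytes are installed; the model and LoRA hyperparameters are illustrative, not the exact benchmark configuration:

```python
# QLoRA sketch: 4-bit NF4 base weights, trainable LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B",          # the 32B case that fits the 3090 only
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # adapters are well under 1% of the base
```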
Practical Decision Matrix
The benchmark numbers boil down to a small number of clear use-case wins. If your workload appears here, the answer is unambiguous.
| Workload / Use Case | Winner | Why |
|---|---|---|
| Single-user chatbot, 7B class | RTX 5080 | Faster tokens/sec, FP8 path available |
| Batched API serving, 7B class | RTX 5080 | FP8 throughput dominates batched inference |
| Hosting Llama 3.1 70B (2-bit quant) for one user | RTX 3090 | Only card with headroom for ~17GB of weights + KV |
| Hosting Qwen 2.5 32B | RTX 3090 | 16GB OOMs with any practical batch size |
| SDXL image generation | RTX 5080 | 25-30% faster, AV1 encode for video pipelines |
| Flux.1 dev FP16 image generation | RTX 3090 | Native fit, no quality-loss quantisation needed |
| Flux.1 schnell FP8 generation | RTX 5080 | FP8 path, 4-step schedule, lowest latency |
| Real-time multimedia (FP4 latency-critical) | RTX 5080 | Only card with native FP4 tensor cores |
| QLoRA fine-tune of 7B / 14B | Either (5080 faster) | Both fit; the 5080 finishes in ~60% of the time |
| QLoRA fine-tune of 32B | RTX 3090 | 4-bit weights need ~18GB, OOMs on 5080 |
| Long-context inference (32k+ tokens) | RTX 3090 | KV cache grows linearly with context |
| Video transcoding pipelines (AV1) | RTX 5080 | 9th-gen NVENC with AV1 encode; the 3090 cannot encode AV1 |
If your workload spans several rows in this table, the question becomes how often the VRAM ceiling is hit. Most teams underestimate this: KV cache for batch=16 on an 8B model at 8k context is already well over 10GB at FP16, as the sketch below shows, so if you are uncertain, err on the side of more VRAM. Our best GPU for LLM inference guide has a deeper sizing methodology.
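A hedged sizing sketch for that claim, using Llama 3.1 8B’s published attention geometry (32 layers, 8 KV heads via GQA, head dimension 128):

```python
# KV cache size: K and V, per layer, per KV head, per token, at 2 bytes (FP16).
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, batch: int, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1e9

# Llama 3.1 8B geometry, batch 16 at 8k context:
print(kv_cache_gb(32, 8, 128, 8192, 16))   # ~17 GB of KV cache alone
# The same model at batch 1 and 2k context, as in the tables above:
print(kv_cache_gb(32, 8, 128, 2048, 1))    # ~0.27 GB
```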
Power, Noise and Ecosystem Notes
TDP is essentially identical: 360W for the 5080 versus 350W for the 3090. But the 5080 does far more work per joule, so for the same job it draws less total energy and finishes sooner, idling earlier. The 3090’s GDDR6X modules sit on the back of the PCB on Founders Edition cards and famously hit thermal limits; if you are buying used, prioritise board partner cards (Asus Strix, EVGA FTW3, MSI Suprim) and check memory junction temperatures before deploying.
On the software side, the 5080 has access to DLSS 4 with multi-frame generation, the latest TensorRT-LLM kernels with FP4 paths, current Nvidia drivers without compatibility caveats, and a noticeably lower CUDA driver overhead per launch. The 3090 still receives mainline driver support and CUDA 12.x compatibility, but it is increasingly being positioned as a legacy gaming card by Nvidia, which means optimisation work for new tensor primitives lands on Ada and Blackwell first.
Noise and physical footprint differ too: the 5080 Founders Edition is a 2-slot card, while most 3090 board partner cards are 3-slot or larger. For dense rack deployments this matters, and it is one reason hosting providers prefer the newer cards — but in practice, a properly cooled 3090 in a colocated chassis is still entirely viable in 2026. Our RTX 3090 hosting page lists the chassis configurations we use.
Pricing and Availability in 2026
Pricing is the lever that often decides the whole question. The 5080 retails at around £999 new in the UK, with stock that has historically been tight at launch but has eased in 2026. The 3090 has not been manufactured for years, so its market is entirely used: typical UK prices in mid-2026 sit between £700 and £900 depending on cosmetic condition and warranty.
| Card | Typical UK price (2026) | VRAM | Cost per GB VRAM |
|---|---|---|---|
| RTX 5080 16GB (new) | £999 | 16 GB | £62.4 / GB |
| RTX 3090 24GB (used, average) | £800 | 24 GB | £33.3 / GB |
| RTX 3090 24GB (used, premium board) | £900 | 24 GB | £37.5 / GB |
| RTX 4090 24GB (new, where available) | £1,800 | 24 GB | £75.0 / GB |
| RTX 5090 32GB (new) | £1,999 | 32 GB | £62.5 / GB |
On a pure cost-per-GB basis the used 3090 wins by a comfortable margin, and that is exactly the niche it has occupied since 2022: VRAM at a discount in exchange for two generations of compute. For workloads where VRAM is binary (either the model fits or it doesn't), that is the right deal. For workloads where compute speed and FP8/FP4 paths matter more than headroom, the 5080 is better value despite the higher per-GB number.
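The cost-per-GB column is straightforward division, reproduced here so you can plug in whatever prices you actually see on the used market:

```python
# Price per GB of VRAM for a few of the cards in the table above.
cards = {
    "RTX 5080 16GB (new)": (999, 16),
    "RTX 3090 24GB (used)": (800, 24),
    "RTX 5090 32GB (new)": (1999, 32),
}
for name, (price_gbp, vram_gb) in cards.items():
    print(f"{name}: £{price_gbp / vram_gb:.1f} per GB")
```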
If you want to skip the buying decision entirely and rent rather than own, both cards are available on monthly hosting plans. See our RTX 4090 hosting cost breakdown for the equivalent maths on the next tier up, the cheapest GPU for AI inference for entry-tier options, and the cost per 1M tokens analysis for whether self-hosting beats OpenAI API pricing on your usage profile. Our 4090 vs 5090 decision guide applies the same compute-versus-VRAM analysis one tier up.
Verdict
The RTX 5080 wins on newer codecs, FP4 and FP8 throughput, better single-stream latency for 7B-class models, and lower power per workload. The RTX 3090 wins on raw VRAM, and that is decisive when you need to squeeze Llama 3.1 70B on at a 2-bit quant, run Flux.1 dev at FP16 without quality compromises, do longer-context inference with meaningful batching, or QLoRA-fine-tune anything 32B or larger.
The decision rule is short. If your workload fits comfortably in 16GB, take the 5080: it is faster, more efficient, and has a longer software runway ahead of it. If your workload doesn’t fit in 16GB, take the 3090: it is currently the cheapest path to 24GB of CUDA VRAM in the UK market, and that ceiling is the only thing that matters when the alternative is OOM.
Want to test either card before committing to hardware? Spin up a dedicated RTX 3090 or RTX 5080 instance with gigagpu — UK-hosted, monthly billing, no quotas, full root and CUDA access from day one.