
RTX 5080 16GB vs RTX 3090 24GB: Compute or VRAM in 2026?

RTX 5080 16GB vs RTX 3090 24GB in 2026: FP4/FP8 throughput against raw VRAM. Hard numbers, per-model benchmarks and an honest verdict.

Overview: The Core Trade-off

The RTX 5080 16GB and the RTX 3090 24GB sit on opposite sides of a trade-off that defines a lot of 2026 GPU shopping decisions: newer compute or more VRAM? The 5080 is a Blackwell card built on the GB203 die, with 5th-gen tensor cores that include native FP4 support, GDDR7 memory, PCIe Gen 5 and NVENC with AV1 encode. It is roughly 60% faster than the 3090 in raw FP32, and several multiples faster in low-precision tensor throughput thanks to FP8 and FP4 paths the older card simply does not have.

The RTX 3090, by contrast, is an Ampere card on GA102. It launched in 2020 at $1,499, was Nvidia’s first prosumer 24GB GeForce, and despite being two generations old it remains the cheapest sane way to put 24GB of CUDA-compatible VRAM in a workstation. It has no FP8 or FP4 tensor paths, its NVENC is two generations behind, and it is famously hot-running. But that 24GB sticks around as a hard, decisive advantage in workloads where 16GB simply does not fit.

This post puts the two cards side by side with hard numbers across LLM inference, image generation, fine-tuning and price-per-VRAM. The verdict is workload-dependent, and if you are weighing similar trade-offs further up the stack, our RTX 4090 vs RTX 5090 deep-dive covers the same generational question one tier above.

Spec Sheet, Side by Side

Before we get into benchmarks, the headline silicon difference: Blackwell’s much larger L2 cache and faster GDDR7 give the 5080 better effective memory utilisation per gigabyte, while the 3090 simply has more raw capacity. Both cards are built around the same broad CUDA core count and similar TDPs, so the differences come from architecture, not transistor budget.

| Specification | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| Architecture | Blackwell (GB203) | Ampere (GA102) |
| Process | TSMC 4NP | Samsung 8N |
| Launch | January 2025 | September 2020 |
| CUDA cores | 10,752 | 10,496 |
| Tensor cores | 336 (5th gen, FP4 native) | 328 (3rd gen, FP16/INT8) |
| RT cores | 84 (4th gen) | 82 (2nd gen) |
| Base / boost clock | 2.30 / 2.62 GHz | 1.40 / 1.70 GHz |
| VRAM | 16 GB GDDR7 | 24 GB GDDR6X |
| Memory bus | 256-bit | 384-bit |
| Memory bandwidth | 960 GB/s | 936 GB/s |
| L2 cache | 64 MB | 6 MB |
| TDP | 360 W | 350 W |
| PCIe | Gen 5 x16 | Gen 4 x16 |
| NVENC | 9th gen (AV1 enc/dec) | 7th gen (no AV1 encode) |
| Display | DisplayPort 2.1b UHBR20 | DisplayPort 1.4a |
| Launch price (USD) | ~$999 | $1,499 |

The L2 cache jump from 6MB to 64MB is the quiet headline. It dramatically reduces how often Blackwell has to round-trip to GDDR7 during attention and matmul, which is part of why the 5080 punches above its theoretical bandwidth advantage. GDDR7 itself is only ~3% faster than the 3090’s GDDR6X on paper, but the cache hierarchy makes the effective gap much wider in practice.

Raw Compute Comparison

Below are dense (non-sparse) tensor throughput figures published by Nvidia, with FP32 measured on standard CUDA paths. Sparse mode roughly doubles the tensor numbers if your model supports 2:4 structured sparsity, which most production LLMs do not.

| Precision | RTX 5080 16GB | RTX 3090 24GB | Speedup (5080) |
|---|---|---|---|
| FP32 (CUDA) | ~56 TFLOPs | ~35 TFLOPs | 1.6x |
| FP16 / BF16 tensor | ~225 TFLOPs | ~142 TFLOPs | 1.6x |
| INT8 tensor | ~450 TOPS | ~284 TOPS | 1.6x |
| FP8 tensor | ~450 TFLOPs | Not supported | n/a |
| FP4 tensor | ~900 TFLOPs | Not supported | n/a |

If your stack lives in FP16 or BF16, the 5080 is around 1.6x faster than a 3090 on raw matmul, which mirrors the FP32 ratio almost exactly. The bigger story is FP8 and FP4: the 3090 simply cannot run them as native tensor ops, so any framework that targets FP8 (Marlin, vLLM’s FP8 KV cache, TensorRT-LLM’s FP8 paths) gets a free 2x or more on the 5080 with no equivalent on the 3090. Our FP8 Llama deployment guide walks through what FP8 actually does in production and why it matters more than the headline TFLOP number suggests.

That said, raw TFLOPs only matter if the workload is compute-bound. LLM token generation is memory-bound, image generation is mixed, and full fine-tuning is activation-memory-bound. The compute table is the start of the analysis, not the end.

LLM Inference, Per Model

This is where the VRAM gap bites hardest. A dense LLM’s weights have to fit, plus KV cache that grows linearly with batch size and context length, plus framework overhead. The figures below assume vLLM 0.6+ with paged attention, batch size 1 unless stated, 2k context window, and the highest precision the card can run natively.
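A minimal sizing sketch for the "has to fit" arithmetic, assuming a Llama-3.1-8B-like shape (32 layers, 8 grouped-query KV heads, head_dim 128) and an FP16 KV cache; framework overhead comes on top:

```python
# Hypothetical sizing helpers (not from the benchmark harness): weights plus
# worst-case KV cache for a dense transformer.

def weights_gib(params_billion, bytes_per_param=2):
    return params_billion * 1e9 * bytes_per_param / 1024**3

def kv_cache_gib(layers, kv_heads, head_dim, context, batch, bytes_per_elem=2):
    # leading 2 accounts for separate K and V tensors per layer
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1024**3

# Llama 3.1 8B in FP16, 2k context, batch 1:
total = weights_gib(8) + kv_cache_gib(32, 8, 128, 2048, 1)
# ~14.9 GiB weights + ~0.25 GiB KV cache: tight but workable on a 16GB card
```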

| Model | Quantisation | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|---|
| Llama 3.1 8B | FP16 | ~85 tok/s (fits) | ~70 tok/s (fits) |
| Llama 3.1 8B | FP8 (W8A8) | ~140 tok/s | Not native |
| Mistral 7B | FP16 | ~92 tok/s | ~74 tok/s |
| Llama 3 13B | FP16 | OOM (needs ~26GB) | ~38 tok/s |
| Llama 3 13B | AWQ INT4 | ~95 tok/s | ~70 tok/s |
| Qwen 2.5 32B | AWQ INT4 | OOM (16GB too tight) | ~28 tok/s (KV pruned) |
| Llama 3.1 70B | AWQ INT4 (~17GB weights) | OOM with KV cache | ~14 tok/s (single user) |

The pattern is clean. For 7B-class models the 5080 is roughly 20-25% faster in FP16 and almost 2x faster if you can drop to FP8. For 13B-class models in INT4 both fit but the 5080 is faster. For anything from 32B upward, the 16GB ceiling becomes the deciding factor and the 3090 becomes the only option in this price bracket. Llama 3 70B at INT4 is a particularly notable case because the weights themselves are around 17GB, so even before you allocate any KV cache the 5080 has already overflowed.

If 70B-class hosting is your goal, see our full Llama 3 VRAM requirements breakdown and the AWQ quantisation guide, both of which apply equally to 24GB Ampere and Ada cards. For deployment patterns that maximise tokens per second per pound, our vLLM production setup guide covers the framework choices that make the most of either card.

Image Generation, Per Model

Diffusion is mixed compute and memory pressure. Stable Diffusion XL fits comfortably in either card; Flux.1 dev FP16 does not fit in 16GB without aggressive offloading and quality-degrading quantisation. Times below are end-to-end with default schedulers, single image, 1024×1024, no batching, ComfyUI as the backend.

| Workload | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| SDXL FP16, 30 steps, 1024² | ~6.0 s | ~8.2 s |
| SDXL Turbo FP16, 4 steps | ~0.9 s | ~1.3 s |
| Flux.1 dev FP16 (~24GB needed) | OOM without offload | ~22 s native |
| Flux.1 dev Q5 GGUF / FP8 | ~12 s | ~17 s |
| Flux.1 schnell FP8, 4 steps | ~0.5 s | ~1.4 s (FP16 fallback) |
| SD3.5 Large FP16 | OOM | ~14 s |

The 5080 wins SDXL by roughly 25-30% per generation, which compounds quickly across thousands of images. It also wins Flux.1 schnell decisively because schnell only needs four steps and benefits from FP8 inference paths the 3090 cannot use. But Flux.1 dev at full FP16 is a 24GB model in practice, and if you want native fidelity rather than a Q5 GGUF approximation you need the 3090. The same applies to SD3.5 Large in FP16.

This mirrors what we see across Ada and Blackwell more broadly. If you are hesitating between newer 16GB cards and older 24GB cards for diffusion specifically, our RTX 4090 spec breakdown and the RTX 5060 Ti FP8 guide give useful upper and lower bounds on the same architectural trade-off.

Training and Fine-tuning

Training is where activation memory dominates and the 16GB limit really starts to hurt. LoRA is fine on either card because it only updates rank-decomposed adapter weights. Full fine-tuning needs full activations, optimiser states (Adam = 2x weight memory) and gradients (1x weight memory) all in VRAM at once.
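The accounting in that paragraph can be sketched numerically. This is a rough static-memory estimate only (activations and framework overhead come on top, and mixed-precision setups that keep Adam moments in FP32 need even more):

```python
# Static memory for full fine-tuning with Adam, per the weights/gradients/
# optimiser-state breakdown above. Activations are deliberately excluded.

def full_finetune_static_gib(params_billion, bytes_per_param=2):
    weights = params_billion * 1e9 * bytes_per_param
    grads = weights              # one gradient per weight
    adam_states = 2 * weights    # first and second moment estimates
    return (weights + grads + adam_states) / 1024**3

# 7B in FP16: ~52 GiB of static state alone, before any activations --
# far beyond either card without offloading or sharding.
```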

| Workload | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| LoRA fine-tune 7B (rank 16, ~12GB) | Fits, ~1.6x faster wall-clock | Fits |
| QLoRA 13B/14B (4-bit base) | Fits comfortably | Fits comfortably |
| QLoRA 32B (4-bit base, ~18GB) | OOM | Fits, ~9 sec/step |
| QLoRA 70B (4-bit base, ~38GB) | OOM | OOM (needs 48GB+) |
| Full fine-tune 7B FP16 | OOM with optimiser | Borderline, gradient checkpointing required |
| Full fine-tune 7B BF16 + DeepSpeed ZeRO-2 | OOM | Fits with offload |

For LoRA on 7B-class models, take the 5080. The training step is dominated by matmul, the 5080 is roughly 1.6x faster, and the base weights plus adapter state fit easily in 16GB. For QLoRA on 14B and below, either card works; the 5080 will finish faster but the 3090 will not OOM. From 32B upward, the 3090 becomes the only single-GPU option in this bracket. The 70B class needs an A6000 48GB or a multi-GPU setup, which our dedicated GPU hosting page can spec out.

Practical Decision Matrix

The benchmark numbers boil down to a small number of clear use-case wins. If your workload appears here, the answer is unambiguous.

| Workload / Use Case | Winner | Why |
|---|---|---|
| Single-user chatbot, 7B class | RTX 5080 | Faster tokens/sec, FP8 path available |
| Batched API serving, 7B class | RTX 5080 | FP8 throughput dominates batched inference |
| Hosting Llama 3 70B AWQ for one user | RTX 3090 | Only card with VRAM headroom for INT4 weights + KV |
| Hosting Qwen 2.5 32B | RTX 3090 | 16GB OOMs with any practical batch size |
| SDXL image generation | RTX 5080 | 25-30% faster, AV1 encode for video pipelines |
| Flux.1 dev FP16 image generation | RTX 3090 | Native fit, no quality-loss quantisation needed |
| Flux.1 schnell FP8 generation | RTX 5080 | FP8 path, 4-step schedule, lowest latency |
| Real-time multimedia (FP4 latency-critical) | RTX 5080 | Only card with native FP4 tensor cores |
| QLoRA fine-tune of 7B / 14B | Either (5080 faster) | Both fit, 5080 finishes in ~60% of the time |
| QLoRA fine-tune of 32B | RTX 3090 | 4-bit weights need ~18GB, OOMs on 5080 |
| Long-context inference (32k+ tokens) | RTX 3090 | KV cache grows linearly with context |
| Video transcoding pipelines (AV1) | RTX 5080 | 9th-gen NVENC with AV1 encode, 3090 has none |

If your workload spans several rows in this table, the question becomes how often the VRAM ceiling is hit. Most teams underestimate this — KV cache for batch=16 on a 7B model at 8k context is already several gigabytes — so if you are uncertain, err on the side of more VRAM. Our best GPU for LLM inference guide has a deeper sizing methodology.
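The KV-cache warning above can be made concrete with a worst-case calculation, assuming a Llama-3-8B-like shape (32 layers, 8 GQA KV heads, head_dim 128) and an FP16 cache where every sequence in the batch reserves the full context window:

```python
# Worst-case KV cache footprint: every sequence preallocates the full context.
# Paged attention (vLLM) only allocates blocks sequences actually use, so the
# real figure sits somewhere below this ceiling.

def kv_cache_gib(layers, kv_heads, head_dim, context, batch, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1024**3

worst_case = kv_cache_gib(32, 8, 128, 8192, 16)
# 16.0 GiB of KV cache alone at batch=16 / 8k context -- as much as the
# 5080's entire VRAM, before a single weight is loaded
```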

Power, Noise and Ecosystem Notes

TDP is essentially identical: 360W for the 5080 versus 350W for the 3090. The 5080 is far more efficient per watt because it does more compute per joule, so for the same workload it will draw less average power and finish sooner, idling earlier. The 3090’s GDDR6X modules sit on the back of the PCB on Founders Edition cards and famously hit thermal limits — if you are buying used, prioritise board partner cards (Asus Strix, EVGA FTW3, MSI Suprim) and check memory junction temperatures before deploying.

On the software side, the 5080 has access to DLSS 4 with multi-frame generation, the latest TensorRT-LLM kernels with FP4 paths, current Nvidia drivers without compatibility caveats, and a noticeably lower CUDA driver overhead per launch. The 3090 still receives mainline driver support and CUDA 12.x compatibility, but it is increasingly being positioned as a legacy gaming card by Nvidia, which means optimisation work for new tensor primitives lands on Ada and Blackwell first.

Noise and physical footprint differ too: the 5080 Founders Edition is a 2-slot card, while most 3090 board partner cards are 3-slot or larger. For dense rack deployments this matters, and it is one reason hosting providers prefer the newer cards — but in practice, a properly cooled 3090 in a colocated chassis is still entirely viable in 2026. Our RTX 3090 hosting page lists the chassis configurations we use.

Pricing and Availability in 2026

Pricing is the lever that often decides the whole question. The 5080 retails at around £999 new in the UK, with stock that has historically been tight at launch but has eased in 2026. The 3090 has not been manufactured for years, so its market is entirely used: typical UK prices in mid-2026 sit between £700 and £900 depending on cosmetic condition and warranty.

| Card | Typical UK price (2026) | VRAM | Cost per GB VRAM |
|---|---|---|---|
| RTX 5080 16GB (new) | £999 | 16 GB | £62.4 / GB |
| RTX 3090 24GB (used, average) | £800 | 24 GB | £33.3 / GB |
| RTX 3090 24GB (used, premium board) | £900 | 24 GB | £37.5 / GB |
| RTX 4090 24GB (new, where available) | £1,800 | 24 GB | £75.0 / GB |
| RTX 5090 32GB (new) | £1,999 | 32 GB | £62.5 / GB |

On a pure cost-per-GB basis the used 3090 wins by a comfortable margin, and that is exactly the trade-off it has occupied since 2022. You are buying VRAM at a discount in exchange for two generations of compute. For workloads where VRAM is binary — either you fit the model or you don’t — that is the right deal. For workloads where compute speed and FP8/FP4 paths matter more than headroom, the 5080 is better value despite the higher per-GB number.
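For anyone adapting the comparison to current prices, the cost-per-GB column reduces to a one-liner (prices here are the article's mid-2026 UK figures, not live quotes):

```python
# Cost-per-GB-of-VRAM calculation behind the pricing table above.

def cost_per_gb(price_gbp, vram_gb):
    return round(price_gbp / vram_gb, 1)

cards = {
    "RTX 5080 16GB (new)": (999, 16),
    "RTX 3090 24GB (used)": (800, 24),
    "RTX 5090 32GB (new)": (1999, 32),
}
per_gb = {name: cost_per_gb(p, v) for name, (p, v) in cards.items()}
# {'RTX 5080 16GB (new)': 62.4, 'RTX 3090 24GB (used)': 33.3, 'RTX 5090 32GB (new)': 62.5}
```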

If you want to skip the buying decision entirely and rent rather than own, both cards are available on monthly hosting plans. See our RTX 4090 hosting cost breakdown for the equivalent maths on the next tier up, the cheapest GPU for AI inference for entry-tier options, and the cost per 1M tokens analysis for whether self-hosting beats OpenAI API pricing on your usage profile. Our 4090 vs 5090 decision guide applies the same compute-versus-VRAM analysis one tier up.

Verdict

The RTX 5080 wins on newer codecs, FP4 and FP8 throughput, better single-stream latency for 7B-class models, and lower power per workload. The RTX 3090 wins on raw VRAM — and that is decisive when you need to host Llama 3 70B AWQ, run Flux.1 dev at FP16 without quality compromises, do longer-context inference with meaningful batching, or QLoRA-fine-tune anything 32B or larger.

The decision rule is short. If your workload fits comfortably in 16GB, take the 5080: it is faster, more efficient, and has a longer software runway ahead of it. If your workload doesn’t fit in 16GB, take the 3090: it is currently the cheapest path to 24GB of CUDA VRAM in the UK market, and that ceiling is the only thing that matters when the alternative is OOM.

Want to test either card before committing to hardware? Spin up a dedicated RTX 3090 or RTX 5080 instance with gigagpu — UK-hosted, monthly billing, no quotas, full root and CUDA access from day one.
