Overview: The Core Trade-off
The RTX 5080 16GB and the RTX 3090 24GB sit on opposite sides of a trade-off that defines a lot of 2026 GPU shopping decisions: newer compute or more VRAM? The 5080 is a Blackwell card built on the GB203 die, with 5th-gen tensor cores that include native FP4 support, GDDR7 memory, PCIe Gen 5 and NVENC with AV1 encode. It is roughly 60% faster than the 3090 in raw FP32, and several multiples faster in low-precision tensor throughput thanks to FP8 and FP4 paths the older card simply does not have.
The RTX 3090, by contrast, is an Ampere card on GA102. It launched in September 2020 at $1,499 as Nvidia’s first 24GB prosumer GeForce, and despite being two generations old it remains the cheapest sane way to put 24GB of CUDA-compatible VRAM in a workstation. It has no FP8 or FP4 tensor paths, its NVENC is two generations behind, and it is famously hot-running. But that 24GB remains a hard, decisive advantage in workloads where 16GB simply does not fit.
This post puts the two cards side by side with hard numbers across LLM inference, image generation, fine-tuning and price-per-VRAM. The verdict is workload-dependent, and if you are weighing similar trade-offs further up the stack, our RTX 4090 vs RTX 5090 deep-dive covers the same generational question one tier above.
Spec Sheet, Side by Side
Before we get into benchmarks, the headline silicon difference: Blackwell’s much larger L2 cache and faster GDDR7 give the 5080 better effective memory utilisation per gigabyte, while the 3090 simply has more raw capacity. Both cards are built around the same broad CUDA core count and similar TDPs, so the differences come from architecture, not transistor budget.
| Specification | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| Architecture | Blackwell (GB203) | Ampere (GA102) |
| Process | TSMC 4NP | Samsung 8N |
| Launch | January 2025 | September 2020 |
| CUDA cores | 10,752 | 10,496 |
| Tensor cores | 336 (5th gen, FP4 native) | 328 (3rd gen, FP16/INT8) |
| RT cores | 84 (4th gen) | 82 (2nd gen) |
| Base / Boost clock | 2.30 / 2.62 GHz | 1.40 / 1.70 GHz |
| VRAM | 16 GB GDDR7 | 24 GB GDDR6X |
| Memory bus | 256-bit | 384-bit |
| Memory bandwidth | 960 GB/s | 936 GB/s |
| L2 cache | 64 MB | 6 MB |
| TDP | 360 W | 350 W |
| PCIe | Gen 5 x16 | Gen 4 x16 |
| NVENC | 9th gen (AV1 enc/dec) | 7th gen (no AV1 encode) |
| Display | DisplayPort 2.1b UHBR20 | DisplayPort 1.4a |
| Launch price (USD) | ~$999 | $1,499 |
The L2 cache jump from 6MB to 64MB is the quiet headline. It dramatically reduces how often Blackwell has to round-trip to GDDR7 during attention and matmul, which is part of why the 5080 punches above its theoretical bandwidth advantage. GDDR7 itself is only ~3% faster than the 3090’s GDDR6X on paper, but the cache hierarchy makes the effective gap much wider in practice.
Raw Compute Comparison
Below are dense (non-sparse) tensor throughput figures published by Nvidia, with FP32 measured on standard CUDA paths. Sparse mode roughly doubles the tensor numbers if your model supports 2:4 structured sparsity, which most production LLMs do not.
| Precision | RTX 5080 16GB | RTX 3090 24GB | Speedup (5080) |
|---|---|---|---|
| FP32 (CUDA) | ~56 TFLOPS | ~35 TFLOPS | 1.6x |
| FP16 / BF16 tensor | ~225 TFLOPS | ~142 TFLOPS | 1.6x |
| INT8 tensor | ~450 TOPS | ~284 TOPS | 1.6x |
| FP8 tensor | ~450 TFLOPS | Not supported | n/a |
| FP4 tensor | ~900 TFLOPS | Not supported | n/a |
If your stack lives in FP16 or BF16, the 5080 is around 1.6x faster than a 3090 on raw matmul, which mirrors the FP32 ratio almost exactly. The bigger story is FP8 and FP4: the 3090 simply cannot run them as native tensor ops, so any framework that targets FP8 (Marlin, vLLM’s FP8 KV cache, TensorRT-LLM’s FP8 paths) gets a free 2x or more on the 5080 with no equivalent on the 3090. Our FP8 Llama deployment guide walks through what FP8 actually does in production and why it matters more than the headline TFLOP number suggests.
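To make that concrete, here is a minimal sketch of what the FP8 path looks like in practice, assuming vLLM 0.6+ with its built-in dynamic FP8 quantisation. The model choice and settings are illustrative, not the exact benchmark configuration used later in this post:

```python
# Minimal vLLM FP8 serving sketch (assumes vLLM 0.6+ and an FP8-capable
# GPU such as the RTX 5080; the 3090 has no native FP8 tensor path).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    quantization="fp8",            # dynamic FP8 weight quantisation
    kv_cache_dtype="fp8",          # FP8 KV cache halves the cache footprint too
    max_model_len=2048,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarise the FP8 vs FP16 trade-off."], params)
print(outputs[0].outputs[0].text)
```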
That said, raw TFLOPs only matter if the workload is compute-bound. LLM token generation is memory-bound, image generation is mixed, and full fine-tuning is activation-memory-bound. The compute table is the start of the analysis, not the end.
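A quick roofline-style calculation shows why. At batch size 1, each weight fetched from VRAM is used in a single multiply-add, so the arithmetic intensity of decode is around 1 FLOP per byte, far below what either card needs to keep its tensor cores busy. The figures below are derived from the spec and compute tables above:

```python
# Roofline napkin math: batch-1 decode barely reuses each weight byte,
# so both cards are starved for bandwidth rather than compute.
def arithmetic_intensity(batch: int, bytes_per_weight: float = 2.0) -> float:
    # 2 FLOPs (multiply + add) per weight, per sequence in the batch.
    return 2.0 * batch / bytes_per_weight

def machine_balance(tflops: float, bandwidth_gb_s: float) -> float:
    # FLOPs the card can execute per byte it can fetch from VRAM.
    return tflops * 1e12 / (bandwidth_gb_s * 1e9)

print(arithmetic_intensity(1))        # ~1 FLOP/byte at batch 1
print(machine_balance(225, 960))      # RTX 5080: ~234 FLOPs/byte to saturate
print(machine_balance(142, 936))      # RTX 3090: ~152 FLOPs/byte to saturate
```

Until batching or long prefills push intensity toward those balance points, the extra TFLOPs sit idle, which is why the 3090’s near-identical memory bandwidth keeps it surprisingly competitive at batch 1.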
LLM Inference, Per Model
This is where the VRAM gap bites hardest. A dense LLM’s weights have to fit, plus KV cache that grows linearly with batch size and context length, plus framework overhead. The figures below assume vLLM 0.6+ with paged attention, batch size 1 unless stated, 2k context window, and the highest precision the card can run natively.
| Model | Quantisation | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|---|
| Llama 3.1 8B | FP16 | ~85 tok/s (fits) | ~70 tok/s (fits) |
| Llama 3.1 8B | FP8 (W8A8) | ~140 tok/s | Not native |
| Mistral 7B | FP16 | ~92 tok/s | ~74 tok/s |
| Llama 2 13B | FP16 | OOM (needs ~26GB) | ~38 tok/s |
| Llama 2 13B | AWQ INT4 | ~95 tok/s | ~70 tok/s |
| Qwen 2.5 32B | AWQ INT4 | OOM (16GB too tight) | ~28 tok/s (limited KV cache) |
| Llama 3.1 70B | 2-bit GGUF (~17GB weights) | OOM with KV cache | ~14 tok/s (single user) |
The pattern is clean. For 7B-class models the 5080 is roughly 20-25% faster in FP16 and almost 2x faster if you can drop to FP8. For 13B-class models in INT4 both fit, but the 5080 is faster. From 32B upward, the 16GB ceiling becomes the deciding factor and the 3090 becomes the only option in this price bracket. Llama 3.1 70B is the extreme case: even at an aggressive 2-bit quant the weights alone are around 17GB, so the 5080 has overflowed before any KV cache is allocated, and a standard AWQ INT4 build (~35GB, consistent with the QLoRA figures later in this post) does not fit on either card.
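The OOM rows fall straight out of weight-footprint arithmetic. A hedged rule of thumb, where the effective bits per weight include a rough allowance for quantisation scales and exact sizes vary by format:

```python
# Approximate weight footprint: parameter count times effective bits per weight.
def weight_gb(params_billion: float, bits: float) -> float:
    return params_billion * bits / 8

print(weight_gb(13, 16))    # 13B FP16        -> ~26 GB: OOM on the 5080
print(weight_gb(32, 4.5))   # 32B AWQ INT4    -> ~18 GB: needs the 3090
print(weight_gb(70, 2.0))   # 70B 2-bit GGUF  -> ~17.5 GB: 3090 only, barely
print(weight_gb(70, 4.5))   # 70B AWQ INT4    -> ~39 GB: OOM on both cards
```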
If 70B-class hosting is your goal, see our full Llama 3 VRAM requirements breakdown and the AWQ quantisation guide, both of which apply equally to 24GB Ampere and Ada cards. For deployment patterns that maximise tokens per second per pound, our vLLM production setup guide covers the framework choices that make the most of either card.
Image Generation, Per Model
Diffusion is mixed compute and memory pressure. Stable Diffusion XL fits comfortably in either card; Flux.1 dev FP16 does not fit in 16GB without aggressive offloading and quality-degrading quantisation. Times below are end-to-end with default schedulers, single image, 1024×1024, no batching, ComfyUI as the backend.
| Workload | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| SDXL FP16, 30 steps, 1024² | ~6.0 s | ~8.2 s |
| SDXL Turbo FP16, 4 steps | ~0.9 s | ~1.3 s |
| Flux.1 dev FP16 (~24GB needed) | OOM without offload | ~22 s native |
| Flux.1 dev Q5 GGUF / FP8 | ~12 s | ~17 s |
| Flux.1 schnell FP8, 4 steps | ~0.5 s | ~1.4 s (FP16 fallback) |
| SD3.5 Large FP16 | OOM | ~14 s |
The 5080 wins SDXL by roughly 25-30% per generation, which compounds quickly across thousands of images. It also wins Flux.1 schnell decisively because schnell only needs four steps and benefits from FP8 inference paths the 3090 cannot use. But Flux.1 dev at full FP16 is a 24GB model in practice, and if you want native fidelity rather than a Q5 GGUF approximation you need the 3090. The same applies to SD3.5 Large in FP16.
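The benchmarks above were run through ComfyUI, but the fit constraints are easy to reproduce from Python with diffusers. A minimal sketch, assuming diffusers 0.30+ and the public FLUX.1-schnell weights; on a 16GB card the CPU-offload line is what makes it run at all:

```python
# Flux.1 schnell, 4-step generation with diffusers (illustrative config).
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
# On 16GB cards, stream submodules in and out of VRAM instead of .to("cuda");
# a 24GB card can keep the whole pipeline resident.
pipe.enable_model_cpu_offload()

image = pipe(
    "a rack of GPUs in a datacentre, studio lighting",
    num_inference_steps=4,   # schnell is distilled for 4-step sampling
    guidance_scale=0.0,      # schnell is trained without CFG
    height=1024, width=1024,
).images[0]
image.save("flux_schnell.png")
```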
This mirrors what we see across Ada and Blackwell more broadly. If you are hesitating between newer 16GB cards and older 24GB cards for diffusion specifically, our RTX 4090 spec breakdown and the RTX 5060 Ti FP8 guide give useful upper and lower bounds on the same architectural trade-off.
Training and Fine-tuning
Training is where activation memory dominates and the 16GB limit really starts to hurt. LoRA is fine on either card because it only updates rank-decomposed adapter weights. Full fine-tuning needs the full activations, gradients (1x weight memory) and Adam optimiser states (2x weight memory) all in VRAM at once, as the arithmetic sketch below shows.
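A minimal sketch of that rule of thumb, with two caveats: activation memory comes on top and depends on batch size, sequence length and checkpointing, and this optimistically keeps the Adam states at the same precision as the weights (FP32 states push it higher):

```python
# Full fine-tune state memory: weights + gradients (1x) + Adam moments (2x).
# Activations are excluded; they add more on top of this floor.
def full_ft_state_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    weights_gb = params_billion * bytes_per_param
    return weights_gb * (1 + 1 + 2)

print(full_ft_state_gb(7))   # ~56 GB for a 7B model: beyond either card alone
```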
| Workload | RTX 5080 16GB | RTX 3090 24GB |
|---|---|---|
| LoRA fine-tune 7B (rank 16, ~12GB) | Fits, ~1.6x faster wall-clock | Fits |
| QLoRA 13B/14B (4-bit base) | Fits comfortably | Fits comfortably |
| QLoRA 32B (4-bit base, ~18GB) | OOM | Fits, ~9 sec/step |
| QLoRA 70B (4-bit base, ~38GB) | OOM | OOM (needs 48GB+) |
| Full fine-tune 7B FP16 | OOM with optimiser | OOM without offload (~56GB of states) |
| Full fine-tune 7B BF16 + DeepSpeed Zero-2 | OOM | Fits with offload |
For LoRA on 7B-class models, take the 5080. The training step is dominated by matmul, the 5080 is roughly 1.6x faster, and the fixed overhead easily fits in 16GB. For QLoRA on 14B and below, either card works; the 5080 will finish faster but the 3090 will not OOM. From 32B upward, the 3090 becomes the only single-GPU option in this bracket. The 70B class needs an A6000 48GB or a multi-GPU setup, which our dedicated GPU hosting page can spec out.
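For reference, the QLoRA rows correspond to a setup along these lines, assuming transformers, peft and bitsandbytes are installed; the model and LoRA hyperparameters are illustrative, not the exact benchmark configuration:

```python
# QLoRA sketch: 4-bit NF4 base weights, trainable LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B",          # the 32B case that fits the 3090 only
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # adapters are well under 1% of the base
```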
Practical Decision Matrix
The benchmark numbers boil down to a small number of clear use-case wins. If your workload appears here, the answer is unambiguous.
| Workload / Use Case | Winner | Why |
|---|---|---|
| Single-user chatbot, 7B class | RTX 5080 | Faster tokens/sec, FP8 path available |
| Batched API serving, 7B class | RTX 5080 | FP8 throughput dominates batched inference |
| Hosting Llama 3.1 70B (2-bit quant) for one user | RTX 3090 | Only card with headroom for ~17GB of weights + KV |
| Hosting Qwen 2.5 32B | RTX 3090 | 16GB OOMs with any practical batch size |
| SDXL image generation | RTX 5080 | 25-30% faster, AV1 encode for video pipelines |
| Flux.1 dev FP16 image generation | RTX 3090 | Native fit, no quality-loss quantisation needed |
| Flux.1 schnell FP8 generation | RTX 5080 | FP8 path, 4-step schedule, lowest latency |
| Real-time multimedia (FP4 latency-critical) | RTX 5080 | Only card with native FP4 tensor cores |
| QLoRA fine-tune of 7B / 14B | Either (5080 faster) | Both fit; the 5080 finishes in ~60% of the time |
| QLoRA fine-tune of 32B | RTX 3090 | 4-bit weights need ~18GB, OOMs on 5080 |
| Long-context inference (32k+ tokens) | RTX 3090 | KV cache grows linearly with context |
| Video transcoding pipelines (AV1) | RTX 5080 | 9th-gen NVENC with AV1 encode; the 3090 cannot encode AV1 |
If your workload spans several rows in this table, the question becomes how often the VRAM ceiling is hit. Most teams underestimate this: KV cache for batch=16 on an 8B model at 8k context is already well over 10GB at FP16, as the sketch below shows, so if you are uncertain, err on the side of more VRAM. Our best GPU for LLM inference guide has a deeper sizing methodology.
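A hedged sizing sketch for that claim, using Llama 3.1 8B’s published attention geometry (32 layers, 8 KV heads via GQA, head dimension 128):

```python
# KV cache size: K and V, per layer, per KV head, per token, at 2 bytes (FP16).
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, batch: int, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1e9

# Llama 3.1 8B geometry, batch 16 at 8k context:
print(kv_cache_gb(32, 8, 128, 8192, 16))   # ~17 GB of KV cache alone
# The same model at batch 1 and 2k context, as in the tables above:
print(kv_cache_gb(32, 8, 128, 2048, 1))    # ~0.27 GB
```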
Power, Noise and Ecosystem Notes
TDP is essentially identical: 360W for the 5080 versus 350W for the 3090. But the 5080 does far more work per joule, so for the same job it draws less total energy and finishes sooner, idling earlier. The 3090’s GDDR6X modules sit on the back of the PCB on Founders Edition cards and famously hit thermal limits; if you are buying used, prioritise board partner cards (Asus Strix, EVGA FTW3, MSI Suprim) and check memory junction temperatures before deploying.
On the software side, the 5080 has access to DLSS 4 with multi-frame generation, the latest TensorRT-LLM kernels with FP4 paths, current Nvidia drivers without compatibility caveats, and a noticeably lower CUDA driver overhead per launch. The 3090 still receives mainline driver support and CUDA 12.x compatibility, but it is increasingly being positioned as a legacy gaming card by Nvidia, which means optimisation work for new tensor primitives lands on Ada and Blackwell first.
Noise and physical footprint differ too: the 5080 Founders Edition is a 2-slot card, while most 3090 board partner cards are 3-slot or larger. For dense rack deployments this matters, and it is one reason hosting providers prefer the newer cards — but in practice, a properly cooled 3090 in a colocated chassis is still entirely viable in 2026. Our RTX 3090 hosting page lists the chassis configurations we use.
Pricing and Availability in 2026
Pricing is the lever that often decides the whole question. The 5080 retails at around £999 new in the UK, with stock that has historically been tight at launch but has eased in 2026. The 3090 has not been manufactured for years, so its market is entirely used: typical UK prices in mid-2026 sit between £700 and £900 depending on cosmetic condition and warranty.
| Card | Typical UK price (2026) | VRAM | Cost per GB VRAM |
|---|---|---|---|
| RTX 5080 16GB (new) | £999 | 16 GB | £62.4 / GB |
| RTX 3090 24GB (used, average) | £800 | 24 GB | £33.3 / GB |
| RTX 3090 24GB (used, premium board) | £900 | 24 GB | £37.5 / GB |
| RTX 4090 24GB (new, where available) | £1,800 | 24 GB | £75.0 / GB |
| RTX 5090 32GB (new) | £1,999 | 32 GB | £62.5 / GB |
On a pure cost-per-GB basis the used 3090 wins by a comfortable margin, and that is exactly the niche it has occupied since 2022: VRAM at a discount in exchange for two generations of compute. For workloads where VRAM is binary (either the model fits or it doesn't), that is the right deal. For workloads where compute speed and FP8/FP4 paths matter more than headroom, the 5080 is better value despite the higher per-GB number.
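The cost-per-GB column is straightforward division, reproduced here so you can plug in whatever prices you actually see on the used market:

```python
# Price per GB of VRAM for a few of the cards in the table above.
cards = {
    "RTX 5080 16GB (new)": (999, 16),
    "RTX 3090 24GB (used)": (800, 24),
    "RTX 5090 32GB (new)": (1999, 32),
}
for name, (price_gbp, vram_gb) in cards.items():
    print(f"{name}: £{price_gbp / vram_gb:.1f} per GB")
```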
If you want to skip the buying decision entirely and rent rather than own, both cards are available on monthly hosting plans. See our RTX 4090 hosting cost breakdown for the equivalent maths on the next tier up, the cheapest GPU for AI inference for entry-tier options, and the cost per 1M tokens analysis for whether self-hosting beats OpenAI API pricing on your usage profile. Our 4090 vs 5090 decision guide applies the same compute-versus-VRAM analysis one tier up.
Verdict
The RTX 5080 wins on newer codecs, FP4 and FP8 throughput, better single-stream latency for 7B-class models, and lower power per workload. The RTX 3090 wins on raw VRAM, and that is decisive when you need to squeeze Llama 3.1 70B on at a 2-bit quant, run Flux.1 dev at FP16 without quality compromises, do longer-context inference with meaningful batching, or QLoRA-fine-tune anything 32B or larger.
The decision rule is short. If your workload fits comfortably in 16GB, take the 5080: it is faster, more efficient, and has a longer software runway ahead of it. If your workload doesn’t fit in 16GB, take the 3090: it is currently the cheapest path to 24GB of CUDA VRAM in the UK market, and that ceiling is the only thing that matters when the alternative is OOM.
Want to test either card before committing to hardware? Spin up a dedicated RTX 3090 or RTX 5080 instance with gigagpu — UK-hosted, monthly billing, no quotas, full root and CUDA access from day one.