
RTX 3090 vs RTX 5090 for LLM Inference: Ampere vs Blackwell in 2026

A full head-to-head of the RTX 3090 24 GB and RTX 5090 32 GB for LLM inference: bandwidth, FP8, tokens per watt and price-performance.

The RTX 3090 has been the budget LLM workhorse for five years. The RTX 5090 arrived in 2025 with native FP8, nearly double the memory bandwidth and 8 GB more VRAM. The question is whether the 5090 earns its price premium when you are running Llama or Qwen in vLLM, or whether the 3090 is still the right pick. This piece compares the two end-to-end. To run either card on dedicated hardware, see our dedicated GPU hosting.


Spec delta

| Spec | RTX 3090 | RTX 5090 | Delta |
|------|----------|----------|-------|
| Architecture | Ampere (GA102) | Blackwell (GB202) | – |
| CUDA cores | 10,496 | 21,760 | +107% |
| VRAM | 24 GB GDDR6X | 32 GB GDDR7 | +33% |
| Memory bandwidth | 936 GB/s | 1,792 GB/s | +92% |
| TDP | 350 W | 575 W | +64% |
| FP16 TFLOPS | 71 | 209 | +195% |
| Native FP8 tensor | No | Yes (E4M3/E5M2) | – |
| NVENC | Gen 7 | Gen 9 | – |

Ampere vs Blackwell

For LLM inference the single biggest upgrade is FP8. On Ampere you are forced to choose between FP16 (accurate but slow) and INT8 (faster, but it needs calibration and can lose quality). Blackwell's native FP8 is effectively free: within 0.5% of FP16 perplexity on Llama 3.1 8B and 70B, at roughly 1.8x the throughput. Bandwidth is the other major win. LLM decode is memory-bound, so the jump from 936 GB/s to 1,792 GB/s sets a near-2x theoretical ceiling, and the measured speedups below get close to it.
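To see what FP8 changes in practice, here is a minimal vLLM sketch for a Blackwell card; the model choice, context length and memory settings are illustrative rather than our exact benchmark config:

```python
# Minimal FP8 serving sketch for a Blackwell card (RTX 5090), vLLM 0.8+.
# Model choice, context length and memory settings are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",          # on-the-fly FP8 (E4M3) quantization
    max_model_len=8192,          # keep the KV cache modest on a 32 GB card
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain the difference between FP8 and INT8 quantization in one paragraph."],
    SamplingParams(max_tokens=200, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```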

Throughput by model (vLLM 0.8, batch 16)

| Model | Precision | RTX 3090 tok/s | RTX 5090 tok/s | Speedup |
|-------|-----------|----------------|----------------|---------|
| Llama 3.1 8B | FP16 | 1,850 | 3,420 | 1.85x |
| Llama 3.1 8B | FP8 / INT8 | 2,410 | 5,980 | 2.48x |
| Qwen2.5 14B | FP16 | 980 | 1,920 | 1.96x |
| Mistral 7B | FP8 / INT8 | 2,620 | 6,310 | 2.41x |
| Qwen2.5 32B | INT4 | 420 | 880 | 2.10x |
| Llama 3.1 70B | INT4 | OOM at batch 16 | 140 (batch 4) | – |

(FP8 / INT8 rows run INT8 on the 3090 and FP8 on the 5090, since Ampere has no native FP8.)

Llama 70B is tight on a 3090 even at INT4; a 5090 fits at INT4 with modest context. For similar numbers on the smaller 5060 Ti, see the FP8 Llama deployment guide and 5060 Ti vs 3090.
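If you want to sanity-check numbers like these on your own card, the rough recipe is batched offline generation in vLLM, dividing the generated token count by wall-clock time; a sketch, with batch size, prompt and output length chosen for illustration rather than matching our harness:

```python
# Rough throughput check: generate a fixed batch and divide tokens by wall time.
# Batch size, prompt and output length are illustrative, not the exact harness.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

prompts = ["Summarise the history of the GPU in two paragraphs."] * 16  # batch 16
params = SamplingParams(max_tokens=512, temperature=0.0, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")
```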

Tokens per watt

Power is a real cost in the UK. At 575 W the 5090 draws 64% more power than the 3090's 350 W, but it gets disproportionately more work done per watt.

| Model | RTX 3090 tok/s per W | RTX 5090 tok/s per W | Delta |
|-------|----------------------|----------------------|-------|
| Llama 3.1 8B FP16 | 5.3 | 5.9 | +11% |
| Llama 3.1 8B FP8 | 6.9 | 10.4 | +51% |
| Qwen2.5 14B FP16 | 2.8 | 3.3 | +18% |
| Mistral 7B FP8 | 7.5 | 11.0 | +47% |
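The per-watt figures above line up with throughput divided by board TDP (350 W and 575 W) rather than measured wall draw; a quick check against the FP8 rows:

```python
# Tokens per watt as throughput / board TDP (not measured wall power).
TDP_W = {"RTX 3090": 350, "RTX 5090": 575}

tok_per_s = {
    "Llama 3.1 8B FP8": {"RTX 3090": 2410, "RTX 5090": 5980},
    "Mistral 7B FP8": {"RTX 3090": 2620, "RTX 5090": 6310},
}

for model, rates in tok_per_s.items():
    for gpu, rate in rates.items():
        print(f"{model} on {gpu}: {rate / TDP_W[gpu]:.1f} tok/s per watt")
```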

Price per performance

A used RTX 3090 is around £700-£780 on the UK second-hand market. A new RTX 5090 is £2,000-£2,300. On dedicated hosting the gap is smaller: a 3090 server runs ~£350/mo, a 5090 server ~£800-£900/mo. Normalise to tokens per pound and the 5090 wins in every FP8 workload and breaks even in FP16.

Which to pick

  • Choose the 3090 if: you only serve in FP16, you are budget-bound, your workload is dominated by 7B models, or you already own the card.
  • Choose the 5090 if: you want FP8, you need 32 GB for 14B-32B models at full precision, energy efficiency matters, or you plan to run 70B at INT4.

Deploy on a 5090 or 3090 server

Ampere workhorse or Blackwell flagship, your call. UK dedicated hosting.

Browse GPU Servers

See also: 5060 Ti vs 3090, Upgrading from 5060 Ti to 5090, FP8 Llama deployment, Tokens per watt, ROCm vs CUDA.
