
RTX 3090 vs RTX 5090 for LLM Inference: Ampere vs Blackwell in 2026

A full head-to-head of the RTX 3090 24 GB and RTX 5090 32 GB for LLM inference: bandwidth, FP8, tokens per watt and price-performance.

The RTX 3090 has been the budget LLM workhorse for five years. The RTX 5090 arrived in 2025 with native FP8, nearly double the memory bandwidth and 8 GB more VRAM. The question is whether the 5090 earns its price premium when you are running Llama or Qwen in vLLM, or whether the 3090 is still the right pick. This piece compares the two end-to-end. To run either card on dedicated hardware, see our dedicated GPU hosting.


Spec delta

| Spec | RTX 3090 | RTX 5090 | Delta |
|------|----------|----------|-------|
| Architecture | Ampere (GA102) | Blackwell (GB202) | – |
| CUDA cores | 10,496 | 21,760 | +107% |
| VRAM | 24 GB GDDR6X | 32 GB GDDR7 | +33% |
| Memory bandwidth | 936 GB/s | 1,792 GB/s | +92% |
| TDP | 350 W | 575 W | +64% |
| FP16 TFLOPS | 71 | 209 | +195% |
| Native FP8 tensor | No | Yes (E4M3/E5M2) | – |
| NVENC | Gen 7 | Gen 9 | – |

Ampere vs Blackwell

For LLM inference the single biggest upgrade is FP8. On Ampere you are forced to choose between FP16 (accurate but slow) and INT8 (faster, but it needs calibration and can lose quality). Blackwell's native FP8 is effectively free: within 0.5% of FP16 perplexity on Llama 3.1 8B and 70B, at roughly 1.8x the throughput. Bandwidth is the other major win. LLM decode is memory-bound, so the jump from 936 GB/s to 1,792 GB/s sets a near-2x theoretical ceiling, and the measured speedups below get close to it.
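To see what FP8 changes in practice, here is a minimal vLLM sketch for a Blackwell card; the model choice, context length and memory settings are illustrative rather than our exact benchmark config:

```python
# Minimal FP8 serving sketch for a Blackwell card (RTX 5090), vLLM 0.8+.
# Model choice, context length and memory settings are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",          # on-the-fly FP8 (E4M3) quantization
    max_model_len=8192,          # keep the KV cache modest on a 32 GB card
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain the difference between FP8 and INT8 quantization in one paragraph."],
    SamplingParams(max_tokens=200, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```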

Throughput by model (vLLM 0.8, batch 16)

| Model | Precision | RTX 3090 tok/s | RTX 5090 tok/s | Speedup |
|-------|-----------|----------------|----------------|---------|
| Llama 3.1 8B | FP16 | 1,850 | 3,420 | 1.85x |
| Llama 3.1 8B | FP8 / INT8 | 2,410 | 5,980 | 2.48x |
| Qwen2.5 14B | FP16 | 980 | 1,920 | 1.96x |
| Mistral 7B | FP8 / INT8 | 2,620 | 6,310 | 2.41x |
| Qwen2.5 32B | INT4 | 420 | 880 | 2.10x |
| Llama 3.1 70B | INT4 | OOM at batch 16 | 140 (batch 4) | – |

(FP8 / INT8 rows run INT8 on the 3090 and FP8 on the 5090, since Ampere has no native FP8.)

Llama 70B is tight on a 3090 even at INT4; a 5090 fits at INT4 with modest context. For similar numbers on the smaller 5060 Ti, see the FP8 Llama deployment guide and 5060 Ti vs 3090.
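If you want to sanity-check numbers like these on your own card, the rough recipe is batched offline generation in vLLM, dividing the generated token count by wall-clock time; a sketch, with batch size, prompt and output length chosen for illustration rather than matching our harness:

```python
# Rough throughput check: generate a fixed batch and divide tokens by wall time.
# Batch size, prompt and output length are illustrative, not the exact harness.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

prompts = ["Summarise the history of the GPU in two paragraphs."] * 16  # batch 16
params = SamplingParams(max_tokens=512, temperature=0.0, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")
```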

Tokens per watt

Power is a real cost in the UK. At 575 W the 5090 draws 64% more power than the 3090's 350 W, but it gets disproportionately more work done per watt.

| Model | RTX 3090 tok/s per W | RTX 5090 tok/s per W | Delta |
|-------|----------------------|----------------------|-------|
| Llama 3.1 8B FP16 | 5.3 | 5.9 | +11% |
| Llama 3.1 8B FP8 | 6.9 | 10.4 | +51% |
| Qwen2.5 14B FP16 | 2.8 | 3.3 | +18% |
| Mistral 7B FP8 | 7.5 | 11.0 | +47% |
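The per-watt figures above line up with throughput divided by board TDP (350 W and 575 W) rather than measured wall draw; a quick check against the FP8 rows:

```python
# Tokens per watt as throughput / board TDP (not measured wall power).
TDP_W = {"RTX 3090": 350, "RTX 5090": 575}

tok_per_s = {
    "Llama 3.1 8B FP8": {"RTX 3090": 2410, "RTX 5090": 5980},
    "Mistral 7B FP8": {"RTX 3090": 2620, "RTX 5090": 6310},
}

for model, rates in tok_per_s.items():
    for gpu, rate in rates.items():
        print(f"{model} on {gpu}: {rate / TDP_W[gpu]:.1f} tok/s per watt")
```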

Price per performance

A used RTX 3090 is around £700-£780 on the UK second-hand market. A new RTX 5090 is £2,000-£2,300. On dedicated hosting the gap is smaller: a 3090 server runs ~£350/mo, a 5090 server ~£800-£900/mo. Normalise to tokens per pound and the 5090 wins in every FP8 workload and breaks even in FP16.

Which to pick

  • Choose the 3090 if: you only serve in FP16, you are budget-bound, your workload is dominated by 7B models, or you already own the card.
  • Choose the 5090 if: you want FP8, you need 32 GB for 14B-32B models at full precision, energy efficiency matters, or you plan to run 70B at INT4.

Deploy on a 5090 or 3090 server

Ampere workhorse or Blackwell flagship, your call. UK dedicated hosting.

Browse GPU Servers

See also: 5060 Ti vs 3090, Upgrading from 5060 Ti to 5090, FP8 Llama deployment, Tokens per watt, ROCm vs CUDA.
