The RTX 3090 has been the budget LLM workhorse for five years. The RTX 5090 arrived in 2025 with native FP8, nearly double the memory bandwidth and 8 GB more VRAM. The question is whether the 5090 earns its price premium when you are running Llama or Qwen in vLLM, or whether the 3090 is still the right pick. This piece compares the two end-to-end. For either card on dedicated hardware, see our dedicated GPU hosting.
Contents
- Spec delta
- Ampere vs Blackwell
- Throughput by model
- Tokens per watt
- Price per performance
- Which to pick
Spec delta
| Spec | RTX 3090 | RTX 5090 | Delta |
|---|---|---|---|
| Architecture | Ampere (GA102) | Blackwell (GB202) | – |
| CUDA cores | 10,496 | 21,760 | +107% |
| VRAM | 24 GB GDDR6X | 32 GB GDDR7 | +33% |
| Memory bandwidth | 936 GB/s | 1,792 GB/s | +92% |
| TDP | 350 W | 575 W | +64% |
| FP16 TFLOPS | 71 | 209 | +195% |
| Native FP8 tensor | No | Yes (E4M3/E5M2) | – |
| NVENC | Gen 7 | Gen 9 | – |
Ampere vs Blackwell
For LLM inference the single biggest upgrade is FP8. On Ampere you are forced to choose between FP16 (accurate, slow) and INT8 (faster, but it needs calibration and can lose quality). Blackwell’s native FP8 is effectively free: within 0.5% of FP16 perplexity on Llama 3.1 8B and 70B, at roughly 1.8x the throughput. Bandwidth is the other major win. LLM decode is memory-bound, so the bandwidth ratio (1,792 vs 936 GB/s, about 1.9x) sets the theoretical decode speedup ceiling at equal precision, and the measured FP16 numbers below come close to it.
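A back-of-envelope sketch of that ceiling (nothing here is measured; it simply divides bandwidth by bytes streamed per weight):

```python
# Decode is memory-bound: each generated token streams the model's weight
# footprint from VRAM, so relative tok/s scales with bandwidth / bytes per weight.
def relative_decode_speed(bandwidth_gbs: float, bytes_per_weight: float) -> float:
    """Arbitrary units; only ratios between calls are meaningful."""
    return bandwidth_gbs / bytes_per_weight

base = relative_decode_speed(936, 2.0)    # RTX 3090, FP16 weights
fp16 = relative_decode_speed(1792, 2.0)   # RTX 5090, FP16 weights
fp8 = relative_decode_speed(1792, 1.0)    # RTX 5090, FP8 weights

print(f"5090 FP16 vs 3090 FP16: {fp16 / base:.2f}x")  # 1.91x ceiling
print(f"5090 FP8  vs 3090 FP16: {fp8 / base:.2f}x")   # 3.83x ceiling
```

The measured FP16 speedups in the table below (1.85-1.96x) sit right at the 1.91x bandwidth ceiling; the FP8 rows land below the 3.83x ideal because attention, KV-cache traffic and kernel overheads are not halved by weight quantisation.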
Throughput by model (vLLM 0.8, batch 16)
| Model | Precision | RTX 3090 tok/s | RTX 5090 tok/s | Speedup |
|---|---|---|---|---|
| Llama 3.1 8B | FP16 | 1,850 | 3,420 | 1.85x |
| Llama 3.1 8B | INT8 (3090) / FP8 (5090) | 2,410 | 5,980 | 2.48x |
| Qwen2.5 14B | FP16 | 980 | 1,920 | 1.96x |
| Mistral 7B | INT8 (3090) / FP8 (5090) | 2,620 | 6,310 | 2.41x |
| Qwen2.5 32B | INT4 | 420 | 880 | 2.10x |
| Llama 3.1 70B | INT4 | OOM at batch 16 | 140 (batch 4) | – |
Llama 70B is tight on a 3090 even at INT4; a 5090 fits at INT4 with modest context. For similar numbers on the smaller 5060 Ti, see the FP8 Llama deployment guide and 5060 Ti vs 3090.
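The "modest context" caveat for 70B is a KV-cache budget question. A rough sketch of the per-token KV-cache cost, using Llama 3.1 70B's published shape (80 layers, 8 KV heads under GQA, head dim 128) and assuming an FP16 cache:

```python
# KV cache bytes per token = 2 (K and V) x layers x kv_heads x head_dim x dtype bytes
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_tok // 1024, "KiB per token")  # 320 KiB

# Four sequences of 4K tokens (the batch-4 case in the table above):
total_gib = 4 * 4096 * per_tok / 1024**3
print(round(total_gib, 1), "GiB of KV cache")  # 5.0 GiB
```

Five extra gigabytes on top of the quantised weights is why batch and context have to stay small on a 32 GB card; an FP8 KV cache (`dtype_bytes=1`) halves the figure.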
Tokens per watt
Power is a real cost in the UK. At 575 W the 5090 draws 64% more power than the 3090, but its throughput rises faster than its power draw, so tokens per watt still improve, dramatically so under FP8.
| Model | RTX 3090 tok/s/W | RTX 5090 tok/s/W | Delta |
|---|---|---|---|
| Llama 3.1 8B FP16 | 5.3 | 5.9 | +11% |
| Llama 3.1 8B FP8 | 6.9 | 10.4 | +51% |
| Qwen2.5 14B FP16 | 2.8 | 3.3 | +18% |
| Mistral 7B FP8 | 7.5 | 11.0 | +47% |
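The figures above are just the throughput table divided by TDP. A quick recomputation (this assumes each card draws its full board power under load, which real inference rarely sustains; treat it as a conservative comparison):

```python
# tok/s per watt = measured throughput / board power (TDP used as a proxy for draw)
TDP_W = {"3090": 350, "5090": 575}
THROUGHPUT = {  # tok/s from the throughput table above
    "Llama 3.1 8B FP16": {"3090": 1850, "5090": 3420},
    "Llama 3.1 8B FP8":  {"3090": 2410, "5090": 5980},
    "Qwen2.5 14B FP16":  {"3090": 980,  "5090": 1920},
    "Mistral 7B FP8":    {"3090": 2620, "5090": 6310},
}
for model, toks in THROUGHPUT.items():
    row = {gpu: round(toks[gpu] / TDP_W[gpu], 1) for gpu in TDP_W}
    print(model, row)
```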
Price per performance
A used RTX 3090 is around £700-£780 on the UK second-hand market. A new RTX 5090 is £2,000-£2,300. On dedicated hosting the gap is smaller: a 3090 server runs ~£350/mo, a 5090 server ~£800-£900/mo. Normalised to tokens per pound of monthly hosting, the two cards come out roughly level in FP8 workloads, while the 3090 stays ahead in FP16; the 5090's case rests on per-card throughput, the extra 8 GB, and energy efficiency rather than raw price-performance.
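One way to make "tokens per pound" concrete: tok/s per pound of monthly hosting, using midpoints of the quoted hosting ranges (£350 and £850 here are illustrative assumptions, not list prices):

```python
# tok/s per pound of monthly hosting; higher is better value
COST_GBP_MONTH = {"3090": 350, "5090": 850}  # midpoints of the quoted ranges
THROUGHPUT = {  # tok/s from the throughput table
    "Llama 3.1 8B FP16": {"3090": 1850, "5090": 3420},
    "Llama 3.1 8B FP8":  {"3090": 2410, "5090": 5980},
    "Mistral 7B FP8":    {"3090": 2620, "5090": 6310},
}
for model, toks in THROUGHPUT.items():
    print(model, {g: round(toks[g] / COST_GBP_MONTH[g], 2) for g in COST_GBP_MONTH})
```

On these numbers the FP8 rows come out roughly level (about 6.9-7.5 tok/s per £/mo on either card), while the FP16 row favours the 3090; buying outright rather than renting shifts the balance further toward the 3090 at today's prices.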
Which to pick
- Choose 3090 if: FP16 serving only, budget-bound, workload is dominated by 7B models, you already own the card.
- Choose 5090 if: you want FP8, you need 32 GB for 14B-32B models at full precision, energy efficiency matters, you plan to run 70B at INT4.
Deploy on a 5090 or 3090 server
Ampere workhorse or Blackwell flagship, your call. UK dedicated hosting.
Browse GPU Servers

See also: 5060 Ti vs 3090, Upgrading from 5060 Ti to 5090, FP8 Llama deployment, Tokens per watt, ROCm vs CUDA.