The RTX 3090 launched in September 2020 with 24 GB of GDDR6X — the first consumer card with enough VRAM to run a 13B model in 8-bit precision. The RTX 5090 launched in early 2025 with 32 GB of GDDR7, hardware FP4, and roughly 3× the FP16 throughput. Five years between them. This is the comparison guide we use when customers are choosing between the cheapest 24 GB card we host and the flagship Blackwell GPU.
The RTX 5090 is ~1.6× faster than the 3090 on FP16 LLM inference, ~2× faster on FP8/FP4, and has 33% more VRAM (32 GB vs 24 GB). The RTX 3090 is less than half the price (£159/mo vs £399/mo). For most production workloads the 5090 wins on cost-per-token; for budget-constrained deployments the 3090 is still the cheapest 24 GB GPU we rent.
Specs side-by-side
| Spec | RTX 3090 | RTX 5090 | Delta |
|---|---|---|---|
| Architecture | Ampere (GA102) | Blackwell (GB202) | 5 gens |
| VRAM | 24 GB GDDR6X | 32 GB GDDR7 | +33% |
| Memory bandwidth | 936 GB/s | 1,792 GB/s | +91% |
| CUDA cores | 10,496 | 21,760 | +107% |
| Tensor cores | 328 (3rd gen) | 680 (5th gen) | 2.07× |
| FP16 compute | ~36 TFLOPS | ~105 TFLOPS | 2.92× |
| FP8 throughput | n/a (software) | ~838 TOPS | ∞ |
| FP4 throughput | n/a | ~1,676 TOPS | ∞ |
| TDP | 350 W | 575 W | +64% |
| Launch year | 2020 | 2025 | +5 years |
| GigaGPU monthly | £159 | £399 | 2.5× |
VRAM: 24 GB vs 32 GB matters more than the number suggests
Headline: 33% more VRAM. In practice, the 8 GB delta is often the difference between "fits" and "doesn’t fit" for the models people actually run today:
- Mistral 7B FP16 with 32K context — 14 GB weights + 4 GB KV cache + 2 GB activations + a second model alongside. 24 GB tight, 32 GB comfortable.
- Llama 3.2 11B Vision — ~22 GB total in FP16. Fits the 3090, fits the 5090, but with much less headroom on the 3090.
- Qwen 2.5 14B FP16 — 28 GB. Doesn’t fit 24 GB. Fits 32 GB.
- FLUX.1 dev FP16 — 24 GB peak. Tight on 3090; comfortable on 5090.
- Mixtral 8x7B INT4 — 26 GB. Doesn’t fit 3090. Fits 5090.
The 8 GB delta puts the 5090 over the threshold for several genuinely common production deployments. For 7B chatbots either card works fine.
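The fit estimates above reduce to weights plus KV cache. Here is a rough calculator sketch — the default arguments assume Mistral 7B's published config (32 layers, 8 KV heads via GQA, head dim 128); it deliberately ignores activations and framework overhead, which add roughly 2–3 GB in practice:

```python
def vram_gib(params_b: float, bytes_per_param: int = 2, n_layers: int = 32,
             n_kv_heads: int = 8, head_dim: int = 128, ctx_len: int = 32768,
             batch: int = 1, kv_bytes: int = 2) -> float:
    """Rough VRAM estimate in GiB: model weights + KV cache.

    Ignores activations, CUDA context, and framework overhead.
    """
    weights = params_b * 1e9 * bytes_per_param
    # One K and one V tensor per layer, per KV head, per cached token
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * batch * kv_bytes
    return (weights + kv_cache) / 2**30

# Mistral 7B (7.24B params) in FP16 at 32K context:
print(f"{vram_gib(7.24):.1f} GiB")  # ~17.5 GiB before activations/overhead
```

The FP16 defaults reproduce the "14 GB weights + 4 GB KV cache" split quoted above; lowering `kv_bytes` to 1 shows why FP8 KV caches buy so much context headroom.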
Compute: 5 generations of architectural improvement
The 5090’s tensor cores are 5th gen (Blackwell). The 3090’s are 3rd gen (Ampere). Concretely:
- FP16 dense compute is ~3× higher
- FP8 hardware path is new — Ampere has no FP8 units, so FP8 must be emulated in software, roughly 5× slower
- FP4 (NVFP4 / MX-FP4) is exclusive to Blackwell
- Memory bandwidth is ~2× higher (GDDR7 vs GDDR6X)
For real workloads, the FP8 path is a more important difference than the raw FP16 numbers. Production inference is shifting toward FP8 because the quality regression is typically <1% and the throughput jump is real. The 5090 does FP8 in hardware; the 3090 cannot.
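To make the precision trade-off concrete, here is a toy round-to-nearest FP8 E4M3 quantiser in pure Python. This is a sketch only: it handles normal values, ignores subnormals and NaN encoding, and omits the per-tensor scaling that real inference stacks apply before casting:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (normals only; no scaling)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), 448.0)       # E4M3 max finite value is 448
    _, e = math.frexp(mag)         # mag lies in the binade [2**(e-1), 2**e)
    step = 2.0 ** (e - 4)          # 3 mantissa bits -> 8 steps per binade
    return sign * round(mag / step) * step

# Worst-case per-value rounding error is one part in 16 (half a ULP of 2**-3);
# with per-tensor scaling on top, end-to-end model quality typically moves <1%.
print(quantize_e4m3(0.3))  # 0.3125, the nearest representable E4M3 value
```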
Real benchmarks — LLM, image, speech
Benchmark setup: vLLM 0.6.3, 50 concurrent Locust users, Ubuntu 22.04, NVIDIA driver 555.x (5090) / 535.x (3090).
| Workload | RTX 3090 | RTX 5090 | Speedup |
|---|---|---|---|
| Mistral 7B FP16 — aggregate tok/s | 720 | 1,180 | 1.64× |
| Mistral 7B FP8 — aggregate tok/s | n/a | 1,920 | ∞ (no FP8) |
| Llama 3 8B FP16 — aggregate tok/s | 680 | 1,140 | 1.68× |
| Llama 3 8B FP8 — aggregate tok/s | n/a | 1,820 | ∞ |
| Qwen 2.5 14B FP16 — aggregate tok/s | OOM | 720 | ∞ |
| Qwen 2.5 14B INT4 — aggregate tok/s | 410 | 880 | 2.15× |
| SDXL 1024² — seconds/image | 14 s | 6 s | 2.33× |
| FLUX.1 dev 1024² FP16 — s/image | 14 s | 8 s | 1.75× |
| FLUX.1 dev 1024² FP8 — s/image | n/a | 6 s | ∞ |
| Whisper Large-v3 — RTF | 6× | 9× | 1.5× |
| SDXL Turbo 1024² — s/image | 1.1 s | 0.6 s | 1.83× |
Cost-per-token math
Mistral 7B FP16, 60% utilisation, 30-day month:
- RTX 3090: 720 tok/s × 60% × 30 days × 86400 s = ~1.12B tokens/month at £159/mo = £0.14 / 1M tokens
- RTX 5090 FP16: 1,180 tok/s × 60% × 30 days × 86400 s = ~1.83B tokens/month at £399/mo = £0.22 / 1M tokens
- RTX 5090 FP8: 1,920 tok/s × 60% × 30 days × 86400 s = ~2.99B tokens/month at £399/mo = £0.13 / 1M tokens
The 3090 is actually cheaper per token at FP16 because the price ratio (2.5×) is greater than the throughput ratio (1.64×). But the 5090 wins decisively at FP8 — and FP8 is essentially free quality-wise.
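The arithmetic above reduces to a single formula, and a small helper makes it easy to rerun with your own utilisation and pricing assumptions:

```python
def cost_per_million_tokens(tok_per_s: float, utilisation: float,
                            monthly_price_gbp: float, days: int = 30) -> float:
    """£ per 1M tokens for a flat-rate GPU at a given average utilisation."""
    tokens_per_month = tok_per_s * utilisation * days * 86_400
    return monthly_price_gbp / (tokens_per_month / 1e6)

print(f"3090 FP16: £{cost_per_million_tokens(720, 0.6, 159):.2f}/1M")   # £0.14
print(f"5090 FP16: £{cost_per_million_tokens(1180, 0.6, 399):.2f}/1M")  # £0.22
print(f"5090 FP8:  £{cost_per_million_tokens(1920, 0.6, 399):.2f}/1M")  # £0.13
```

Utilisation dominates this calculation: at 20% utilisation every figure triples, which is why flat-rate GPUs only beat per-token APIs when you keep them busy.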
When the 3090 is still the right pick
- Single-stream chatbot, low concurrency (<10 simultaneous users) — 3090 throughput is plenty.
- Whisper-only or embedding-only deployments — both workloads are lightweight; the 5090 is over-spec'd.
- Hard cost cap — £159/mo vs £399/mo is a real difference for hobby projects, internal tools, MVPs.
- Older models that don’t have FP8 ports — Code Llama 13B, Llama 2, etc.
- Fine-tuning where 24 GB is enough — most LoRA workloads on 7B-13B models.
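The checklist above condenses into a toy decision rule. This is illustrative only — the thresholds are this article's guidance, not hard limits, and the model names are just return values:

```python
def pick_gpu(vram_needed_gb: float, stack_supports_fp8: bool,
             monthly_budget_gbp: float) -> str:
    """Toy decision rule summarising the guidance in this article."""
    if vram_needed_gb > 32:
        return "neither (multi-GPU or RTX 6000 Pro 96 GB)"
    if vram_needed_gb > 24:
        return "RTX 5090"   # only option with 32 GB
    if monthly_budget_gbp < 399:
        return "RTX 3090"   # hard cost cap
    if stack_supports_fp8:
        return "RTX 5090"   # best £/token via hardware FP8
    return "RTX 3090"       # FP16-only and fits in 24 GB: cheaper per token

print(pick_gpu(28, True, 500))   # RTX 5090
print(pick_gpu(14, False, 200))  # RTX 3090
```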
Verdict
For a new production deployment in 2026, the RTX 5090 is the right card — better cost-per-token at FP8, more VRAM headroom, and future-proof on quantisation. The RTX 3090 remains the cheapest 24 GB GPU we host, and is the right pick for budget-constrained, single-stream, FP16-only workloads.
Bottom line
Picking between them comes down to FP8: if you can run FP8, take the 5090. If your stack is locked to FP16 and you don’t need 32 GB, the 3090 is genuinely cheaper. For anything bigger than 14B FP16 or 32B INT4, neither is enough — see multi-GPU clusters or the RTX 6000 Pro 96 GB.