The Case for 32GB Blackwell
The RTX 3090 with 24GB GDDR6X has been the gold standard for self-hosted AI. The RTX 5090 is its clear successor: 32GB GDDR7 at approximately 1,792 GB/s, Blackwell tensor cores with native FP4, and significantly more compute. On a dedicated GPU server, the 5090 does everything the 3090 does — faster — while adding an entire tier of models that the 3090 cannot fit.
This is the most straightforward GPU upgrade in the current lineup. You keep all your existing capabilities and gain new ones. The only question is whether the cost difference justifies it for your workload. For budget-conscious alternatives, see the RTX 3090 to RTX 5080 analysis.
Spec Comparison: 3090 vs 5090
| Specification | RTX 3090 | RTX 5090 | Improvement |
|---|---|---|---|
| VRAM | 24 GB GDDR6X | 32 GB GDDR7 | +33% |
| Bandwidth | 936 GB/s | ~1,792 GB/s | +91% |
| Architecture | Ampere | Blackwell | 2 generations newer |
| FP4 Tensor | No | Yes | New capability |
| Power | 350W | 575W | +64% power |
| Tensor TFLOPS | ~142 (FP16) | ~380+ (FP16) | ~2.7x |
The bandwidth nearly doubles. Because single-stream LLM token generation is memory-bandwidth-bound (each generated token streams the full weight set from VRAM), this translates almost directly into nearly 2x faster single-user inference at every model size.
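As a sanity check, a bandwidth-bound decode estimate is just memory bandwidth divided by the bytes read per token, which is roughly the weight footprint. A minimal sketch using the bandwidth figures from the table above; the 0.9 efficiency factor is an assumption, not a measured value:

```python
# Rough upper bound for single-stream decode speed: each generated token
# reads (approximately) every weight once, so tok/s ~= bandwidth / model bytes.

def est_tokens_per_sec(bandwidth_gbs: float, params_b: float,
                       bytes_per_param: float, efficiency: float = 0.9) -> float:
    """Estimate decode tok/s; `efficiency` is an assumed real-world factor."""
    model_gb = params_b * bytes_per_param          # weight footprint in GB
    return efficiency * bandwidth_gbs / model_gb   # tokens per second

# Llama 3 8B at FP16 (2 bytes/param) on both cards:
print(f"3090: ~{est_tokens_per_sec(936, 8, 2):.0f} tok/s")    # ~53 tok/s
print(f"5090: ~{est_tokens_per_sec(1792, 8, 2):.0f} tok/s")   # ~101 tok/s
```

Both estimates land in the same ballpark as the measured ~55 and ~115 tok/s in the workload table below.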
Performance Gains Across Workloads
| Workload | RTX 3090 | RTX 5090 | Speedup |
|---|---|---|---|
| Llama 3 8B FP16 (tok/s) | ~55 | ~115 | 2.1x |
| Llama 3 8B Q4 (tok/s) | ~82 | ~155 | 1.9x |
| Llama 2 13B FP16 (tok/s) | OOM | ~68 | Now possible |
| DeepSeek R1 14B FP16 (tok/s) | OOM | ~55 | Now possible |
| CodeLlama 34B Q4 (tok/s) | ~18 | ~38 | 2.1x |
| Mixtral 8x7B Q4 (tok/s) | OOM | ~38 | Now possible |
| SDXL 1024×1024 | ~5s | ~2.5s | 2x |
| Whisper Large v3 (RTF, lower is better) | ~0.07 | ~0.035 | 2x |
Every existing workload runs approximately twice as fast. New workloads that were impossible on 24GB — 13B FP16, Mixtral 8x7B, Qwen 14B FP16 — become available. Verify these numbers on the tokens-per-second benchmark.
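To verify on your own hardware, most local runtimes report token counts and timings directly. A minimal sketch against Ollama's `/api/generate` endpoint; the host, model name, and prompt are assumptions, while `eval_count` and `eval_duration` are the fields Ollama returns in a non-streamed response:

```python
import requests

# Measure decode tok/s from Ollama's own timing fields.
# Assumes a local Ollama server and a pulled model tagged "llama3:8b".
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b",
          "prompt": "Explain memory-bandwidth-bound inference in one paragraph.",
          "stream": False},
    timeout=300,
)
data = resp.json()
# eval_count = generated tokens, eval_duration = nanoseconds spent decoding
tok_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tok_per_sec:.1f} tok/s")
```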
New Capabilities at 32GB
The extra 8GB of VRAM opens several practical use cases (a back-of-envelope fit check follows the list):
- 13B-14B FP16 — run DeepSeek R1 14B, Qwen 2.5 14B, and Llama 2 13B without quantisation
- Mixtral 8x7B Q4 — the most capable open MoE model, now fits on one GPU
- 34B Q4 with long context — 12GB headroom for KV cache enables 8K+ context
- Multi-model stacks — run chat + code + embeddings simultaneously
- FP4 inference — Blackwell-native quantisation for better quality-at-speed
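Whether a model fits comes down to weights plus KV cache. A minimal sketch of that arithmetic; the architecture numbers (48 layers, 8 GQA KV heads, head_dim 128) are assumptions matching a CodeLlama-34B-style model, so swap in your model's config:

```python
# Back-of-envelope VRAM check: weights + KV cache must fit in 32 GB.

def weights_gb(params_b: float, bits_per_param: float) -> float:
    return params_b * bits_per_param / 8           # GB of weights

def kv_cache_gb(ctx_len: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, FP16 elements by default
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_len / 1e9

w = weights_gb(34, 4.5)   # Q4 ~= 4.5 bits/param once overhead is included
kv = kv_cache_gb(ctx_len=8192, layers=48, kv_heads=8, head_dim=128)
print(f"weights ~{w:.1f} GB + KV ~{kv:.1f} GB = {w + kv:.1f} GB of 32 GB")
```

At these assumed numbers a 34B Q4 model with 8K context uses roughly 21GB, which is why the 32GB card has comfortable headroom where the 24GB card does not.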
For model-specific deployment guides, see vLLM on the RTX 5090 and Ollama on the RTX 5090.
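For a flavour of what a single-card 32GB deployment looks like, here is a minimal vLLM sketch loading a 4-bit AWQ Mixtral build. The checkpoint name is an assumption (substitute any AWQ build you trust), and the memory settings are starting points rather than tuned values; see the linked guides for production configurations:

```python
from vllm import LLM, SamplingParams

# Minimal vLLM setup for a 4-bit Mixtral on a single 32 GB card.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # assumed AWQ checkpoint
    quantization="awq",
    max_model_len=8192,              # leave headroom for KV cache
    gpu_memory_utilization=0.92,     # fraction of VRAM vLLM may claim
)

out = llm.generate(
    ["Summarise the trade-offs of MoE models in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(out[0].outputs[0].text)
```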
Cost Difference and ROI Calculation
| Metric | RTX 3090 | RTX 5090 |
|---|---|---|
| Monthly hosting | ~$100-150/mo | ~$200-280/mo |
| Cost per 1M tokens (8B FP16) | ~$0.06 | ~$0.04 |
| Equivalent API cost at volume | ~$400/mo | ~$800/mo |
| Monthly savings vs API | ~$250-300 | ~$520-600 |
| Payback vs 3090 extra cost | — | ~1 month at volume |
The 5090 costs roughly $100-130 more per month than the 3090, but delivers 2x the throughput and opens new model tiers. At production volumes, the lower cost-per-token means the additional hosting cost pays for itself quickly. Use the LLM cost calculator for precise ROI with your specific workload and the GPU vs API comparison to see savings against cloud APIs.
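The payback arithmetic is simple enough to script. A sketch using the round figures from the table above; all inputs are this article's estimates, not measurements:

```python
# Payback on the 5090's extra hosting cost, using the table's midpoints.
cost_3090, cost_5090 = 125, 240            # $/mo hosting
api_equiv_3090, api_equiv_5090 = 400, 800  # $/mo of equivalent API usage

extra_hosting = cost_5090 - cost_3090      # $115/mo premium for the 5090
extra_savings = (api_equiv_5090 - cost_5090) - (api_equiv_3090 - cost_3090)
print(f"extra hosting: ${extra_hosting}/mo, extra net savings: ${extra_savings}/mo")
# ~$285/mo of added savings against ~$115/mo of added cost: the premium
# pays for itself within the first month at these volumes.
```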
Verdict: When to Upgrade
Upgrade if: you need 13B+ FP16 models, you want 2x throughput on existing workloads, you serve multiple concurrent users, or you plan to run Mixtral/MoE models. The 5090 is the most impactful single-GPU upgrade available.
Keep the 3090 if: 24GB covers all your model needs, your workload is light enough that current speed is sufficient, or the budget difference is prohibitive.
Browse the full GPU Comparisons section for more matchups. For the complete self-hosting guide, see how to self-host LLMs.