RTX 3090 vs RTX 5090: Spec Overview
If you are running open-source LLM inference on a dedicated server, the RTX 3090 and RTX 5090 are two of the most common choices. The 3090 offers 24 GB of VRAM while the 5090 steps up to 32 GB, and the generational leap in architecture means real-world throughput is very different. Before diving into tokens-per-second benchmarks, here is a quick spec comparison.
| Spec | RTX 3090 | RTX 5090 |
|---|---|---|
| Architecture | Ampere (GA102) | Blackwell (GB202) |
| VRAM | 24 GB GDDR6X | 32 GB GDDR7 |
| Memory Bandwidth | 936 GB/s | 1,792 GB/s |
| FP16 Tensor TFLOPS (dense) | 142 | 419 |
| TDP | 350 W | 575 W |
| CUDA Cores | 10,496 | 21,760 |
| Typical Server Cost | ~$0.45/hr | ~$1.10/hr |
On paper the 5090 delivers far higher tensor throughput, but it costs about 2.4x as much to rent — and because single-stream LLM decoding is bound by memory bandwidth rather than raw compute, the real-world gap is smaller than the TFLOPS figures suggest. That price-to-performance ratio is the central question for anyone comparing GPU options for inference workloads.
LLM Inference Benchmarks (Tokens/sec)
We tested both GPUs using vLLM with continuous batching on a range of popular open-source models. All runs used FP16 precision, except that models marked GPTQ or AWQ ran with 4-bit quantised weights, at a batch size of 1 (single-user scenario) and a batch size of 8 (concurrent users). The 70B model does not fit on a single card even at 4-bit, so those runs used tensor parallelism across two GPUs.
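As a rough illustration of the setup (a sketch, not the exact harness used for these numbers), a single-user throughput run with vLLM's offline API looks like this; the model name, prompt, and sampling settings are placeholder assumptions, and it requires a CUDA GPU with the weights available:

```python
# Sketch of a single-user (bs=1) decode-throughput measurement with vLLM's
# offline API. Model name and sampling settings are assumptions.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")
params = SamplingParams(temperature=0.0, max_tokens=512)

start = time.perf_counter()
outputs = llm.generate(["Summarise the history of GPU computing."], params)
elapsed = time.perf_counter() - start

# Count only generated tokens, not the prompt.
generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated / elapsed:.1f} tok/s")
```

A real benchmark would average over many prompts and warm up the engine first; a single short run like this mostly measures startup noise.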
| Model | Params | RTX 3090 (tok/s, bs=1) | RTX 5090 (tok/s, bs=1) | 5090 Speedup |
|---|---|---|---|---|
| Llama 3 8B | 8B | 62 | 118 | 1.90x |
| Mistral 7B v0.3 | 7B | 68 | 127 | 1.87x |
| Qwen 2.5 14B (GPTQ-4bit) | 14B | 38 | 74 | 1.95x |
| DeepSeek-R1 8B | 8B | 59 | 112 | 1.90x |
| Phi-3 Mini 3.8B | 3.8B | 105 | 198 | 1.89x |
| Llama 3 70B (AWQ-4bit) | 70B | 11 | 22 | 2.00x |
Batched Throughput (8 Concurrent Users)
| Model | RTX 3090 (tok/s total) | RTX 5090 (tok/s total) | 5090 Speedup |
|---|---|---|---|
| Llama 3 8B | 185 | 390 | 2.11x |
| Mistral 7B v0.3 | 198 | 415 | 2.10x |
| DeepSeek-R1 8B | 172 | 362 | 2.10x |
| Phi-3 Mini 3.8B | 310 | 640 | 2.06x |
In batched scenarios the 5090 pulls further ahead, hitting roughly 2.1x aggregate throughput. If you are comparing these cards for production inference, our cost per 1M tokens analysis breaks down how this translates to savings versus API providers.
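Aggregate throughput hides the per-user experience: at bs=8 each user sees only a fraction of the total rate. A quick check with the Llama 3 8B figures from the two tables above:

```python
# Per-user decode rate under batching vs. single-user, using the
# RTX 5090 Llama 3 8B figures from the tables above.
bs1_rate = 118     # tok/s at batch size 1
bs8_total = 390    # aggregate tok/s at batch size 8

per_user = bs8_total / 8
print(f"per-user at bs=8: {per_user:.1f} tok/s")        # 48.8 tok/s
print(f"slowdown vs bs=1: {bs1_rate / per_user:.1f}x")  # 2.4x
```

Each of the eight users still gets a usable ~49 tok/s, which is why batching wins on cost per token even though individual latency rises.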
Cost per Million Tokens Comparison
Raw speed is only half the story. What matters for a production workload is cost per million tokens generated. We used hourly server pricing from GigaGPU dedicated GPU hosting to calculate the numbers below.
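The conversion itself is simple: cost per million tokens is the hourly rate divided by tokens generated per hour. A minimal sketch, using the hourly rates from the spec table and the bs=1 throughput figures above:

```python
def cost_per_million(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1M generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Llama 3 8B at batch size 1, rates from the spec table above
print(f"RTX 3090: ${cost_per_million(0.45, 62):.2f}")   # $2.02
print(f"RTX 5090: ${cost_per_million(1.10, 118):.2f}")  # $2.59
```

The formula assumes the server is fully utilised; idle hours still bill, so real-world cost per token is higher at low utilisation.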
| Model | RTX 3090 $/1M tokens | RTX 5090 $/1M tokens | Better Value |
|---|---|---|---|
| Llama 3 8B (bs=1) | $2.02 | $2.59 | RTX 3090 |
| Llama 3 8B (bs=8) | $0.68 | $0.78 | RTX 3090 |
| Mistral 7B v0.3 (bs=1) | $1.84 | $2.41 | RTX 3090 |
| DeepSeek-R1 8B (bs=1) | $2.12 | $2.73 | RTX 3090 |
| Llama 3 70B AWQ (bs=1) | $11.36 | $13.89 | RTX 3090 |
A consistent pattern emerges: the RTX 3090 delivers a lower cost per token in nearly every scenario. The 5090 is faster in absolute terms, but its higher rental price cancels out most of the throughput advantage. Check our cost-per-million-tokens calculator to model your own workload.
VRAM Limits and Model Compatibility
The 3090 tops out at 24 GB of VRAM, while the 5090's 32 GB adds headroom for longer contexts and mid-sized FP16 models. The key thresholds:
- 7-8B FP16 — fits comfortably (~14-16 GB used)
- 13-14B FP16 — ~26-28 GB of weights alone, which overflows the 3090's 24 GB but fits in the 5090's 32 GB with modest context
- 14B GPTQ-4bit — fits well (~9 GB)
- 70B AWQ-4bit — requires ~38 GB, so it needs tensor parallelism across two cards
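These thresholds follow from a simple rule of thumb: weight memory is parameter count times bytes per parameter, plus roughly 10-20% on top for KV cache and activations (that overhead figure is an approximation, not a measured value):

```python
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), excluding KV cache."""
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

print(weight_gb(8, 16))   # Llama 3 8B, FP16   -> 16.0 GB
print(weight_gb(14, 4))   # Qwen 2.5 14B, 4-bit -> 7.0 GB
print(weight_gb(70, 4))   # Llama 3 70B, 4-bit  -> 35.0 GB, ~38 GB with overhead
```

KV cache grows with context length and concurrency, which is why a model whose weights "fit" can still OOM under long contexts or many simultaneous requests.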
If your workload demands larger models at full precision, consider the RTX 5090's 32 GB of VRAM or pair two 3090s via NVLink (the 3090 was the last GeForce card to support it).
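Splitting a model across two cards in vLLM is a constructor argument rather than a manual sharding job. A hedged sketch — the AWQ model repo name is an assumption, and this requires two CUDA-capable GPUs:

```python
# Sharding a 4-bit 70B model across two GPUs via vLLM tensor parallelism.
# The model repo name is an assumption; requires two CUDA-capable cards.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",  # assumed AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,   # shard the weights across both GPUs
    dtype="float16",
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Note that renting two cards doubles the hourly bill, which should be factored into any cost-per-token comparison for 70B-class models.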
Which GPU Should You Pick?
Choose the RTX 3090 if:
- Budget efficiency is your top priority
- Your models fit in 24 GB VRAM (most 7-8B and quantised 13B models)
- You are running low-to-moderate concurrency (1-4 users)
- You want the cheapest GPU for AI inference per token generated
Choose the RTX 5090 if:
- Latency matters more than cost (e.g., real-time chatbots)
- You are serving 8+ concurrent users and need higher aggregate throughput
- You want headroom for compute-bound tasks like speculative decoding
For many self-hosted LLM deployments, the RTX 3090 remains the best value in 2025. Our self-host LLM guide walks through the full setup process, and the vLLM vs Ollama comparison helps you choose the right serving framework.
Run Your Own LLM Inference Server
Get a dedicated RTX 3090 or RTX 5090 server with vLLM pre-installed. No shared resources, no token limits, full root access.
Browse GPU Servers
FAQ
Is the RTX 5090 twice as fast as the RTX 3090 for LLMs?
Close, but not quite. In single-user inference, the 5090 is roughly 1.9x faster. Under batched workloads it stretches to about 2.1x. However, the higher server cost means cost-per-token is still lower on the 3090 in most cases.
Can both GPUs run Llama 3 70B?
Not in FP16 on a single card. At 4-bit quantisation (AWQ or GPTQ), the 70B model needs ~38 GB, so you will need at least two GPUs. Both cards support this via tensor parallelism in vLLM.
Should I use vLLM or Ollama for inference?
For production throughput, vLLM with continuous batching is significantly faster. Ollama is simpler for single-user experimentation. See our detailed comparison.
How does the RTX 3090 compare to the newer RTX 5080?
The RTX 5080 vs RTX 3090 comparison covers this in detail. The 5080 brings Blackwell architecture but only 16 GB VRAM, which limits the models it can run at full precision.