Same VRAM, two architectures apart. The RTX 3090 (Ampere, 2020) and the RTX 4090 (Ada Lovelace, 2022) both ship 24 GB of memory, so the model-fit envelope is identical. The differences live in throughput and number formats — and that’s what makes the price gap a real decision.
The RTX 3090 at £159/mo is 45% cheaper than the RTX 4090 at £289/mo, and they fit the same models (24 GB each). The 4090 is roughly 30% faster on FP16 and ships native FP8 hardware the 3090 doesn’t have. Rule of thumb: pick the 3090 if FP16/INT8 is your serving format; pick the 4090 if you want FP8 throughput or future-proofing against FP8-native models. Cost per million tokens lands at £0.14 (3090 FP16), £0.20 (4090 FP16), and £0.17 (4090 FP8) at 60% utilisation.
What’s the same
- 24 GB VRAM on both cards — same model-fit envelope. A 7B–8B FP16 model fits with KV-cache headroom on either; a 13–14B model needs INT8 or INT4 quantisation to fit; a 70B AWQ-INT4 fits on neither (you need an RTX 6000 Pro 96 GB or dual-card setup for that). A rough sizing sketch follows this list.
- Datacentre form factor at GigaGPU — both deploy as single-card baremetal nodes with the same NVMe, networking, and power envelope.
- PCIe Gen 4 x16 host interface — identical bandwidth to the host, so multi-card model parallelism scales the same way.
- CUDA compute capability is close enough that almost every framework (vLLM, TensorRT-LLM, llama.cpp, ComfyUI, SDXL, Whisper) runs without code changes on either card.
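A back-of-envelope sizing check makes that fit envelope concrete. The helper below is a rough sketch: the bytes-per-parameter figures are standard, the KV-cache formula is the usual 2 × layers × KV-heads × head-dim approximation, and the Llama-3.1-8B-like shape is illustrative rather than taken from our benchmarks.

```python
# Rough VRAM sizing: weights plus KV-cache must fit under 24 GB with headroom.
# These are back-of-envelope approximations; real usage also includes
# activations and framework overhead (roughly 1-2 GB on top).

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """FP16/BF16 = 2 bytes, INT8/FP8 = 1 byte, INT4 = 0.5 bytes per parameter."""
    return params_billion * bytes_per_param

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int, bytes_per_value: float = 2.0) -> float:
    """KV-cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch / 1e9

# Illustrative shapes (Llama-3.1-8B-like: 32 layers, 8 KV heads, head_dim 128)
print(weights_gb(8, 2.0))                  # ~16 GB of FP16 weights
print(kv_cache_gb(32, 8, 128, 8192, 4))    # ~4.3 GB of KV-cache at 8k context, batch 4
print(weights_gb(13, 2.0))                 # ~26 GB: a 13B FP16 model exceeds 24 GB
print(weights_gb(70, 0.5))                 # ~35 GB: 70B INT4 exceeds 24 GB too
```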
What’s different
- Architecture: Ampere (3090, GA102) vs Ada Lovelace (4090, AD102). Two generations — SM count, clocks, and Tensor Core revision all jump.
- FP8 hardware: the 4090 has 4th-gen Tensor Cores with native FP8 (E4M3/E5M2). The 3090 does not — on Ampere you’re capped at FP16 / BF16 / INT8 for accelerated paths. This is the single biggest reason to pay the 4090 premium.
- Throughput: ~30% advantage to the 4090 on FP16 LLM inference, larger on compute-bound image-gen workloads (SDXL, Flux).
- Memory bandwidth: 936 GB/s on the 3090 vs ~1,008 GB/s on the 4090 (both GDDR6X). That’s only ~70 GB/s, or about 8%, more on the 4090, a much smaller gap than the compute gap. For decode-heavy LLM serving, bandwidth dominates, which is why the FP16 throughput advantage is “only” ~30%; see the bandwidth-bound sketch after this list.
- RT cores / DLSS 3: irrelevant for AI inference. The Tensor Core upgrade and FP8 support are the parts that matter.
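To see why the compute uplift doesn’t translate directly into decode speed, here is a rough roofline-style estimate: every generated token streams the full weight set from VRAM, so per-stream decode is bounded by bandwidth divided by model size. The bandwidth figures are the spec-sheet numbers above; everything else is an illustrative assumption.

```python
# Back-of-envelope: single-stream decode is memory-bound, so an upper bound
# is memory bandwidth / bytes read per token (roughly the resident weight size).
# Aggregate tok/s in the table below is much higher because vLLM batches
# requests, amortising each weight read across many concurrent sequences.

def decode_tok_s_bound(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

mistral_7b_fp16_gb = 7 * 2.0   # ~14 GB of FP16 weights

print(decode_tok_s_bound(936, mistral_7b_fp16_gb))    # RTX 3090: ~67 tok/s per stream
print(decode_tok_s_bound(1008, mistral_7b_fp16_gb))   # RTX 4090: ~72 tok/s per stream
# The per-stream ceilings differ by the same ~8% as the bandwidth gap; the
# larger ~30% aggregate gap only shows up once batching makes compute matter more.
```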
Token throughput data
All numbers below are aggregate throughput (sum of decode tokens/sec across concurrent requests), measured under vLLM, single-card, FP16 weights as the baseline unless noted. Quantised numbers use AWQ for INT4 and TensorRT-LLM for FP8.
| Workload | RTX 3090 24 GB | RTX 4090 24 GB | 4090 advantage |
|---|---|---|---|
| Mistral 7B FP16 | ~720 tok/s | ~950 tok/s | +32% |
| Mistral 7B FP8 | no native FP8 | ~1,100 tok/s | 4090 only |
| Llama 3.1 8B FP16 | ~680 tok/s | ~890 tok/s | +31% |
| Qwen 2.5 14B INT4 | ~410 tok/s | ~540 tok/s | +32% |
| Qwen 2.5 14B FP8 | no native FP8 | ~620 tok/s | 4090 only |
| SDXL 1024² FP16 | ~5.0 s/image | ~3.4 s/image | +47% |
| Whisper Large-v3 | ~6× RTF | ~8× RTF | +33% |
Two things to note: (1) on FP16 the 4090 is faster but not dramatically — the bandwidth gap is small, so memory-bound decode doesn’t scale with the compute uplift. (2) FP8 on the 4090 is the only place you see a real generational leap, and it’s only available because the silicon supports it. The 3090 has no FP8 path at all.
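For anyone reproducing the FP16 baseline, a minimal vLLM offline run looks roughly like the sketch below; the model ID, prompt set, and sampling settings are illustrative assumptions, not the exact harness behind the table.

```python
# Minimal offline vLLM throughput run (FP16 baseline sketch).
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # illustrative model ID
    dtype="float16",                             # FP16 baseline; quantization="awq" for INT4
    gpu_memory_utilization=0.90,                 # leave headroom for activations
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Summarise the trade-offs of FP8 inference."] * 64  # 64 concurrent requests

t0 = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - t0

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"aggregate throughput: {generated / elapsed:.0f} tok/s")
```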
Cost per million tokens
To compare hourly cards on a per-token basis, use:
cost_per_M = monthly_£ / (tok_per_sec × 86400 × 30 × utilisation) × 1,000,000
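Or, as a small helper (a direct transcription of the formula; the inputs below are the table’s own monthly prices and throughputs):

```python
def cost_per_million_tokens(monthly_gbp: float, tok_per_sec: float,
                            utilisation: float = 0.60) -> float:
    """£ per 1M output tokens for a flat-rate monthly card."""
    tokens_per_month = tok_per_sec * 86_400 * 30 * utilisation
    return monthly_gbp / tokens_per_month * 1_000_000

print(cost_per_million_tokens(159, 720))    # RTX 3090, 7B FP16  -> ~£0.14
print(cost_per_million_tokens(289, 950))    # RTX 4090, 7B FP16  -> ~£0.20
print(cost_per_million_tokens(289, 1100))   # RTX 4090, 7B FP8   -> ~£0.17
```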
At 60% utilisation (a realistic target for steady-state self-hosted inference):
| Card / format | Monthly | Throughput | £ / 1M output tokens |
|---|---|---|---|
| RTX 3090 — 7B FP16 | £159 | 720 tok/s | £0.14 |
| RTX 4090 — 7B FP16 | £289 | 950 tok/s | £0.20 |
| RTX 4090 — 7B FP8 | £289 | 1,100 tok/s | £0.17 |
The 3090 wins on raw £/token at FP16 because the price drop (-45%) is bigger than the throughput drop (-24%). The 4090’s FP8 path narrows the gap to roughly 20% over the 3090, and at that point you’re paying a modest premium for 1,100 tok/s of headroom and a future-proof number format.
When to pick which
- Pick the RTX 3090 (£159/mo) if you serve FP16 or INT8 / INT4 quantised models and want the lowest £/token. This is the right answer for most production LLM hosting in 2026.
- Pick the RTX 3090 if your workload is bandwidth-bound (long-context decode, big KV-cache) — the 4090’s extra compute doesn’t buy you much there.
- Pick the RTX 4090 (£289/mo) if you’re running FP8-native models (Llama 3 with FP8 KV-cache, Mistral FP8 checkpoints, FLUX FP8) or using TensorRT-LLM’s FP8 path.
- Pick the RTX 4090 if image generation is the primary workload — SDXL/Flux are compute-bound and the 4090 is ~47% faster (3.4 s vs 5.0 s per SDXL image).
- Pick the RTX 4090 if you want a 24 GB card that will still be a sensible serving target through the next two model generations — FP8 is the inference format every major lab is now shipping.
Verdict
Same VRAM, same model-fit. If you’re FP16-only, the 3090 is the better deal — full stop. If you want FP8 throughput, the 4090 is the only 24 GB option at this price tier (the next FP8-capable card up is the RTX 5090 with 32 GB at £399/mo, or the RTX 6000 Pro with 96 GB at £899/mo).
Bottom line
For most self-hosted LLM serving, start with the RTX 3090 24 GB at £159/mo. Move up to the RTX 4090 24 GB at £289/mo when you need FP8 hardware or compute-bound throughput. See the full lineup at best GPU for LLM inference.