The RTX 3090 was Ampere’s flagship in 2020, and a remarkable number of UK research labs and bootstrapped AI startups still run it as a workhorse. The RTX 4090 24GB matches it on memory capacity but rebuilds nearly everything else around it: a 76-billion-transistor AD102 die on TSMC 4N versus the Samsung 8N GA102, a 12x larger L2 cache, native FP8 tensor cores and a substantially faster boost clock. If you are sizing inference nodes on UK GPU hosting, the question is no longer whether the 4090 wins on raw performance (it does, on every workload measured below), but how much faster it is on each specific workload, where the gap narrows enough that a secondhand 3090 still pencils out, and whether the FP8 ecosystem has matured enough to make Ampere genuinely obsolete for new builds in 2026.
Contents
- Spec-by-spec architectural comparison
- Native FP8 versus emulated FP16 fallbacks
- The 72MB L2 cache and what it changes
- LLM throughput delta across nine workloads
- Diffusion, Whisper and embedding workloads
- Power efficiency and tokens-per-joule
- Per-workload winner table
- vLLM serving configurations side by side
- Production gotchas when choosing between them
- Verdict and decision criteria
Spec-by-spec architectural comparison
The two cards share a 24GB GDDR6X buffer and a 384-bit memory bus, so on paper they look like siblings. Look beneath the headline numbers and they are different machines built for different eras of AI inference. Ampere was designed when FP16/BF16 was the dominant training and inference precision and L2 cache was small because workloads still bounced through global memory frequently. Ada Lovelace was designed after FP8 had arrived with Hopper, with a working hypothesis that the L2 should be large enough to hold significant slices of attention working sets, and that tensor cores should support 8-bit floating-point formats natively rather than emulating them.
| Spec | RTX 3090 (Ampere GA102) | RTX 4090 (Ada AD102) | Delta |
|---|---|---|---|
| Process node | Samsung 8N (10nm class) | TSMC 4N (5nm class) | Two node jumps |
| Transistors | 28.3 billion | 76.3 billion | 2.7x |
| Die size | 628 mm² | 608 mm² | Smaller, denser |
| SM count | 82 | 128 | +56% |
| CUDA cores | 10,496 | 16,384 | +56% |
| Tensor cores | 328 (3rd gen) | 512 (4th gen) | +56% with FP8 |
| RT cores | 82 (2nd gen) | 128 (3rd gen) | +56% |
| Boost clock | 1.70 GHz | 2.52 GHz | +48% |
| L1 / shared per SM | 128 KB | 128 KB | Same |
| L2 cache | 6 MB | 72 MB | 12x |
| VRAM | 24 GB GDDR6X (19.5 Gbps) | 24 GB GDDR6X (21 Gbps) | Tied capacity |
| Memory bandwidth | 936 GB/s | 1008 GB/s | +8% |
| FP32 TFLOPS | 35.6 | 82.6 | 2.32x |
| FP16 dense TFLOPS | 71 | 165 | 2.32x |
| FP8 TFLOPS | None (FP16 fallback, 71) | 330 dense (660 with FP16 accumulate) | 4.6-9.3x effective |
| NVENC / NVDEC | 7th / 5th gen, no AV1 enc | 2x 8th + 5th, AV1 enc | +AV1 |
| PCIe | Gen 4 x16 | Gen 4 x16 | Same |
| NVLink | NVLink-3 @ 112 GB/s | None (PCIe only) | Removed |
| TDP | 350W | 450W | +29% |
Bandwidth is the surprise: a paltry 8% uplift between the two cards. Almost the entire decode-phase advantage of the 4090 comes from the L2 cache absorbing repeated reads, plus the higher clock and tensor-core throughput, rather than raw memory bandwidth. See the spec breakdown and tier positioning for full context on where the 4090 sits in the 2026 stack.
Native FP8 versus emulated FP16 fallbacks
Ampere has no FP8 tensor instruction. When a vLLM build asks for FP8, the kernel falls back to FP16 on the 3090, halving the effective throughput per tensor-core op compared with what the 4090 can do natively. There is no software trick to recover this — Marlin and similar kernels also emulate. AWQ INT4 is the only quantisation that genuinely lifts a 3090 onto a competitive footing for decode, because the dominant cost there is weight memory traffic, which compresses 4x regardless of tensor-core generation. Even then, the 3090’s missing FP8 path means its KV cache still has to be FP16 (or INT8 via experimental kernels), doubling KV-cache memory pressure and limiting context length on long-prompt RAG workloads.
Ada’s 4th-generation tensor cores expose FP8 in two flavours: E4M3 (more mantissa, used for activations and weights) and E5M2 (more exponent, used for gradients). The Transformer Engine selects automatically, and vLLM’s --quantization fp8 path uses E4M3 for both weights and KV cache. See the FP8 tensor cores deep-dive for the kernel-level details.
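To make the two formats concrete, here is a minimal sketch that prints their numeric limits. It assumes PyTorch 2.1 or later, which exposes both FP8 dtypes for inspection on any hardware; no FP8 tensor cores are needed just to query finfo.
# Compare the two FP8 flavours Ada's tensor cores accept: E4M3 trades range
# for precision, E5M2 the reverse
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dtype)
    print(f"{str(dtype):24s} max={fi.max:>9.1f}  smallest_normal={fi.tiny:.2e}  eps={fi.eps}")
E4M3 tops out around ±448 with finer steps, while E5M2 reaches ±57,344 with coarser ones, which is exactly the range-versus-precision trade being made when activations go to E4M3 and gradients to E5M2.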
The 72MB L2 cache and what it changes
This is the single biggest under-discussed change between Ampere and Ada. The 3090 has 6 MB of L2; the 4090 has 72 MB, a 12x increase. For a Llama-class model serving batch 1, every decoded token requires streaming the weights of every layer through the matmul kernels. With 6 MB of L2, virtually every weight read misses cache and goes out to GDDR6X. With 72 MB, attention KV slabs, embedding rows and intermediate activations stay resident and matmul tiles get far more reuse (a full transformer layer still does not fit), cutting effective memory traffic by 30-45% on typical decode batches. That saving, not the headline +8% bandwidth advantage, is why the 4090 outpaces the 3090 by closer to 2.0-2.4x on real LLM decode.
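A back-of-the-envelope roofline makes the point. This is a sketch rather than a benchmark: the 0.65 effective-traffic factor for the 4090 is an illustrative midpoint of the 30-45% saving described above, not a measured number.
# Batch-1 decode is memory-bound, so tokens/s is roughly
# effective bandwidth / bytes of weights streamed per token
WEIGHT_BYTES_8B_FP16 = 8.0e9 * 2   # ~16 GB of FP16 weights read per decoded token

for card, bw_gbs, traffic_factor in [("RTX 3090", 936, 1.00),
                                     ("RTX 4090", 1008, 0.65)]:
    ceiling = bw_gbs * 1e9 / (WEIGHT_BYTES_8B_FP16 * traffic_factor)
    print(f"{card}: ~{ceiling:.0f} t/s ceiling, Llama 8B FP16 decode at batch 1")
Those ceilings of roughly 58 and 97 t/s line up with the 52 and 95 t/s measured in the table below.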
LLM throughput delta across nine workloads
| Workload | RTX 3090 (best path) | RTX 4090 (best path) | 4090 / 3090 |
|---|---|---|---|
| Llama 3.1 8B FP16 decode b1 | 52 t/s | 95 t/s | 1.83x |
| Llama 3.1 8B FP8 decode b1 | 65 t/s (emulated) | 195 t/s | 3.0x |
| Llama 3.1 8B AWQ decode b1 | 150 t/s | 225 t/s | 1.50x |
| Llama 3.1 8B FP8 batch 32 aggregate | ~430 t/s | 1100 t/s | 2.56x |
| Llama 3.1 70B AWQ INT4 decode b1 | 11-13 t/s | 22-24 t/s | 1.85x |
| Qwen 2.5 14B AWQ decode b1 | 78 t/s | 135 t/s | 1.73x |
| Qwen 2.5 32B AWQ decode b1 | 34 t/s | 65 t/s | 1.91x |
| Mistral 7B FP8 decode b1 | 72 t/s (emulated) | 215 t/s | 2.99x |
| Mixtral 8x7B AWQ decode b1 | 44 t/s | 85 t/s | 1.93x |
The pattern is clear: when FP8 is on the critical path, the 4090 is 2.5-3.0x faster. When AWQ INT4 carries the load, the gap collapses to 1.5-1.9x — still significant, but no longer a generational chasm. See the Llama 3 8B benchmark and Llama 70B INT4 benchmark for the full token-by-token data.
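If you provision nodes with a script, that pattern reduces to a capability check. The heuristic below is our reading of the tables above rather than an official vLLM probe; the flag values themselves (fp8, awq_marlin, fp8_e4m3) are real vLLM options.
# Pick the fast serving path from the detected GPU architecture
import torch

major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (8, 9):      # Ada (sm_89) and newer: native FP8 tensor cores
    quant, kv_dtype = "fp8", "fp8_e4m3"
elif major == 8:                  # Ampere (sm_80/86): AWQ Marlin is the fast path
    quant, kv_dtype = "awq_marlin", "auto"
else:                             # anything older: plain AWQ kernels
    quant, kv_dtype = "awq", "auto"
print(f"--quantization {quant} --kv-cache-dtype {kv_dtype}")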
Diffusion, Whisper and embedding workloads
| Workload | RTX 3090 | RTX 4090 | 4090 / 3090 |
|---|---|---|---|
| SDXL 1024×1024 30-step | 4.6s | 2.0s | 2.30x |
| SDXL batch 4 | 16.8s | 6.5s | 2.58x |
| FLUX.1-schnell FP16 4-step | OOM (24GB tight) | 2.6s | n/a |
| FLUX.1-dev FP8 30-step | ~10.5s (FP16 path) | 4.1s | 2.56x |
| Whisper large-v3-turbo INT8 RTF | 34x RT | 80x RT | 2.35x |
| BGE-large embeddings batch 64 | ~2,100 q/s | ~5,200 q/s | 2.48x |
FLUX.1-dev in FP16 is the workload that exposes the 3090 most starkly: peak VRAM during the joint attention pass touches 22 GB, leaving almost nothing for the text encoders unless you offload them to CPU. The 4090’s FP8 path keeps weights at ~12 GB and finishes in just over 4 seconds. For diffusion-heavy pipelines see the best GPU for Stable Diffusion guide.
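If you must run FLUX.1-dev on a 24GB Ampere card anyway, offloading is the usual escape hatch. A minimal diffusers sketch, assuming the black-forest-labs/FLUX.1-dev checkpoint and enough system RAM to park the offloaded components:
# FLUX.1-dev on 24GB: keep only the active component on the GPU so the text
# encoders never sit in VRAM alongside the transformer (slower, but avoids OOM)
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",
                                    torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

image = pipe("a foggy morning over the Thames, 35mm film photo",
             num_inference_steps=30, guidance_scale=3.5,
             height=1024, width=1024).images[0]
image.save("flux_test.png")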
Power efficiency and tokens-per-joule
This is where the 3090 is quietly humiliated. At idle the gap is small (25W vs 30W), but under load the 4090 delivers vastly more work per watt. On Llama 3 8B FP8 batch 32, the 4090 sustains ~1100 aggregate t/s at ~360W, about 3.05 tokens per joule. The 3090 produces ~430 aggregate t/s at ~330W on the AWQ path (emulated FP8 is even worse), giving roughly 1.30 t/J. That is a 2.35x efficiency advantage; held to a fixed throughput target, it translates into hundreds of pounds of electricity saved per year on a £0.18/kWh UK tariff and dramatically lower thermal load on rack cooling.
| Card | Aggregate t/s | Sustained watts | Tokens/Joule | Annual kWh @ 24/7 |
|---|---|---|---|---|
| RTX 3090 (AWQ) | 430 | 330 | 1.30 | 2,890 |
| RTX 4090 (FP8) | 1,100 | 360 | 3.05 | 3,154 |
| RTX 4090 (FP8, 380W cap) | 1,023 | 302 | 3.39 | 2,646 |
See the tokens-per-watt analysis and power-draw efficiency post for the undervolting protocol that recovers nearly 12% efficiency for a 7% throughput cost.
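To put numbers on those "hundreds of pounds", here is a quick sketch of the annual electricity bill for a fixed 1,100 t/s serving target, using the sustained-watt figures from the table above and the £0.18/kWh tariff; the fractional 3090 count simply scales the 430 t/s AWQ result.
# Annual electricity cost to hold 1,100 aggregate t/s, 24/7, at £0.18/kWh
TARGET_TPS, TARIFF, HOURS = 1100, 0.18, 24 * 365

for name, tps, watts in [("RTX 4090 (FP8)", 1100, 360), ("RTX 3090 (AWQ)", 430, 330)]:
    n_cards = TARGET_TPS / tps            # fractional cards, purely for comparison
    kwh = n_cards * watts / 1000 * HOURS
    print(f"{name}: {n_cards:.2f} cards, {kwh:,.0f} kWh/yr, £{kwh * TARIFF:,.0f}/yr")
On these assumptions the gap works out to roughly £760 a year per 1,100 t/s of capacity, before any cooling overhead.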
Per-workload winner table
| Workload | Winner | Margin | Notes |
|---|---|---|---|
| Llama 3 8B chatbot, < 30 active users | RTX 4090 | 2.5-3.0x | FP8 path is decisive |
| Llama 3 70B AWQ INT4 single-user | RTX 4090 | 1.85x | Both fit 24GB, 4090 has KV headroom |
| Qwen 2.5 32B coding assistant | RTX 4090 | 1.91x | AWQ INT4 on both |
| SDXL studio, 200 imgs/day | RTX 4090 | 2.3x | 3090 viable on a budget |
| FLUX.1-dev production | RTX 4090 | 2.5x | 3090 risks OOM at batch 1 |
| Whisper transcription pipeline | RTX 4090 | 2.35x | Both work well |
| QLoRA fine-tune Llama 8B | RTX 4090 | 1.7-2.0x | BF16 grads, 4090 wins on clock |
| Bulk embedding generation | RTX 4090 | 2.5x | Bandwidth and tensor cores both matter |
| Hobby/dev workstation, 1-2 users | RTX 3090 | n/a | £/perf still favours used 3090 |
| Multi-card NVLink training (rare) | RTX 3090 | n/a | 4090 has no NVLink |
vLLM serving configurations side by side
The biggest practical difference between the two cards is the FP8 flag. On a 4090, this is the production default. On a 3090, asking for FP8 drops you onto an emulated path that trails AWQ INT4 by a wide margin (see the throughput table above), so AWQ Marlin is the right answer.
# RTX 4090 24GB — Llama 3.1 8B FP8 production serve
docker run --rm --gpus all -p 8000:8000 \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 16384 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.92
# RTX 3090 24GB — same model via AWQ Marlin (FP8 emulated is slower)
docker run --rm --gpus all -p 8000:8000 \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--quantization awq_marlin \
--max-model-len 8192 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.90
Note the lower --max-num-seqs on the 3090: with FP16 KV cache, each concurrent sequence costs roughly twice the memory of FP8 KV, so the safe ceiling is half. See the vLLM setup guide and FP8 deployment guide for the production-ready Compose files.
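The arithmetic behind that halving, as a quick sketch; the 32-layer, 8-KV-head, 128-dim shape is Llama 3.1 8B's published GQA configuration:
# KV-cache cost per sequence for Llama 3.1 8B: K and V for every layer,
# every KV head, every position
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_gb(ctx_len, bytes_per_value):
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * ctx_len / 1e9

print(f"FP16 KV, 16k context: {kv_gb(16384, 2):.2f} GB per sequence")
print(f"FP8  KV, 16k context: {kv_gb(16384, 1):.2f} GB per sequence")
print(f"FP16 KV,  8k context: {kv_gb(8192, 2):.2f} GB per sequence")
FP16 at 8k costs the same per sequence as FP8 at 16k, which is why the 3090 configuration halves both the context window and the concurrency ceiling to stay inside the same KV budget.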
# Tokens-per-joule benchmark — run on either card (needs requests and nvidia-ml-py)
import time, threading, requests, pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
watts, running = [], True

def sample_power():
    # Poll board power (reported in milliwatts) every 100 ms while the run is live
    while running:
        watts.append(pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0)
        time.sleep(0.1)

threading.Thread(target=sample_power, daemon=True).start()
N, toks, t0 = 200, 0, time.time()
for i in range(N):
    r = requests.post("http://localhost:8000/v1/completions",
        json={"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
              "prompt": "Summarise the UK Online Safety Act in 80 words:",
              "max_tokens": 80, "temperature": 0})
    toks += r.json()["usage"]["completion_tokens"]
dt = time.time() - t0
running = False
avg_w = sum(watts) / len(watts)
print(f"{toks/dt:.1f} t/s aggregate at {avg_w:.0f} W, {toks / (avg_w * dt):.2f} tokens/joule")
Production gotchas when choosing between them
- 3090 NVENC limit on AV1. If your pipeline does video transcoding alongside inference (Whisper feeds, content moderation), the 3090 lacks AV1 encode. The 4090 has dual 8th-gen NVENC with AV1 — material for any video-adjacent workload.
- FP8 KV cache is non-negotiable for long context on 24GB. The 3090 cannot do native FP8 KV; you will run out of memory on Llama 70B AWQ at 16k context where the 4090 is comfortable.
- 3090 cooling is a real problem in rack chassis. The reference 3090 FE is a triple-slot flow-through design that dumps part of its heat straight back into the case. In a 4U server, expect to throttle below 1.5 GHz under sustained load. The 4090’s 3.5-slot triple-fan AIB designs cool better at higher TDP.
- Triton kernel coverage gap. Several recently published kernels (FlashInfer’s prefix-sharing variants, some Mamba kernels) ship with sm_89 (Ada) tuning and fall back to slower paths on sm_86 (Ampere). The gap widens monthly.
- Driver lifetime. NVIDIA’s consumer driver branch still supports both, but Ampere’s optimisation push has clearly ended. New CUDA features land first on Ada and Hopper.
- Power supply sizing. 4090s require a 12VHPWR or 12V-2×6 connector; many older power supplies lack the native cable and need an adapter. The 3090 still uses 3x 8-pin and slots into anything.
- Used-3090 lottery. Ex-mining 3090s on the secondary market frequently have degraded GDDR6X with elevated error rates under load. Demand a clean MemTest run before purchase (a quick smoke test is sketched below).
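For a fast first pass while you wait for a proper MemTest window, a rough VRAM pattern check can be run from PyTorch. This is a sketch and a smoke test only; the block size, coverage fraction and seeding scheme are arbitrary choices, and it will not catch subtle faults the way a dedicated memory tester does.
# Fill ~90% of free VRAM with seeded random blocks, then regenerate and compare
import torch

free_bytes, _ = torch.cuda.mem_get_info()
CHUNK = 256 * 1024 * 1024                       # 256 MB blocks
n_chunks = int(free_bytes * 0.9) // CHUNK
blocks, errors = [], 0

for i in range(n_chunks):
    torch.manual_seed(i)
    blocks.append(torch.randint(0, 256, (CHUNK,), dtype=torch.uint8, device="cuda"))

for i, block in enumerate(blocks):
    torch.manual_seed(i)                        # same seed regenerates the same block
    expected = torch.randint(0, 256, (CHUNK,), dtype=torch.uint8, device="cuda")
    errors += int(not torch.equal(block, expected))

print(f"{n_chunks} blocks checked, {errors} mismatches (expect 0 on healthy GDDR6X)")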
Verdict and decision criteria
For new production builds in 2026, the RTX 4090 wins on every axis except up-front capital cost and the niche NVLink pairing case. The decision tree is straightforward:
- Pick the RTX 4090 24GB if you are building a chatbot, RAG, coding assistant or image-generation backend serving real users; need FP8 throughput; care about tokens-per-watt; need AV1 video encode; or want a path to FLUX.1-dev without OOM gymnastics. See the 4090 or 3090 decision guide.
- Pick the RTX 3090 24GB if budget is the binding constraint; you are doing one-off research with batch 1; you specifically need NVLink for a small two-card training experiment; or you can find a clean used unit at half the 4090’s price and accept a 1.7-2.5x slowdown.
- Pick neither if you need 70B+ at FP8 (consider RTX 5090 32GB) or 100B+ at any precision (consider RTX 6000 Pro 96GB).
For the typical 200-MAU SaaS RAG workload or 12-engineer coding-assistant team, the 4090 is the only sensible choice today. For a hobbyist running batch-1 chat for personal use, a £600 used 3090 is still a defensible buy.
Run the 4090 properly, not the 3090 you’ve outgrown
GigaGPU’s UK dedicated hosting puts you on a fresh RTX 4090 24GB in a properly cooled 4U chassis with the FP8 vLLM image pre-flighted — no 12VHPWR sourcing, no MemTest gambling, no NVENC fallback.
Order the RTX 4090 24GB
See also: RTX 4090 spec breakdown, FP8 tensor cores on Ada, RTX 4090 vs RTX 5090, tokens-per-watt, Llama 70B INT4 deployment, 4090 or 3090 decision, 2026 tier positioning.