The RTX 4090 24GB and RTX 3090 24GB share the same VRAM ceiling, which makes them look interchangeable on a spec sheet. They are not. The 4090’s Ada AD102 die brings native 4th-gen FP8 tensor cores and roughly 3x the inference throughput of the Ampere GA102 in the 3090 on modern quantised workloads. The 3090 retains one trump card the 4090 lacks: NVLink-3, useful for multi-card setups. This guide compares both for the workloads buyers actually run on UK dedicated GPU hosting, with reference to the wider gigagpu range, and lays out where each card wins decisively.
Contents
- Spec sheet at a glance
- Inference throughput across workloads
- FP8 native vs emulated: the architectural gap
- Model fit and KV headroom
- Per-workload winner table (10 workloads)
- Cost-per-token and watts-per-token
- Production gotchas
- Verdict and when each card wins
Spec sheet at a glance
| Spec | RTX 4090 24GB | RTX 3090 24GB |
|---|---|---|
| Architecture | Ada AD102 | Ampere GA102 |
| CUDA cores | 16,384 | 10,496 |
| Tensor cores | 512 (4th gen) | 328 (3rd gen) |
| VRAM | 24GB GDDR6X | 24GB GDDR6X |
| Bandwidth | 1,008 GB/s | 936 GB/s |
| TDP | 450W | 350W |
| Native FP8 | Yes (4th gen) | No (emulated via FP16) |
| FP16 TFLOPS dense | 165 | 71 |
| FP8 TFLOPS sparse (theoretical) | ~660 | n/a |
| NVLink | No | NVLink-3 (~112 GB/s) |
| PCIe | Gen4 x16 | Gen4 x16 |
| Launch year | 2022 | 2020 |
| Approx UK dedicated £/mo | £550 | £275 |
Inference throughput across workloads
The headline gap is biggest on FP8 workloads where the 4090 has dedicated silicon and the 3090 must fall back to FP16 or emulate FP8 in software via Marlin or bitsandbytes kernels. For straight FP16 work the gap narrows but stays decisive. For INT4 work (AWQ, GPTQ) both cards lean on shader-based dequantisation and the gap is smallest.
| Workload | RTX 4090 | RTX 3090 | Speedup |
|---|---|---|---|
| Llama 3.1 8B FP8 batch 1 | 198 t/s | 65 t/s (emulated) | 3.0x |
| Llama 3.1 8B FP16 batch 1 | 105 t/s | 52 t/s | 2.0x |
| Llama 3.1 8B AWQ INT4 batch 1 | 180 t/s | 150 t/s | 1.2x |
| Llama 3.1 70B AWQ INT4 batch 1 | 22 t/s | 14 t/s | 1.6x |
| Llama 3.1 70B AWQ INT4 conc 4 | ~110 t/s aggregate | ~58 t/s aggregate | 1.9x |
| Qwen 2.5 14B FP8 batch 1 | 120 t/s | 40 t/s (emulated) | 3.0x |
| Mixtral 8x7B AWQ INT4 | ~38 t/s | ~24 t/s | 1.6x |
| SDXL 1024×1024, 30 steps | 3.4 s/image | 7.1 s/image | 2.1x |
| Whisper Large v3, 1hr audio | 22 s | 54 s | 2.5x |
| Flux.1 Dev 1024×1024 | 14 s/image | ~38 s/image | 2.7x |
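Batch-1 token-throughput figures like the rows above can be reproduced with a short vLLM offline run: generate a fixed number of tokens and divide by wall-clock time. This is a minimal sketch, assuming vLLM is installed and the card is visible to CUDA; the model choice and token count are illustrative, not the exact benchmark configuration used here.

```python
# Minimal batch-1 decode-throughput probe using vLLM's offline API.
# Assumes vLLM is installed and the model weights fit in the card's 24GB.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    quantization="fp8",  # native FP8 matmul on Ada; FP16-compute fallback on Ampere
    max_model_len=4096,
)

params = SamplingParams(temperature=0.0, max_tokens=512, ignore_eos=True)
prompt = "Explain the trade-offs between FP8 and INT4 quantisation."

llm.generate([prompt], params)  # warm-up: kernel selection, cache allocation

start = time.perf_counter()
out = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

generated = len(out[0].outputs[0].token_ids)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```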
FP8 native vs emulated: the architectural gap
The 3090 can run FP8 weights through Marlin or bitsandbytes kernels but the matmul itself happens at FP16 with conversion overhead at every layer. That costs roughly 60-70% of theoretical FP8 throughput. The 4090 has hardware FP8 matmul, so vLLM, TensorRT-LLM, and SGLang all hit close to peak. For Llama 70B AWQ INT4 the difference is smaller because the dominant cost is INT4 dequantisation – both cards do that in shaders.
Practically, this means a 4090 deployment for FP8 chat APIs runs three times the user concurrency of a 3090 for the same model. If your roadmap is “FP8 everything” – and most modern inference stacks default to that – the 4090’s architectural advantage is decisive. Read the deeper analysis in FP8 tensor cores on Ada.
What FP8 emulation actually costs on a 3090
vLLM’s `--quantization fp8` flag on a 3090 dispatches Marlin kernels that pack FP8 weights but execute the matmul at FP16. The kernel does the dequantisation per-layer, costing 30-40% of the saved memory bandwidth back as compute overhead. Net result: FP8 on 3090 is faster than FP16 on memory-bound layers but slower than FP8 on Ada by a factor of 2-3x.
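Which path you get comes down to compute capability: Ada reports SM 8.9 and has FP8 tensor cores, while the 3090's GA102 reports SM 8.6 and does not. A small sketch, assuming PyTorch is installed, for checking which path a box will take before committing to an FP8 deployment:

```python
# Check whether the visible GPU has native FP8 tensor cores (Ada, SM 8.9+).
# On the 3090's GA102 (SM 8.6), FP8 weights are stored packed but the matmul
# itself runs at FP16, which is where the 2-3x gap comes from.
import torch

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
has_native_fp8 = (major, minor) >= (8, 9)  # RTX 4090 reports (8, 9); RTX 3090 reports (8, 6)

if has_native_fp8:
    print(f"{name}: native FP8 matmul available")
else:
    print(f"{name}: FP8 weights will be dequantised and computed at FP16")
```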
Model fit and KV headroom
Both cards have 24GB so the maximum model footprint is identical. What changes is whether you can use the modern quantisation formats efficiently, and how much KV cache you have left after the weights load. The 4090’s native FP8 KV-cache support gives it more effective concurrency on the same VRAM.
| Model | 4090 24GB fits? | 3090 24GB fits? | KV / concurrency notes |
|---|---|---|---|
| Llama 3.1 8B FP16 | Yes (~16GB) | Yes (~16GB) | Identical fit |
| Llama 3.1 8B FP8 | Yes (~8GB), 16GB free for KV | Yes emulated, slower | 4090 ~3x throughput |
| Llama 3.1 70B AWQ INT4 | Yes (~17GB + FP8 KV) | Yes (~17GB + FP16 KV tight) | 3090 has less KV headroom |
| Qwen 2.5 14B FP8 | Yes, lots of KV | Emulated, tight | 4090 ~3x throughput |
| Qwen 2.5 32B AWQ INT4 | ~22GB tight | ~22GB tight | Both marginal |
| SDXL + refiner | Yes | Yes | Either works |
| Mixtral 8x7B AWQ INT4 | Marginal (~24-25GB, may need offload) | Marginal (similar) | Marginal both |
| Flux.1 Dev BF16 | Fits with offload | Fits with offload | Both need CPU offload |
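The headroom column reduces to simple arithmetic: KV cache per token is 2 (K and V) x layers x KV heads x head_dim x bytes per element, and whatever VRAM remains after weights and runtime overhead divides by that figure. A rough sketch, assuming Llama 3.1 8B's published attention config and a hypothetical overhead allowance; the outputs are approximate:

```python
# Back-of-envelope KV-cache headroom for Llama 3.1 8B on a 24GB card.
# Assumes the published config: 32 layers, 8 KV heads (GQA), head_dim 128.
GIB = 1024 ** 3

def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store kv_heads * head_dim elements per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_token_budget(vram_gib=24.0, weights_gib=8.0, kv_dtype_bytes=1, overhead_gib=1.5):
    # overhead_gib is a rough allowance for activations and the CUDA context.
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    return int(free_bytes // kv_bytes_per_token(dtype_bytes=kv_dtype_bytes))

# 4090: FP8 weights (~8GB) plus FP8 KV cache (1 byte per element)
print("FP8 weights + FP8 KV :", kv_token_budget(kv_dtype_bytes=1), "tokens of KV")
# 3090: FP8-packed weights but FP16 KV cache (2 bytes per element) as the safer default
print("FP8 weights + FP16 KV:", kv_token_budget(kv_dtype_bytes=2), "tokens of KV")
```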
Per-workload winner table (10 workloads)
| Workload | 4090 winner | 3090 winner | Why |
|---|---|---|---|
| Llama 8B FP8 chat API, high concurrency | Yes | No | Native FP8 = 3x throughput |
| Llama 70B AWQ INT4 batch jobs | Marginal | Yes (price) | 3090 cheaper per token, INT4-bound |
| Qwen 14B FP8 chat | Yes | No | FP8 native 3x advantage |
| SDXL 200 images/day | Yes (speed) | Yes (price) | 4090 2x faster, 3090 half price |
| Flux.1 Dev image gen | Yes | No | 2.7x throughput, latency matters |
| Whisper transcription queue | Yes | Marginal | 2.5x faster, batch latency matters |
| 2-card NVLink scaling | No | Yes | 4090 has no NVLink, 3090 has NVLink-3 |
| Multi-tenant SaaS, FP8 inference | Yes | No | FP8 throughput separates cards |
| Fine-tuning sprints (QLoRA 8B) | Yes | No | 2x training throughput |
| Cost-bound research lab | No | Yes | 3090 at half price, FP8 emulation acceptable |
Cost-per-token and watts-per-token
Assume £550/month for a dedicated 4090 and £275/month for a dedicated 3090, so the 4090 costs 2.0x as much. For most LLM inference workloads it produces 2.5-3x the throughput, so the 4090 wins on cost-per-token despite the higher monthly fee. The exception is INT4-bound work, where the gap closes. In the table below the unit is a million tokens for the LLM rows, an image for the SDXL and Flux.1 rows, and an audio-hour for the Whisper row.
| Workload | 4090 £/unit | 3090 £/unit | 4090 W/unit | 3090 W/unit | Winner |
|---|---|---|---|---|---|
| Llama 8B FP8 24/7 conc 8 | £0.039 | £0.058 | 0.061 | 0.092 | 4090 both axes |
| Llama 70B AWQ INT4 24/7 conc 4 | £0.34 | £0.25 | 0.66 | 0.40 | 3090 both axes |
| Qwen 14B FP8 24/7 conc 8 | £0.063 | £0.092 | 0.10 | 0.16 | 4090 both axes |
| SDXL £/image, 24/7 queue | £0.0009 | £0.0010 | 0.0014 | 0.0016 | 4090 both axes |
| Flux.1 Dev £/image | £0.0036 | £0.0050 | 0.0056 | 0.0080 | 4090 both axes |
| Whisper £/audio-hour | £0.0040 | £0.0049 | 0.0063 | 0.0079 | 4090 both axes |
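The cost columns come down to one formula: monthly rental divided by units of work produced per month at sustained throughput. A minimal sketch of the mechanics; the throughput and utilisation inputs below are illustrative assumptions, not the exact figures behind the table:

```python
# Cost per unit of work = monthly rental / units produced per month.
# Rental prices are from this section; throughput and utilisation inputs
# are illustrative placeholders, not the exact figures behind the table.
SECONDS_PER_MONTH = 30 * 24 * 3600

def cost_per_million_tokens(monthly_gbp, tokens_per_second, utilisation=1.0):
    tokens = tokens_per_second * SECONDS_PER_MONTH * utilisation
    return monthly_gbp * 1e6 / tokens

def cost_per_image(monthly_gbp, seconds_per_image, utilisation=1.0):
    images = SECONDS_PER_MONTH * utilisation / seconds_per_image
    return monthly_gbp / images

# Hypothetical sustained aggregate rate of 1,000 tok/s on a £550/mo 4090:
print(f"£{cost_per_million_tokens(550, 1_000):.3f} per million tokens")
# SDXL at 3.4 s/image (batch-1 figure from the throughput table), assuming 80% utilisation:
print(f"£{cost_per_image(550, 3.4, utilisation=0.8):.4f} per image")
```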
Production gotchas
- Marlin/bitsandbytes kernel availability on 3090. FP8 emulation requires specific vLLM/SGLang versions. Older deployments fall back to slow paths. Pin versions and benchmark.
- 3090 cooling under sustained load. The 350W reference design throttles in poorly-ventilated 1U/2U chassis. Datacentre installs need active airflow; log sustained draw and temperature during load tests rather than trusting the TDP figure (see the monitoring sketch after this list). Cross-reference thermal performance.
- NVLink-3 availability on 3090 dedicated. Many hosting providers do not bridge dual 3090s with NVLink. Confirm before you order if you intend to use the bandwidth.
- FP8 KV cache on 3090. vLLM’s `--kv-cache-dtype fp8` works but with FP16 conversion overhead per access. Real win is smaller than on 4090.
- Driver and feature lifecycle for Ampere. The 3090 is two architecture generations old, and newer features (Triton kernels, FlashAttention-3) sometimes target Ada and newer first.
- Power budget for 2x 3090. 700W GPU + 200W host = 900W sustained. Many shared rack PDUs limit to 1500W per outlet; 2x 3090 + headroom is tight.
- Resale value asymmetry. 4090 retains value strongly; 3090 has depreciated. Affects total cost of ownership if you’re buying rather than renting.
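For the cooling and power-budget points above, it helps to log sustained draw and temperature during a load test rather than relying on the TDP figure. A minimal monitoring sketch using NVML via the nvidia-ml-py bindings (imported as `pynvml`); the one-second sample interval is arbitrary:

```python
# Log sustained GPU power draw and temperature during a load test.
# Requires the nvidia-ml-py bindings (imported as pynvml).
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)
if isinstance(name, bytes):  # older bindings return bytes
    name = name.decode()

try:
    while True:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"{name}: {power_w:.0f} W, {temp_c} C")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```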
Verdict and when each card wins
Pick the 4090 for any FP8 workload (8B or 14B Llama/Qwen/Mistral chat APIs, modern quantised inference at high concurrency), for image generation and audio transcription (SDXL, Flux, Whisper), for fine-tuning sprints, and for any deployment where cost-per-token at scale matters more than absolute monthly spend. Pick the 3090 if your budget is tight and your workload is Llama 70B AWQ INT4 batch jobs, where INT4 dequantisation dominates and FP8 emulation does not matter, or if you need NVLink-3 for cheap multi-card setups (the 4090 has no NVLink at all). For most modern inference stacks running FP8 chat models, the 4090 is the better buy at UK dedicated rates; for cost-sensitive research labs grinding INT4 traffic, the 3090 is still defensible.
Native FP8 throughput
Ada AD102 with hardware FP8 4th-gen tensor cores. Three times the FP8 inference of Ampere on the same 24GB VRAM. UK dedicated hosting.
Order the RTX 4090 24GB. See also: RTX 4090 vs 3090 for AI, FP8 tensor cores on Ada, Llama 3 8B benchmark, Llama 70B INT4 benchmark, spec breakdown, tier positioning 2026, tokens per watt, vLLM setup, FP8 Llama deployment, thermal performance, power draw efficiency, 5090 decision, 5060 Ti decision, 5090 vs 3090, 70B INT4 VRAM, best GPU for fine-tuning.