The 2026 GPU lineup is the most crowded NVIDIA has ever shipped. Blackwell is mature, Ada is stable, Ampere is cheap on the second-hand market, and the datacentre tiers (H100, A100, RTX 6000 Pro) cover the high end. The RTX 4090 24GB sits in a peculiar middle position: a generation behind the headlines but still the best price-per-token card for many workloads on UK GPU hosting. This guide walks the entire lineup, gives the pick rationale workload by workload, lays out used vs new pricing in 2026, names the workloads where the 4090 stays relevant, flags the signals that it is time to sunset, and ends with a verdict.
Contents
- The full 2026 NVIDIA lineup
- Where the 4090 fits theoretically
- When to pick the 4090 vs each alternative
- Used vs new pricing in 2026
- Named workloads where it stays relevant
- Signals to consider sunsetting
- Production gotchas
- Verdict
The full 2026 NVIDIA lineup
| Card | Tier | VRAM | Bandwidth | Tensor gen | FP8 native | FP4 native |
|---|---|---|---|---|---|---|
| RTX 3090 24GB | Legacy / used market | 24 GB GDDR6X | 936 GB/s | 3rd | No | No |
| RTX 5060 Ti 16GB | Entry consumer Blackwell | 16 GB GDDR7 | 448 GB/s | 5th | Yes | Yes |
| RTX 5080 16GB | Mid consumer Blackwell | 16 GB GDDR7 | 960 GB/s | 5th | Yes | Yes |
| RTX 4090 24GB | Top consumer Ada | 24 GB GDDR6X | 1008 GB/s | 4th | Yes | No |
| RTX 5090 32GB | Top consumer Blackwell | 32 GB GDDR7 | 1792 GB/s | 5th | Yes | Yes |
| RTX 6000 Pro 96GB | Workstation Blackwell | 96 GB GDDR7 ECC | ~1.8 TB/s | 5th | Yes | Yes |
| A100 80GB | Datacentre legacy Ampere | 80 GB HBM2e | 2 TB/s | 3rd | No | No |
| H100 80GB | Datacentre Hopper | 80 GB HBM3 | 3.35 TB/s | 4th | Yes | No |
Where the 4090 fits theoretically
The 4090 is built on Ada AD102: 16,384 CUDA cores, 24 GB GDDR6X, 1008 GB/s memory bandwidth, 72 MB L2 cache, 450W TDP, and 4th-generation tensor cores with native FP8 (1320 TFLOPS sparse). It is the only consumer card in the 2026 lineup with both 24 GB and FP8. The 5060 Ti and 5080 top out at 16 GB, which restricts them to comfortable serving of 7-13B-class models. The 5090 leapfrogs it at 32 GB. The 6000 Pro sits in workstation territory at 96 GB. Below the 4090 is the 3090 — the same 24 GB but no FP8 and 7% less bandwidth.
The bandwidth-VRAM trade-off
For LLM decode the dominant constraint is memory bandwidth: every decoded token requires reading the full weight tensor. A 14B FP8 model is ~14 GB of weights; at 1008 GB/s that is a theoretical 72 traversals per second — a ~72 t/s single-stream ceiling before kernel overhead. In practice the 4090 sustains roughly 140 t/s aggregate on a 14B FP8 model at modest batch sizes, because batching amortises each weight traversal across several sequences. The 5090 roughly doubles that to ~280 t/s on the same model thanks to GDDR7's 1792 GB/s.
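The arithmetic is easy to reproduce. A minimal sketch of the ceiling calculation — it ignores KV-cache traffic, activations, and kernel overhead, so treat the outputs as upper bounds, not benchmarks:

```python
# Decode throughput ceiling from memory bandwidth alone. Batching raises
# the aggregate ceiling because one weight traversal serves every
# sequence in the batch.

def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float, batch: int = 1) -> float:
    """Upper bound on decode tokens/s: each step reads the full weight
    tensor once from VRAM, shared across the whole batch."""
    return bandwidth_gb_s / weights_gb * batch

# 14B model at FP8 ~= 14 GB of weights
print(decode_ceiling_tps(1008, 14))           # 4090, batch 1: ~72 t/s
print(decode_ceiling_tps(1008, 14, batch=4))  # 4090, batch 4: ~288 t/s ceiling
print(decode_ceiling_tps(1792, 14))           # 5090, batch 1: ~128 t/s
```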
When to pick the 4090 vs each alternative
| Alternative | Pick the 4090 over it when | Pick the alternative when |
|---|---|---|
| RTX 3090 24GB | You need FP8 native (Llama 3 8B, Phi-3 Medium, Qwen Coder 32B AWQ) | Pure throughput on FP16 / GPTQ workloads, lowest acquisition cost |
| RTX 5060 Ti 16GB | You need to host any model larger than 14B (32B AWQ, 70B INT4) | Lower TDP, lowest cost-per-token for 7-8B FP8 workloads |
| RTX 5080 16GB | You need 24 GB headroom (KV-heavy long context, 32B AWQ) | Latest Blackwell, FP4 inference, lower power; for <=14B FP8 |
| RTX 5090 32GB | You’re price-sensitive and don’t need 32 GB or FP4 | You need 32 GB (larger 70B KV, 32B FP8 not AWQ), or FP4 inference |
| RTX 6000 Pro 96GB | You’re price-sensitive and a single 14-32B model is enough | You need 70B FP8 native, 180B AWQ, ECC, multi-model on one card |
| A100 80GB | You need FP8 (A100 lacks it) and AWQ workloads suit you | Massive HBM bandwidth (2 TB/s), NVLink for multi-card 70B |
| H100 80GB | Volume is below ~1B tokens/month and you don’t need NVLink | Best $/perf at scale, NVLink TP for multi-card 70B FP8, sustained 70B production |
vs 3090: the FP8 question
The 3090 is roughly 40% cheaper used. For models that quantise well to AWQ INT4 (Qwen Coder 32B, Llama 70B), the 3090 is a credible value play. Once you need FP8 — Phi-3 Medium, Llama 3 8B FP8, FLUX FP8 — the 4090 wins because Ampere has no FP8 tensor path and falls back to BF16, halving throughput. See 4090 vs 3090 for full numbers.
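If you are unsure what a given card exposes, the tensor generations map to CUDA compute capability: Ada is SM 8.9, Hopper SM 9.0, and Ampere 8.0/8.6. A quick check, assuming a CUDA build of PyTorch on the host:

```python
import torch

# Native FP8 tensor cores ship with compute capability 8.9 (Ada) and
# 9.0 (Hopper); Ampere parts report 8.0/8.6 and fall back to BF16.
major, minor = torch.cuda.get_device_capability(0)
has_fp8 = (major, minor) >= (8, 9)
print(f"SM {major}.{minor}: native FP8 path = {has_fp8}")
```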
vs 5090: the upgrade question
The 5090 has 32 GB and 1792 GB/s — roughly 78% more bandwidth than the 4090. For larger KV budgets (70B at higher concurrency, 32B FP8 instead of AWQ) and native FP4 inference, the 5090 is the right buy. For 7-32B AWQ workloads it's roughly 80-100% faster but 50-80% more expensive — diminishing returns. See 4090 vs 5090 and the decision guide.
vs H100: the production-scale question
The H100 has 80 GB HBM3 at 3.35 TB/s — roughly 3.3x the bandwidth of a 4090. For sustained production loads above 1-2B tokens/month or for multi-card 70B FP8 with NVLink, H100 wins on $/perf. Below 500M tokens/month, the 4090 is dramatically cheaper. See 4090 vs H100.
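The crossover is ultimately a £-per-token calculation. A sketch with placeholder numbers — the two monthly rents below are hypothetical and should be replaced with your provider's real pricing, and the per-card capacities are this guide's rough estimates:

```python
# Hypothetical break-even between a rented 4090 and a rented H100.
# Replace the rents and capacities with real figures before deciding.

def gbp_per_million_tokens(monthly_rent_gbp: float, tokens_per_month: float) -> float:
    return monthly_rent_gbp / (tokens_per_month / 1e6)

rtx_4090 = gbp_per_million_tokens(300, 400e6)   # placeholder rent, ~400M t/mo capacity
h100 = gbp_per_million_tokens(1500, 3000e6)     # placeholder rent, ~3B t/mo capacity
print(f"4090: £{rtx_4090:.2f}/M tokens  H100: £{h100:.2f}/M tokens")
# The H100 only wins once you can actually keep that capacity busy.
```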
Used vs new pricing in 2026
| Card | Approx. UK price 2026 | £/GB VRAM | £/TFLOPS FP8 | £/MAU served (Llama 8B) |
|---|---|---|---|---|
| RTX 3090 (used) | £550-650 | £25 | n/a (no FP8) | n/a |
| RTX 5060 Ti 16GB (new) | £450-500 | £30 | ~£1.0 | £0.012 |
| RTX 5080 16GB (new) | £1,000-1,150 | £67 | £4.5 | £0.018 |
| RTX 4090 24GB (used) | £1,100-1,300 | £50 | £3.6 | £0.014 |
| RTX 4090 24GB (new, where stocked) | £1,500-1,700 | £67 | £4.8 | £0.018 |
| RTX 5090 32GB | £1,950-2,250 | £66 | £3.8 | £0.013 |
| RTX 6000 Pro 96GB | £8,500+ | £89 | £8.5 | n/a (different scale) |
Used 4090s are the value play in 2026 — same FP8 capability as new, lower acquisition cost, and unaffected by Blackwell launch supply tightness. New 4090s are still being sold in select channels at a premium. Compare with the monthly hosting cost if you’d rather rent than buy.
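If you are weighing a used purchase against renting, the break-even is simple arithmetic. A sketch using the table's used-4090 mid-point — the hosting and colocation rates below are placeholders, not quotes:

```python
# Buy-vs-rent break-even for a used 4090. Only the £1,200 purchase price
# comes from the table above; the monthly rates are hypothetical.

USED_4090_GBP = 1200        # mid-point of the £1,100-1,300 used range
HOSTED_MONTHLY_GBP = 250    # placeholder: all-in rental for a hosted 4090
OWNED_MONTHLY_GBP = 80      # placeholder: power + colocation for an owned card

months = USED_4090_GBP / (HOSTED_MONTHLY_GBP - OWNED_MONTHLY_GBP)
print(f"Owning pays for itself after ~{months:.0f} months")  # ~7 months
```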
Named workloads where it stays relevant
| Workload | 4090 verdict | Best alternative |
|---|---|---|
| 7-13B FP8 inference (Llama 3 8B, Phi-3, Mistral 7B) | Best £/token in lineup, beats 5080 on KV headroom | 5060 Ti for cheapest, 5080 for newest tensor |
| Llama 70B INT4 single-card serving | Cheapest card that fits (5060/5080 cannot) | 5090 for higher KV, H100 for production |
| Qwen 2.5 Coder 32B AWQ for coding teams | Sweet spot — fits with FP8 KV, batch 4-8 | 5090 if KV pressure becomes constant |
| FLUX.1-dev image generation | 24 GB fits FP16 with LoRAs comfortably | 5090 for batch generation |
| Whisper large-v3-turbo transcription | ~80x real-time on a single card | any card with 12+ GB |
| QLoRA fine-tuning up to 13B | Excellent — 24 GB fits gradients | 6000 Pro for larger models |
| Mistral Nemo 12B at full 128k context | Just fits at FP8 — only 24+ GB cards can (sizing sketch below) | 5090, 6000 Pro |
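The Nemo row is worth reproducing, because it shows how tight the fit is. A sizing sketch assuming Mistral Nemo's published config (40 layers, 8 KV heads, head dim 128) and ~1 byte per parameter at FP8; real deployments also need headroom for activations and the CUDA context:

```python
# KV-cache + weight sizing for Mistral Nemo 12B at 128k context, FP8.

LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128   # Nemo's published architecture
CTX_TOKENS = 128 * 1024
BYTES_PER_ELEM = 1                        # FP8 KV cache

kv_gb = LAYERS * 2 * KV_HEADS * HEAD_DIM * CTX_TOKENS * BYTES_PER_ELEM / 1e9
weights_gb = 12.2                         # ~12.2B params at ~1 byte/param
print(f"KV {kv_gb:.1f} GB + weights {weights_gb:.1f} GB = {kv_gb + weights_gb:.1f} GB")
# -> ~10.7 GB KV + ~12.2 GB weights = ~22.9 GB: inside 24 GB,
#    out of reach of any 16 GB card
```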
Named scenario: SaaS RAG product, 30k MAU
30k MAU on a knowledge-base assistant averaging 30k tokens/user/month is 900M tokens/month. A single 4090 running Qwen 32B AWQ at 70% utilisation handles around 400M, so you need two to three cards or to move the flagship traffic to an H100. A 5090 running the same model at higher batch handles ~750M on one card — a better fit.
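The capacity arithmetic, as a sketch — the 400M tokens/month-per-card figure is this guide's estimate, and real traffic is peaky, so treat the card count as a floor:

```python
# Monthly token demand -> sustained tokens/s and 4090 count for the
# 30k-MAU RAG scenario above.

def sustained_tps(tokens_per_month: float) -> float:
    return tokens_per_month / (30 * 24 * 3600)   # seconds in ~30 days

demand = 30_000 * 30_000     # 30k MAU x 30k tokens/user/month = 900M
per_4090 = 400e6             # Qwen 32B AWQ at 70% utilisation (estimate)

print(f"{sustained_tps(demand):.0f} t/s sustained")   # ~347 t/s
print(f"4090s needed: {demand / per_4090:.2f}")       # ~2.25 -> plan for three
```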
Named scenario: 12-engineer coding assistant team
Sweet spot for 4090. Qwen 2.5 Coder 32B AWQ fits comfortably with FP8 KV at max-num-seqs 4 and prefix caching. See the coding assistant guide.
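As a concrete starting point, here is one way to express that config with vLLM's Python engine — a sketch assuming a recent vLLM; the keyword arguments follow its EngineArgs and the model ID is the published AWQ build on Hugging Face:

```python
from vllm import LLM

# Coding-assistant serving config from the scenario above, on one 4090.
llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
    quantization="awq_marlin",       # INT4 AWQ via Marlin kernels on Ada
    kv_cache_dtype="fp8",            # FP8 KV cache stretches the 24 GB
    max_num_seqs=4,                  # batch ceiling for a 12-engineer team
    enable_prefix_caching=True,      # repeated repo context hits the cache
    gpu_memory_utilization=0.92,
)
```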
Named scenario: startup MVP at <5k users
4090 is overkill if you’re under 100M tokens/month — use a 5060 Ti for 7-8B workloads or stay on hosted APIs. See startup MVP sizing.
Signals to consider sunsetting
The 4090 starts to lose ground when:
- FP4 quality is acceptable and your workload fits in 16 GB — the 5080 wins on £/token at lower TDP. Check vs 5080.
- You need 32 GB or more — 5090 or 6000 Pro takes over. The KV ceiling on 70B AWQ is the most common driver.
- You need NVLink, ECC, or MIG partitioning — datacentre tiers only.
- You are sustaining 70B FP8 production loads — H100 or 6000 Pro territory; the 4090 cannot hold 70B FP8 (~70 GB of weights at one byte per parameter).
- Your sustained monthly token volume exceeds 2B — an H100 fleet is more efficient on $/token at that scale.
- You need power efficiency — the 4090's 450W TDP is power-hungry; the 5080 at 360W runs meaningfully cooler (worked cost comparison below).
See the when-to-upgrade guide and tokens-per-watt analysis.
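The power signal in the final bullet is easy to put in £ terms. A sketch — the 28p/kWh tariff and 70% utilisation are placeholder assumptions, not measurements:

```python
# Monthly electricity cost from TDP. Tariff and utilisation are
# placeholders; substitute your own figures.

def monthly_power_gbp(tdp_watts: float, gbp_per_kwh: float = 0.28,
                      utilisation: float = 0.7) -> float:
    kwh = tdp_watts / 1000 * 24 * 30 * utilisation
    return kwh * gbp_per_kwh

print(f"4090 (450W): £{monthly_power_gbp(450):.0f}/month")  # ~£63
print(f"5080 (360W): £{monthly_power_gbp(360):.0f}/month")  # ~£51
```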
Production gotchas
- No FP4 on Ada. If your serving stack assumes FP4 weights, the 4090 falls back to AWQ INT4 — fine for most cases, but not the same numerical behaviour.
- No NVLink. Multi-4090 deployments rely on PCIe; tensor parallelism scales worse than on H100/A100.
- No ECC. Consumer cards have no ECC memory. For long-running inference this is empirically fine, but compliance reviews will flag it.
- Power and cooling. 450W TDP requires proper colocation cooling and a 12VHPWR cable in good condition. Datacentre-grade hosting is essential — don’t try this in a closet.
- Driver matrix. The FP8 Marlin kernels need CUDA 12.4+ and R550+ drivers. Older stacks silently fall back to slower BF16 paths (preflight sketch after this list).
- Supply variance. New 4090 supply has been intermittent since the 5090 launch. Used market is robust but verify cooler condition.
- Driver branch. NVIDIA's datacentre (production-branch) drivers diverge from the consumer game-ready branch; some hosting providers ship game-ready drivers, which can lag on the CUDA features an inference stack expects.
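For the driver-matrix item, a preflight sketch using NVML (via the nvidia-ml-py package) and PyTorch — the thresholds follow the R550 / CUDA 12.4 note above; adjust them if your serving stack documents different minima:

```python
import pynvml
import torch

# Preflight: confirm the driver and CUDA runtime are new enough for the
# FP8 Marlin path before blaming the model for slow decode.
pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()   # e.g. "550.54.14"
cuda = torch.version.cuda                      # e.g. "12.4"

ok = (int(driver.split(".")[0]) >= 550
      and tuple(int(x) for x in cuda.split(".")) >= (12, 4))
print(f"driver {driver}, CUDA {cuda}: {'OK' if ok else 'expect BF16 fallback'}")
pynvml.nvmlShutdown()
```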
Verdict
In 2026 the RTX 4090 24GB occupies the sweet spot of NVIDIA’s lineup: cheaper than the 5090, more capable than the 5080, with the same FP8 path that ships in datacentre Hopper. It is the cheapest single card that hosts Llama 3.1 70B INT4 and Qwen 2.5 32B AWQ. It will remain the best price-per-token consumer card for FP8-native workloads through at least 2027, when GDDR7-based mid-tier alternatives (a hypothetical 5080 Super 24GB or 6080) might displace it. Until then, the 4090 is the default consumer pick for serious inference.
The 2026 value pick for FP8 inference
24 GB, native FP8, mature toolchain, fits Llama 70B INT4 single-card. UK dedicated hosting.
Order the RTX 4090 24GB. See also: spec breakdown, vs RTX 3090, vs RTX 5080, vs RTX 5090, vs H100, vs A100, 4090 or 3090 decision, 4090 or 5090 decision, when to upgrade, monthly hosting cost.