The RTX 4090 24GB is the longest-running sweet spot in NVIDIA’s consumer-derived inference stack. It still beats most cards on cost-per-token for 8B-class FP8 chat, runs Llama 3.1 70B at AWQ INT4 cleanly, and pulls ~198 t/s of Llama 8B FP8 decode out of 1 TB/s of GDDR6X. But every successful deployment eventually outgrows its silicon, and a thoughtful upgrade decision is worth more than another month of running the queue hot. This guide lists the concrete symptoms that mean you have hit the ceiling on a dedicated 4090, the right upgrade target for each, the cost delta versus the capability delta, and the workload tweaks worth trying before you write the cheque. Targets all live in the wider UK GPU range.
Contents
- Six symptoms you have outgrown the 4090
- Upgrade options at a glance
- Path A: jump to the RTX 5090 32GB
- Path B: workstation-class RTX 6000 Pro 96GB
- Path C: add a second 4090
- Path D: H100 80GB territory
- Cheaper things to try first
- Payback timelines and the verdict
Six symptoms you have outgrown the 4090
Upgrade signals are concrete – they show up in the metrics, not the gut. If you are not seeing at least one of the following, the 4090 is still the right card.
1. OOM on the model you actually want to run
Llama 70B FP8 needs ~38GB of weights plus KV. Qwen 2.5 32B at FP16 needs ~65GB. Mixtral 8x22B AWQ INT4 needs ~70GB. None fit on 24GB at production-grade quantisation. AWQ INT4 buys you Llama 70B at ~22 t/s and is documented in the 70B INT4 deployment guide, but the moment your evals demand FP8 quality, 24GB is the wall.
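The fit test above is back-of-envelope arithmetic; a minimal Python sketch (raw weight footprint is just parameters × bits per weight, with KV cache and framework overhead on top):

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Raw weight footprint in GB: params x bits / 8.
    Ignores KV cache, activations and framework overhead - budget
    several extra GB on top before deciding whether a model fits."""
    return params_billions * bits_per_weight / 8

# The figures quoted above fall out directly:
print(weight_gb(32, 16))   # Qwen 2.5 32B FP16  -> 64.0 GB (~65GB with overhead)
print(weight_gb(141, 4))   # Mixtral 8x22B INT4 -> 70.5 GB (~70GB quoted)
```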
2. Concurrency saturation
Aggregate throughput on Llama 8B FP8 plateaus around 1,100 t/s at batch 32 on a single 4090. If your traffic regularly pushes past that ceiling and TTFT starts climbing past your SLA, the card is at the edge of its bandwidth. The concurrent users post walks through the saturation curve in detail.
3. KV cache thrashing on long context
128k-context requests against an 8B model can consume 18GB of KV alone in FP16. Paged-attention warnings, falling effective batch size, and rising TTFT all point at memory pressure rather than compute.
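The KV figure can be sanity-checked from the model config; a sketch assuming Llama 3.1 8B's published shape (32 layers, 8 GQA KV heads, head dim 128):

```python
def kv_gb(layers: int, kv_heads: int, head_dim: int,
          seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: two tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# One 128k-token request against Llama 3.1 8B:
print(kv_gb(32, 8, 128, 131072))      # ~17.2 GB in FP16 KV
print(kv_gb(32, 8, 128, 131072, 1))   # ~8.6 GB with FP8 KV cache
```

The ~17GB raw figure lands at roughly the 18GB quoted above once allocator overhead is included, and halving the KV dtype halves it again.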
4. Image gen and LLM both at peak simultaneously
Sharing a 4090 between vLLM and SDXL works at low load but breaks under burst. The two workloads contend for VRAM and SM time and you see periodic stalls in both pipelines.
5. Fine-tuning 70B with reasonable batch size
QLoRA on 70B works on 24GB but at sequence length 512 and micro-batch 1. Anything bigger needs more VRAM or HBM bandwidth – covered in the fine-tune throughput and best fine-tuning GPU guides.
6. Adding more 4090s loses per-pound to a single bigger card
Two 4090s give 48GB at ~£1,150/month but tensor-parallel scaling caps at ~1.6x. A single RTX 6000 Pro at £2,200/month gives 96GB without any coordination tax. Once you need a third or fourth 4090, the workstation card wins on TCO.
Upgrade options at a glance
| Target | VRAM | Approx £/mo | Cost delta | What it solves |
|---|---|---|---|---|
| RTX 5090 32GB | 32GB GDDR7 | £900 | +57% | Concurrency, TTFT, FP4, marginal VRAM |
| RTX 6000 Pro 96GB | 96GB GDDR7 ECC | £2,200 | +283% | 70B FP8 native, 180B AWQ, ECC, dense rack form |
| 2x RTX 4090 24GB | 48GB combined | £1,150 | +100% | Throughput doubling (replica), 70B FP8 via TP=2 |
| H100 80GB | 80GB HBM3 | £2,500-3,500 | +335% | FP8 throughput crown, NVLink, MIG, training |
| A100 80GB | 80GB HBM2e | £1,800 | +213% | Big VRAM cheaper than H100, no native FP8 |
Path A: jump to the RTX 5090 32GB
The 5090 is the natural successor for throughput-bound deployments where the model already fits. Blackwell GB202 brings 21,760 CUDA cores, 1,792 GB/s of GDDR7 bandwidth, and native FP4 tensor cores. Decode is bandwidth-bound on transformer LLMs, so the +78% bandwidth translates directly into +41% on Llama 8B FP8 batch 1 and +55% on aggregate batch 32.
| Workload | 4090 t/s | 5090 t/s | Uplift |
|---|---|---|---|
| Llama 3.1 8B FP8 batch 1 | 198 | 280 | +41% |
| Llama 3.1 8B FP8 aggregate batch 32 | 1,100 | 1,700 | +55% |
| Llama 3.1 70B AWQ INT4 batch 1 | 22 | 36 | +64% |
| Qwen 2.5 14B FP8 batch 1 | 120 | 175 | +46% |
| SDXL 1024×1024 30 steps | 3.4s | 2.1s | +62% |
| Flux.1 Dev 1024×1024 | 14s | 8.5s | +65% |
The 32GB ceiling lets you run Llama 70B AWQ comfortably with 32k context, Qwen 32B FP8 cleanly, and Mixtral 8x7B AWQ with full KV. Cost-per-token is roughly flat – you pay 57% more per month for ~50% more throughput – but the indirect wins (lower TTFT, FP4 readiness, headroom for the next model) usually justify the move. Decision logic is in the 4090-or-5090 decision post and the spec comparison.
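Because decode is bandwidth-bound, the memory-bandwidth ratio sets a hard ceiling on the achievable uplift; a one-line sanity check:

```python
def max_uplift(bw_new_gbps: float, bw_old_gbps: float) -> float:
    """Bandwidth-bound ceiling on decode speedup: memory traffic per token
    is fixed by the weight bytes, so t/s scales at most with bandwidth."""
    return bw_new_gbps / bw_old_gbps - 1

print(f"{max_uplift(1792, 1008):.0%}")  # ~78% ceiling, 4090 -> 5090
```

The measured uplifts in the table (+41% to +65%) sit below that 78% ceiling, as expected once kernel launch overhead and compute-bound prefill are factored in.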
Path B: workstation-class RTX 6000 Pro 96GB
This is the right move when the symptom is “the model does not fit”. 96GB of GDDR7 ECC and a 300W TDP sit in a 2-slot blower form factor that drops cleanly into dense racks. Bandwidth is ~1,400 GB/s – between the 4090 and the 5090 – so per-token throughput on 8B chat is similar to a 4090 but everything in the 70B-180B band suddenly becomes available.
| Model | 4090 24GB | 6000 Pro 96GB |
|---|---|---|
| Llama 3.1 70B FP8 | OOM (38GB needed) | Fits with full FP16 KV |
| Llama 3.1 405B AWQ INT4 | OOM | ~200GB – still OOM; needs the NVLink pair plus offload |
| Mixtral 8x22B AWQ | OOM | Fits cleanly |
| Qwen 2.5 72B FP8 | OOM | Fits |
| Falcon 180B AWQ INT4 | OOM | ~95GB – tight but fits |
| Long-context 128k Llama 70B | Cannot | Comfortable with paged KV |
The cost delta is real – £2,200/month versus £575 – but on a £/GB-VRAM basis it is actually cheaper than aggregating four 4090s in one chassis, with the bonus of ECC and an optional NVLink pair to scale to 192GB. The full case is laid out in the 6000 Pro upgrade post and the vs 6000 Pro deep-dive.
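The £/GB-VRAM claim is easy to verify from the monthly prices quoted above; a sketch assuming the single-4090 price scales linearly to four cards:

```python
# Monthly price and usable VRAM per option (figures from the text above)
options = {
    "RTX 6000 Pro 96GB": (2200, 96),
    "4x RTX 4090 24GB":  (4 * 575, 96),  # assumes linear per-card pricing
}
for name, (gbp_month, vram_gb) in options.items():
    print(f"{name}: £{gbp_month / vram_gb:.2f} per GB-VRAM per month")
```

£22.92/GB versus £23.96/GB is a narrow win on paper; ECC, one chassis, and no tensor-parallel tax are the wider margin.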
Path C: add a second 4090
The cheapest way to double aggregate throughput if your model already fits on one card. Two 4090s with vLLM tensor parallelism share Llama 70B FP8 (35GB across two cards) at ~38-42 t/s decode – close to a 6000 Pro for half the price. Replica-mode scaling on small models hits ~1.97x. Tensor-parallel scaling on 70B caps near 1.6-1.74x because the 4090 has no NVLink and all-reduce travels via PCIe Gen4 at ~32 GB/s peer.
Topology matters: both cards on the same PCIe root complex (look for PIX or PXB in nvidia-smi topo -m) gives the best comms; cards on different sockets pay another ~30% in NCCL latency. Full configuration in the multi-card pairing guide.
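A minimal launch sketch for the TP=2 setup, assuming vLLM and an FP8 70B checkpoint (the model name is illustrative – substitute whichever FP8 quant your evals approved):

```shell
# Check the PCIe topology before committing to TP=2
nvidia-smi topo -m

# Shard the model across both 4090s with vLLM tensor parallelism
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.94
```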
Path D: H100 80GB territory
An H100 is overkill for any workload a 4090 already serves, and roughly 5x the cost. It earns its keep when you need 70B FP8 at >55 t/s decode, NVLink-bridged training across multiple cards, MIG partitioning into seven isolated tenants, or aggregate throughput past 5,000 t/s on small-model inference. UK dedicated H100 lands around £2,500-3,500/month; cloud H100 PCIe at Lambda or RunPod runs $2.49-2.99/hour. The detailed comparison sits in the vs cloud H100 post and the spec deep-dive.
Cheaper things to try first
Before spending another £400-2,000/month, run through this checklist. Most teams recover 20-40% headroom without changing hardware.
- FP8 KV cache. Set --kv-cache-dtype fp8 in vLLM. Halves KV memory at <0.5% quality loss on most models.
- AWQ or GPTQ quantisation. Drop Llama 70B from 38GB FP8 to 21GB AWQ INT4 – it fits on a single 4090. See the AWQ guide.
- Reduce --max-model-len. A 128k context cap reserves VRAM even when most requests are 4k. Set the cap to your real p99.
- Increase --gpu-memory-utilization. The default 0.90 leaves ~2.4GB on the table; 0.94 is safe on a 4090.
- Speculative decoding. A 1B draft model can lift 70B throughput 1.5-2x on chat workloads.
- Prefix caching. RAG and agent workloads with stable system prompts see 30-60% TTFT reduction.
- Move embeddings off the LLM card. A £75/month 5060 Ti can host BGE-M3 – covered in the hybrid pairing guide.
If you have done all of the above and still see saturation, the upgrade signal is real.
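The vLLM items from the checklist combine into a single launch command; a configuration sketch (model name and context cap are illustrative – set --max-model-len from your measured p99):

```shell
# Single-4090 vLLM launch with the headroom tweaks applied
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching
```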
Production gotchas
- vLLM version pinning. Blackwell support landed in vLLM 0.6.4 with sm_120 kernels. Earlier versions silently fall back to slow paths on the 5090.
- CUDA toolkit drift. The 6000 Pro and 5090 both want CUDA 12.8+. Container images built against 12.4 will boot but run 30% slower on Blackwell.
- Power budget on dual-4090 hosts. 2x 450W plus host overhead pushes a 1200W PSU close to 90% load. Insist on 1600W with quality 12V rails.
- NCCL topology surprises. Some hosts route GPU-to-GPU traffic through the PCIe host bridge (PHB) rather than keeping it under one PCIe switch (PIX/PXB). The difference is ~25 GB/s vs ~32 GB/s peer – measurable on TP=2 70B.
- FP4 weight regeneration. If you upgrade to a 5090 to use FP4, you need to re-quantise. Existing FP8 weights still work but you leave performance on the table.
- NVLink-pair availability. The 6000 Pro NVLink option requires both cards in the same chassis with the bridge SKU. Verify with the host before ordering.
- MIG and the consumer line. No 4090, 5090 or 6000 Pro supports MIG. If you need hard tenant isolation on one card, only H100 / A100 will do.
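To measure the real peer link rather than trusting the topology matrix, NVIDIA's nccl-tests can be run across both cards (a sketch; assumes the CUDA toolkit and NCCL are already installed on the host):

```shell
# Build NVIDIA's NCCL benchmarks and run all-reduce across 2 GPUs
git clone https://github.com/NVIDIA/nccl-tests && cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
# Bus bandwidth well below ~25 GB/s suggests traffic is crossing a host bridge
```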
Payback timelines and the verdict
The right upgrade depends on what you currently waste. If you turn away paying customers, the upgrade pays for itself the moment you can serve them. If you are over-provisioned, no upgrade has positive ROI.
| Scenario | Best target | Payback signal |
|---|---|---|
| Throughput plateau, model fits | 5090 32GB or 2x 4090 (replica) | SLA breaches under p95 traffic |
| Need 70B FP8 quality | 2x 4090 TP=2, or 6000 Pro | Eval scores below threshold on AWQ |
| Need 180B/405B-class | 6000 Pro 96GB (or NVLink pair) | Customer demand for frontier-class |
| Sub-200ms TTFT at high concurrency | H100 80GB | UX latency budget breached |
| Long-context 128k production | 6000 Pro or H100 | OOM on KV at p95 context length |
| FP4 weight format experiments | 5090 (Blackwell native FP4) | Research roadmap explicitly needs FP4 |
| Multi-tenant isolation required | H100 with MIG | Compliance / SLA isolation contracts |
Verdict. The 4090 stays the right card for cost-per-token-bound 8B-to-14B FP8 inference and for 70B AWQ INT4 deployments. The 5090 is the throughput refresh when latency or VRAM pressure shows up. The 6000 Pro is the model-fit upgrade when 70B FP8 or 180B AWQ becomes table stakes. The H100 is reserved for training, very high concurrency, NVLink-bridged 70B+, or hard tenant isolation. And the cheapest first move is almost always the configuration tweaks in the alternatives section – quantisation, FP8 KV, prefix caching – which buy 20-40% headroom for the cost of a deploy.
Stay on the workhorse, or scale up cleanly
RTX 4090 24GB, 5090 32GB, RTX 6000 Pro 96GB and H100 80GB all available on UK dedicated hosting.
Order the RTX 4090 24GB.

See also: upgrade to 5090, upgrade to 6000 Pro, multi-card pairing, vs cloud H100, tier positioning 2026, ROI analysis, monthly hosting cost.