The RTX 4090 24GB is the longest-running sweet spot in NVIDIA’s consumer-derived inference stack. It still beats most cards on cost-per-token for 8B-class FP8 chat, runs Llama 3.1 70B at AWQ INT4 cleanly, and pulls ~198 t/s of Llama 8B FP8 decode out of 1 TB/s of GDDR6X. But every successful deployment eventually outgrows its silicon, and a thoughtful upgrade decision is worth more than another month of running the queue hot. This guide lists the concrete symptoms that mean you have hit the ceiling on a dedicated 4090, the right upgrade target for each, the cost delta versus the capability delta, and the workload tweaks worth trying before you write the cheque. Targets all live in the wider UK GPU range.
Contents
- Six symptoms you have outgrown the 4090
- Upgrade options at a glance
- Path A: jump to the RTX 5090 32GB
- Path B: workstation-class RTX 6000 Pro 96GB
- Path C: add a second 4090
- Path D: H100 80GB territory
- Cheaper things to try first
- Payback timelines and the verdict
Six symptoms you have outgrown the 4090
Upgrade signals are concrete – they show up in the metrics, not the gut. If you are not seeing at least one of the following, the 4090 is still the right card.
1. OOM on the model you actually want to run
Llama 70B FP8 needs ~38GB of weights plus KV. Qwen 2.5 32B at FP16 needs ~65GB. Mixtral 8x22B AWQ INT4 needs ~70GB. None fit on 24GB at production-grade quantisation. AWQ INT4 buys you Llama 70B at ~22 t/s and is documented in the 70B INT4 deployment guide, but the moment your evals demand FP8 quality, 24GB is the wall.
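The fit test above is back-of-envelope arithmetic; a minimal Python sketch (raw weight footprint is just parameters × bits per weight, with KV cache and framework overhead on top):

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Raw weight footprint in GB: params x bits / 8.
    Ignores KV cache, activations and framework overhead - budget
    several extra GB on top before deciding whether a model fits."""
    return params_billions * bits_per_weight / 8

# The figures quoted above fall out directly:
print(weight_gb(32, 16))   # Qwen 2.5 32B FP16  -> 64.0 GB (~65GB with overhead)
print(weight_gb(141, 4))   # Mixtral 8x22B INT4 -> 70.5 GB (~70GB quoted)
```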
2. Concurrency saturation
Aggregate throughput on Llama 8B FP8 plateaus around 1,100 t/s at batch 32 on a single 4090. If your traffic regularly pushes past that ceiling and TTFT starts climbing past your SLA, the card is at the edge of its bandwidth. The concurrent users post walks through the saturation curve in detail.
3. KV cache thrashing on long context
128k-context requests against an 8B model can consume 18GB of KV alone in FP16. Paged-attention warnings, falling effective batch size, and rising TTFT all point at memory pressure rather than compute.
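The KV figure can be sanity-checked from the model config; a sketch assuming Llama 3.1 8B's published shape (32 layers, 8 GQA KV heads, head dim 128):

```python
def kv_gb(layers: int, kv_heads: int, head_dim: int,
          seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: two tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# One 128k-token request against Llama 3.1 8B:
print(kv_gb(32, 8, 128, 131072))      # ~17.2 GB in FP16 KV
print(kv_gb(32, 8, 128, 131072, 1))   # ~8.6 GB with FP8 KV cache
```

The ~17GB raw figure lands at roughly the 18GB quoted above once allocator overhead is included, and halving the KV dtype halves it again.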
4. Image gen and LLM both at peak simultaneously
Sharing a 4090 between vLLM and SDXL works at low load but breaks under burst. The two workloads contend for VRAM and SM time and you see periodic stalls in both pipelines.
5. Fine-tuning 70B with reasonable batch size
QLoRA on 70B works on 24GB but at sequence length 512 and micro-batch 1. Anything bigger needs more VRAM or HBM bandwidth – covered in the fine-tune throughput and best fine-tuning GPU guides.
6. Adding more 4090s loses per-pound to a single bigger card
Two 4090s give 48GB at ~£1,150/month but tensor-parallel scaling caps at ~1.6x. A single RTX 6000 Pro at £2,200/month gives 96GB without any coordination tax. Once you need a third or fourth 4090, the workstation card wins on TCO.
Upgrade options at a glance
| Target | VRAM | Approx £/mo | Cost delta | What it solves |
|---|---|---|---|---|
| RTX 5090 32GB | 32GB GDDR7 | £900 | +57% | Concurrency, TTFT, FP4, marginal VRAM |
| RTX 6000 Pro 96GB | 96GB GDDR7 ECC | £2,200 | +283% | 70B FP8 native, 180B AWQ, ECC, dense rack form |
| 2x RTX 4090 24GB | 48GB combined | £1,150 | +100% | Throughput doubling (replica), 70B FP8 via TP=2 |
| H100 80GB | 80GB HBM3 | £2,500-3,500 | +335% | FP8 throughput crown, NVLink, MIG, training |
| A100 80GB | 80GB HBM2e | £1,800 | +213% | Big VRAM cheaper than H100, no native FP8 |
Path A: jump to the RTX 5090 32GB
The 5090 is the natural successor for throughput-bound deployments where the model already fits. Blackwell GB202 brings 21,760 CUDA cores, 1,792 GB/s of GDDR7 bandwidth, and native FP4 tensor cores. Decode is bandwidth-bound on transformer LLMs, so the +78% bandwidth translates directly into +41% on Llama 8B FP8 batch 1 and +55% on aggregate batch 32.
| Workload | 4090 t/s | 5090 t/s | Uplift |
|---|---|---|---|
| Llama 3.1 8B FP8 batch 1 | 198 | 280 | +41% |
| Llama 3.1 8B FP8 aggregate batch 32 | 1,100 | 1,700 | +55% |
| Llama 3.1 70B AWQ INT4 batch 1 | 22 | 36 | +64% |
| Qwen 2.5 14B FP8 batch 1 | 120 | 175 | +46% |
| SDXL 1024×1024 30 steps | 3.4s | 2.1s | +62% |
| Flux.1 Dev 1024×1024 | 14s | 8.5s | +65% |
The 32GB ceiling lets you run Llama 70B AWQ comfortably with 32k context, Qwen 32B FP8 cleanly, and Mixtral 8x7B AWQ with full KV. Cost-per-token is roughly flat – you pay 57% more per month for ~50% more throughput – but the indirect wins (lower TTFT, FP4 readiness, headroom for the next model) usually justify the move. Decision logic is in the 4090-or-5090 decision post and the spec comparison.
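Because decode is bandwidth-bound, the memory-bandwidth ratio sets a hard ceiling on the achievable uplift; a one-line sanity check:

```python
def max_uplift(bw_new_gbps: float, bw_old_gbps: float) -> float:
    """Bandwidth-bound ceiling on decode speedup: memory traffic per token
    is fixed by the weight bytes, so t/s scales at most with bandwidth."""
    return bw_new_gbps / bw_old_gbps - 1

print(f"{max_uplift(1792, 1008):.0%}")  # ~78% ceiling, 4090 -> 5090
```

The measured uplifts in the table (+41% to +65%) sit below that 78% ceiling, as expected once kernel launch overhead and compute-bound prefill are factored in.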
Path B: workstation-class RTX 6000 Pro 96GB
This is the right move when the symptom is “the model does not fit”. 96GB of GDDR7 ECC and a 300W TDP sit in a 2-slot blower form factor that drops cleanly into dense racks. Bandwidth is ~1,400 GB/s – between the 4090 and the 5090 – so per-token throughput on 8B chat is similar to a 4090 but everything in the 70B-180B band suddenly becomes available.
| Model | 4090 24GB | 6000 Pro 96GB |
|---|---|---|
| Llama 3.1 70B FP8 | OOM (38GB needed) | Fits with full FP16 KV |
| Llama 3.1 405B AWQ INT4 | OOM | ~200GB – still OOM; needs the NVLink pair plus offload |
| Mixtral 8x22B AWQ | OOM | Fits cleanly |
| Qwen 2.5 72B FP8 | OOM | Fits |
| Falcon 180B AWQ INT4 | OOM | ~95GB – tight but fits |
| Long-context 128k Llama 70B | Cannot | Comfortable with paged KV |
The cost delta is real – £2,200/month versus £575 – but on a £/GB-VRAM basis it is actually cheaper than aggregating four 4090s in one chassis, with the bonus of ECC and an optional NVLink pair to scale to 192GB. The full case is laid out in the 6000 Pro upgrade post and the vs 6000 Pro deep-dive.
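The £/GB-VRAM claim is easy to verify from the monthly prices quoted above; a sketch assuming the single-4090 price scales linearly to four cards:

```python
# Monthly price and usable VRAM per option (figures from the text above)
options = {
    "RTX 6000 Pro 96GB": (2200, 96),
    "4x RTX 4090 24GB":  (4 * 575, 96),  # assumes linear per-card pricing
}
for name, (gbp_month, vram_gb) in options.items():
    print(f"{name}: £{gbp_month / vram_gb:.2f} per GB-VRAM per month")
```

£22.92/GB versus £23.96/GB is a narrow win on paper; ECC, one chassis, and no tensor-parallel tax are the wider margin.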
Path C: add a second 4090
The cheapest way to double aggregate throughput if your model already fits on one card. Two 4090s with vLLM tensor parallelism share Llama 70B FP8 (35GB across two cards) at ~38-42 t/s decode – close to a 6000 Pro for half the price. Replica-mode scaling on small models hits ~1.97x. Tensor-parallel scaling on 70B caps near 1.6-1.74x because the 4090 has no NVLink and all-reduce travels via PCIe Gen4 at ~32 GB/s peer.
Topology matters: both cards on the same PCIe root complex (look for PIX or PXB in nvidia-smi topo -m) gives the best comms; cards on different sockets pay another ~30% in NCCL latency. Full configuration in the multi-card pairing guide.
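A minimal launch sketch for the TP=2 setup, assuming vLLM and an FP8 70B checkpoint (the model name is illustrative – substitute whichever FP8 quant your evals approved):

```shell
# Check the PCIe topology before committing to TP=2
nvidia-smi topo -m

# Shard the model across both 4090s with vLLM tensor parallelism
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.94
```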
Path D: H100 80GB territory
An H100 is overkill for any workload a 4090 already serves, and roughly 5x the cost. It earns its keep when you need 70B FP8 at >55 t/s decode, NVLink-bridged training across multiple cards, MIG partitioning into seven isolated tenants, or aggregate throughput past 5,000 t/s on small-model inference. UK dedicated H100 lands around £2,500-3,500/month; cloud H100 PCIe at Lambda or RunPod runs $2.49-2.99/hour. The detailed comparison sits in the vs cloud H100 post and the spec deep-dive.
Cheaper things to try first
Before spending another £400-2,000/month, run through this checklist. Most teams recover 20-40% headroom without changing hardware.
- FP8 KV cache. Set --kv-cache-dtype fp8 in vLLM. Halves KV memory at <0.5% quality loss on most models.
- AWQ or GPTQ quantisation. Drop Llama 70B from 38GB FP8 to 21GB AWQ INT4 – it fits on a single 4090. See the AWQ guide.
- Reduce --max-model-len. A 128k context cap reserves VRAM even when most requests are 4k. Set the cap to your real p99.
- Increase --gpu-memory-utilization. The default 0.90 leaves ~2.4GB on the table; 0.94 is safe on a 4090.
- Speculative decoding. A 1B draft model can lift 70B throughput 1.5-2x on chat workloads.
- Prefix caching. RAG and agent workloads with stable system prompts see 30-60% TTFT reduction.
- Move embeddings off the LLM card. A £75/month 5060 Ti can host BGE-M3 – covered in the hybrid pairing guide.
If you have done all of the above and still see saturation, the upgrade signal is real.
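The vLLM items from the checklist combine into a single launch command; a configuration sketch (model name and context cap are illustrative – set --max-model-len from your measured p99):

```shell
# Single-4090 vLLM launch with the headroom tweaks applied
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching
```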
Production gotchas
- vLLM version pinning. Blackwell support landed in vLLM 0.6.4 with sm_120 kernels. Earlier versions silently fall back to slow paths on the 5090.
- CUDA toolkit drift. The 6000 Pro and 5090 both want CUDA 12.8+. Container images built against 12.4 will boot but run 30% slower on Blackwell.
- Power budget on dual-4090 hosts. 2x 450W plus host overhead pushes a 1200W PSU close to 90% load. Insist on 1600W with quality 12V rails.
- NCCL topology surprises. Some hosts route GPU-to-GPU traffic through the PCIe host bridge (PHB) rather than keeping it under one PCIe switch (PIX/PXB). The difference is ~25 GB/s vs ~32 GB/s peer – measurable on TP=2 70B.
- FP4 weight regeneration. If you upgrade to a 5090 to use FP4, you need to re-quantise. Existing FP8 weights still work but you leave performance on the table.
- NVLink-pair availability. The 6000 Pro NVLink option requires both cards in the same chassis with the bridge SKU. Verify with the host before ordering.
- MIG and the consumer line. No 4090, 5090 or 6000 Pro supports MIG. If you need hard tenant isolation on one card, only H100 / A100 will do.
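To measure the real peer link rather than trusting the topology matrix, NVIDIA's nccl-tests can be run across both cards (a sketch; assumes the CUDA toolkit and NCCL are already installed on the host):

```shell
# Build NVIDIA's NCCL benchmarks and run all-reduce across 2 GPUs
git clone https://github.com/NVIDIA/nccl-tests && cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
# Bus bandwidth well below ~25 GB/s suggests traffic is crossing a host bridge
```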
Payback timelines and the verdict
The right upgrade depends on what you currently waste. If you turn away paying customers, the upgrade pays for itself the moment you can serve them. If you are over-provisioned, no upgrade has positive ROI.
| Scenario | Best target | Payback signal |
|---|---|---|
| Throughput plateau, model fits | 5090 32GB or 2x 4090 (replica) | SLA breaches under p95 traffic |
| Need 70B FP8 quality | 2x 4090 TP=2, or 6000 Pro | Eval scores below threshold on AWQ |
| Need 180B/405B-class | 6000 Pro 96GB (or NVLink pair) | Customer demand for frontier-class |
| Sub-200ms TTFT at high concurrency | H100 80GB | UX latency budget breached |
| Long-context 128k production | 6000 Pro or H100 | OOM on KV at p95 context length |
| FP4 weight format experiments | 5090 (Blackwell native FP4) | Research roadmap explicitly needs FP4 |
| Multi-tenant isolation required | H100 with MIG | Compliance / SLA isolation contracts |
Verdict. The 4090 stays the right card for cost-per-token-bound 8B-to-14B FP8 inference and for 70B AWQ INT4 deployments. The 5090 is the throughput refresh when latency or VRAM pressure shows up. The 6000 Pro is the model-fit upgrade when 70B FP8 or 180B AWQ becomes table stakes. The H100 is reserved for training, very high concurrency, NVLink-bridged 70B+, or hard tenant isolation. And the cheapest first move is almost always the configuration tweaks in the alternatives section – quantisation, FP8 KV, prefix caching – which buy 20-40% headroom for the cost of a deploy.
Stay on the workhorse, or scale up cleanly
RTX 4090 24GB, 5090 32GB, RTX 6000 Pro 96GB and H100 80GB all available on UK dedicated hosting.
Order the RTX 4090 24GB.

See also: upgrade to 5090, upgrade to 6000 Pro, multi-card pairing, vs cloud H100, tier positioning 2026, ROI analysis, monthly hosting cost.