
RTX 4090 24GB Cost per GB of VRAM Analysed

Senior infra engineer's analysis of cost per gigabyte of VRAM on the RTX 4090 24GB versus 3090, 5090, A6000, L40S, A100 and H100, with hardware list price, hosted £/GB-month, bandwidth-adjusted index and the workloads that actually fit 24 GB.

Compute matters, but for LLM inference VRAM is the gating resource: a model that does not fit cannot run, and once you exceed your card's envelope the fallbacks (CPU offload, multi-GPU sharding) are dramatically slower or dramatically more expensive. So how cheap is a gigabyte of fast VRAM on the RTX 4090 24GB versus its peers? On UK dedicated GPU hosting the 4090 sits in a remarkable position: best £/GB for native FP8 inference, pricier per raw gigabyte than a used 3090, far cheaper than any HBM card. This piece walks through hardware list price per GB, hosted price per GB-month, the bandwidth-adjusted index that tells you what you are really paying for, and the model-fit table that matters most.


Why VRAM is the binding constraint

If your model plus KV cache exceeds VRAM, your only options are CPU offload (10-30x slower, because every layer round-trips over PCIe Gen 4 at ~26 GB/s instead of reading GDDR6X at 1008 GB/s), tensor parallelism across GPUs (often impractical on Ada without NVLink; see the PCIe piece), or a smaller model. None of those preserve product economics. So the relevant unit cost for inference infrastructure is gigabytes of fast VRAM per pound, weighted by the bandwidth and tensor TFLOPS attached to each gigabyte.

The “fast” qualifier matters. A gigabyte of GDDR6X at 84 GB/s per chip is not the same product as a gigabyte of HBM3 at 800 GB/s per stack. For LLM decode, where every weight is streamed once per token, the bandwidth attached to the gigabyte determines tokens per second. For diffusion, where the U-Net is more compute-bound, the FP8 TFLOPS attached to the gigabyte matters more. Cost-per-GB analysis without these adjustments is a misleading number.
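The decode-side point can be made concrete with one line of arithmetic: single-stream decode streams every weight byte once per token, so memory bandwidth divided by model size gives a hard ceiling on tokens per second. A minimal sketch, using the bandwidth figures quoted in this piece (the ceiling ignores KV-cache reads and kernel overhead, so real throughput lands well below it):

```python
# Bandwidth-bound decode ceiling: every weight byte is streamed once per
# generated token, so bandwidth / model_size bounds tokens/sec.
def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

# Llama 3.1 8B at FP8 is ~8 GB of weights; bandwidths as quoted in this piece.
for card, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1792)]:
    print(f"{card}: ceiling ~{decode_ceiling_tok_s(8, bw):.0f} tok/s")
```

The same 24 GB on a 3090 and a 4090 therefore buys slightly different decode ceilings (117 vs 126 tok/s on this model), which is why the adjusted index later in this piece divides price by bandwidth.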

Hardware list price per GB

Approximate UK street prices, ex-VAT, as of mid-2026. Used market in brackets where relevant.

| GPU | VRAM | Bandwidth | FP8 native | List (GBP, ex-VAT) | £/GB |
|---|---|---|---|---|---|
| RTX 3090 24GB | 24 GB GDDR6X | 936 GB/s | No | £700 (used) | £29 |
| RTX 4090 24GB | 24 GB GDDR6X | 1008 GB/s | Yes | £1,750 | £73 |
| RTX 4090 D 48GB (modded) | 48 GB GDDR6X | 1008 GB/s | Yes | £3,400 (grey) | £71 |
| RTX 5090 32GB | 32 GB GDDR7 | 1792 GB/s | Yes (FP4 too) | £2,400 | £75 |
| RTX 6000 Ada 48GB | 48 GB GDDR6 ECC | 960 GB/s | Yes | £7,200 | £150 |
| L40S 48GB | 48 GB GDDR6 ECC | 864 GB/s | Yes | £8,400 | £175 |
| RTX 6000 Pro 96GB | 96 GB GDDR7 ECC | 1792 GB/s | Yes (FP4 too) | £8,500 | £89 |
| A100 80GB SXM | 80 GB HBM2e | 2039 GB/s | No | £15,500 | £194 |
| H100 SXM 80GB | 80 GB HBM3 | 3350 GB/s | Yes | £28,000 | £350 |

The 4090 lands at £73/GB, sandwiched between the cheap 3090 and the more expensive pro Ada cards. Pure VRAM/£ points at a used 3090, but you give up roughly 70 percent of the tensor throughput, native FP8 entirely, and the 12x larger L2 cache. The 4090 D 48GB is the grey-market mod that fits 48 GB of GDDR6X on a 4090 die; it is the cheapest path to 48 GB of FP8-capable VRAM, but the cards are not officially supported and carry no warranty.

Hosted price per GB-month

Hardware list price omits power, cooling, networking, and depreciation, all of which you have to amortise yourself when buying outright. Hosted price per GB-month is closer to the real economic measure for production workloads where someone else handles the infrastructure.

| GPU | VRAM (GB) | Indicative £/month | £/GB-month | Notes |
|---|---|---|---|---|
| RTX 3090 24GB | 24 | £199 | £8.30 | No FP8, NVLink available |
| RTX 4090 24GB | 24 | £329 | £13.70 | FP8, AV1 NVENC, single best £/GB-FP8 |
| RTX 5090 32GB | 32 | £499 | £15.60 | FP4, GDDR7, 1.78x bandwidth |
| RTX 6000 Ada 48GB | 48 | £899 | £18.70 | ECC, fits 70B FP8 |
| RTX 6000 Pro 96GB | 96 | £1,199 | £12.50 | FP4, fits 70B FP16, best £/GB at the top |
| A100 80GB | 80 | £1,599 | £20.00 | HBM2e, no FP8 |
| H100 80GB | 80 | £2,399 | £30.00 | HBM3, FP8, NVLink |

The 4090 24GB at £13.70/GB-month is the cheapest FP8-capable gigabyte below the 96 GB tier, undercut only by the FP8-less 3090 and by the RTX 6000 Pro 96GB. The Pro's £12.50/GB-month beats the 4090 once you actually need its capacity, because the larger card amortises the chassis and CPU cost over more memory. The pivot point is roughly 70 GB of working set: below that, the 4090 wins on absolute monthly spend; above it, the RTX 6000 Pro wins.

Adjusted for bandwidth, FP8 and TFLOPS

VRAM alone is misleading because the gigabytes are not interchangeable. A bandwidth-adjusted index (£ per GB-month divided by bandwidth in TB/s) tells you what a gigabyte of fast VRAM actually costs.

| GPU | £/GB-month | BW (TB/s) | FP8 dense (TFLOPS) | £/(GB-month × TB/s), lower better | FP8-effective £/GB-month |
|---|---|---|---|---|---|
| RTX 3090 24GB | £8.30 | 0.94 | n/a | £8.83 | n/a (no FP8) |
| RTX 4090 24GB | £13.70 | 1.01 | 660 | £13.56 | £6.85 (FP8 halves model bytes) |
| RTX 5090 32GB | £15.60 | 1.79 | 838 | £8.71 | £7.80 |
| RTX 6000 Pro 96GB | £12.50 | 1.79 | 838 | £6.98 | £6.25 |
| A100 80GB | £20.00 | 2.04 | n/a | £9.80 | n/a |
| H100 80GB | £30.00 | 3.35 | 1979 | £8.96 | £15.00 |

The 4090 looks middling on the raw bandwidth-adjusted index but has one special property: it is the cheapest card with native FP8. FP8 cuts model bytes in half, so effective £/GB drops to £6.85, beating every card in the table except the RTX 6000 Pro 96GB. For a typical production LLM workload running FP8, the 4090 is the lowest-cost-per-FP8-gigabyte option in the consumer tier. For a 200-MAU SaaS RAG that needs to fit a 12B model plus KV cache in 24 GB, the 4090 delivers the gigabyte at less than half the effective price of an H100 (£6.85 vs £15.00).
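Both adjusted columns in the table are reproducible in a couple of lines; the FP8 halving assumption (weights drop from 2 bytes to 1 byte per parameter) is the one used throughout this piece:

```python
def bw_adjusted_index(gbp_per_gb_month: float, bw_tb_s: float) -> float:
    """£ per GB-month divided by bandwidth in TB/s (lower is better)."""
    return gbp_per_gb_month / bw_tb_s

def fp8_effective_gbp(gbp_per_gb_month: float, has_fp8: bool):
    """FP8 halves weight bytes (2 -> 1 byte/param), so each physical GB
    holds twice the model; effective £/GB-month halves on FP8 cards."""
    return gbp_per_gb_month / 2 if has_fp8 else None

# 4090 row from the table above
print(round(bw_adjusted_index(13.70, 1.01), 2))   # bandwidth-adjusted index
print(fp8_effective_gbp(13.70, True))             # FP8-effective £/GB-month
```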

Models that fit on 24 GB

| Model | Format | Weights | KV @ 16k FP8 | Total @ 16k | Headroom |
|---|---|---|---|---|---|
| Llama 3.1 8B | FP16 | 16 GB | 1 GB | 17.5 GB | 6.5 GB (8x batch) |
| Llama 3.1 8B | FP8 | 8 GB | 0.5 GB | 9.0 GB | 15 GB (32x batch) |
| Mistral 7B v0.3 | FP8 | 7.25 GB | 0.4 GB (sliding) | 8.4 GB | 15.6 GB |
| Mistral Nemo 12B | FP8 | 12.2 GB | 0.75 GB | 14.0 GB | 10 GB |
| Mistral Small 3 24B | AWQ INT4 | 13 GB | 1.5 GB | 15.5 GB | 8.5 GB |
| Llama 3.1 70B | AWQ INT4 | 17 GB | 2.6 GB | 21 GB | 3 GB (4x batch) |
| Qwen 2.5 14B | AWQ INT4 | 9 GB | 1.5 GB | 11.5 GB | 12.5 GB |
| Qwen 2.5 32B | AWQ INT4 | 17 GB | 2 GB | 20 GB | 4 GB |
| Mixtral 8x7B | AWQ INT4 | 25 GB | n/a | OOM | — |
| FLUX.1-dev | FP16 | 23 GB | n/a | 23 GB | 1 GB (single image) |

The 24 GB envelope is enough for every flagship open model up to 70B if you accept INT4 weights and FP8 KV. Mixtral 8x7B is the only commonly-deployed model that does not fit; everything else is a question of how aggressive your quantisation can be. For a 12-engineer coding team running Qwen 2.5 14B AWQ on a single 4090, the 12 GB headroom supports 16 concurrent active sessions at 32k context, which is more than the team will sustain even at peak.
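The totals in the fit table are weights plus KV cache plus a little runtime overhead, and you can sanity-check any new model with a back-of-envelope estimator. A sketch, assuming Llama 3.1 8B's architecture shapes (32 layers, 8 KV heads, head dim 128 — check your model's config.json); production servers add paged-attention block rounding, so treat the output as an estimate:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 1) -> float:
    """KV cache size per stream: K and V tensors (hence the 2x) per layer.
    bytes_per_elem=1 for FP8 KV, 2 for FP16 KV."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

def fits(vram_gb: float, weights_gb: float, kv_gb: float,
         overhead_gb: float = 0.7) -> bool:
    """True if weights + KV + runtime overhead fit in the card's VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb

# Llama 3.1 8B, FP8 weights (~8 GB), FP8 KV at 16k context, single stream
kv = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, seq_len=16_384)
print(f"KV ~= {kv:.2f} GB per stream; fits 24 GB: {fits(24, 8.0, kv)}")
```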

Cost-per-GB by real workload

| Workload | Working set | Best card | £/month | Effective £/GB-month-used |
|---|---|---|---|---|
| Llama 3.1 8B FP8 chat | 10 GB | 4090 24GB | £329 | £32.90 |
| Mistral Nemo 12B FP8 RAG | 14 GB | 4090 24GB | £329 | £23.50 |
| Llama 3.1 70B AWQ chat | 21 GB | 4090 24GB | £329 | £15.66 |
| Llama 3.1 70B FP8 (full quality) | 76 GB | RTX 6000 Pro 96GB | £1,199 | £15.78 |
| Mixtral 8x7B AWQ | 25 GB | 5090 32GB | £499 | £19.96 |
| FLUX.1-dev FP16 | 23 GB | 4090 24GB | £329 | £14.30 |

The 4090 wins on £/GB-of-actual-working-set for every workload that fits the 24 GB envelope. The 70B AWQ figure of £15.66/GB-used is particularly striking: it is within 1 percent of the £15.78 the same model costs at full FP8 quality on the RTX 6000 Pro 96GB. The choice between them collapses to whether you can tolerate the roughly 1.5-point MMLU drop of AWQ INT4 relative to FP8.
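"Effective £/GB-month-used" is simply sticker price divided by the gigabytes the workload actually occupies; a trivial helper, for completeness:

```python
def effective_gbp_per_gb_used(monthly_gbp: float, working_set_gb: float) -> float:
    """£ per GB-month of VRAM the workload actually occupies,
    not per GB the card ships with."""
    return monthly_gbp / working_set_gb

print(round(effective_gbp_per_gb_used(329, 10), 2))   # 8B FP8 chat on a 4090 -> 32.9
print(round(effective_gbp_per_gb_used(1199, 76), 2))  # 70B FP8 on 6000 Pro  -> 15.78
```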

Production gotchas

  1. Headroom for batched serving is not optional. A model that fits at 23/24 GB single-stream cannot serve batched traffic. Plan for at least 3-4 GB of free VRAM for KV cache growth at concurrent users.
  2. FP8 KV calibration is checkpoint-dependent. Some FP8 weight checkpoints behave badly with E5M2 KV at long context. Validate before committing your sizing.
  3. Driver overhead is non-zero. Expect 0.5-0.8 GB of CUDA + driver overhead on a fresh 4090 before any model loads. Account for it.
  4. Two models on one card compete for VRAM. Running Whisper alongside Llama 3.1 8B on the same 4090 is feasible (Whisper Turbo INT8 is ~2 GB) but tight. Plan VRAM as an exclusive resource per workload class.
  5. The 4090 D 48GB grey-market mod looks tempting. It is the cheapest 48 GB card on the market but lacks warranty, fails some firmware checks, and may not pass driver updates beyond CUDA 12.x. Use only with eyes open.
  6. Hosted £/month is sticker, not all-in. Egress, NVMe storage, and bandwidth overage can add 10-20 percent. Read the contract.
  7. VRAM market price is volatile. The 3090 used market spikes when consumer 4090 supply tightens. Hosted £/GB-month is more stable than hardware list.
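Gotchas 1 and 3 combine into a simple capacity rule of thumb: budget weights, per-stream KV and driver overhead against the card's VRAM, and stop adding users before headroom drops below the 3-4 GB margin. A sketch under those assumptions (0.8 GB overhead and a 3 GB floor are the figures above; per-user KV from the fit table):

```python
def headroom_gb(vram_gb: float, weights_gb: float, kv_per_user_gb: float,
                users: int, driver_overhead_gb: float = 0.8) -> float:
    """Free VRAM after weights, per-user KV cache and CUDA/driver overhead."""
    return vram_gb - weights_gb - driver_overhead_gb - kv_per_user_gb * users

def max_users(vram_gb: float, weights_gb: float, kv_per_user_gb: float,
              floor_gb: float = 3.0) -> int:
    """Largest user count that keeps headroom at or above the safety floor."""
    users = 0
    while headroom_gb(vram_gb, weights_gb, kv_per_user_gb, users + 1) >= floor_gb:
        users += 1
    return users

# Mistral Nemo 12B FP8 (12.2 GB weights, 0.75 GB KV per 16k stream) on a 4090
print(max_users(24, 12.2, 0.75))
```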

Verdict and when 24 GB is the right answer

Pick the 4090 24GB on £/GB grounds when: your model fits 24 GB at FP8 or AWQ INT4 (everything up to Llama 70B INT4, Mistral Small 3 24B INT4, Qwen 2.5 32B AWQ, FLUX.1-dev); you need native FP8 (the cheapest card on the market for it); you serve 1-32 concurrent users at the per-stream rates the 4090 delivers; you want UK-hosted dedicated metal at a known monthly cost rather than per-second cloud billing surprises.

Skip the 4090 if your working set exceeds 23 GB even at INT4 (move to RTX 6000 Pro 96GB which has the best £/GB at the top end), if you need ECC for compliance reasons (RTX 6000 Ada or L40S), or if your batch is consistently above 64 with very long context (H100 HBM3 pulls decisively ahead at that scale). For the migration paths see the 3090 comparison, 5090 comparison, H100 comparison, and 4090 or 3090 decision.


See also: monthly hosting cost, 4090 vs 3090, 4090 vs 5090, 4090 or 3090?, 4090 or 5090?, 70B monthly cost, tier positioning 2026.
