
RTX 4090 24GB Cost per GB of VRAM Analysed

Senior infra engineer's analysis of cost per gigabyte of VRAM on the RTX 4090 24GB versus 3090, 5090, A6000, L40S, A100 and H100, with hardware list price, hosted £/GB-month, bandwidth-adjusted index and the workloads that actually fit 24 GB.

Compute matters, but for LLM inference VRAM is the gating resource: a model that does not fit cannot run, and once you exceed your card's envelope the fallbacks (CPU offload, multi-GPU sharding) are dramatically slower or dramatically more expensive. So how cheap is a gigabyte of fast VRAM on the RTX 4090 24GB versus its peers? On UK dedicated GPU hosting the 4090 sits in a remarkable position: best £/GB for native FP8 inference, pricier per raw gigabyte than a used 3090, far cheaper than any HBM card. This piece walks through hardware list price per GB, hosted price per GB-month, the bandwidth-adjusted index that tells you what you are really paying for, and the model-fit table that matters most.


Why VRAM is the binding constraint

If your model plus KV cache exceeds VRAM, your only options are CPU offload (10-30x slower, because every layer round-trips over PCIe Gen 4 at ~26 GB/s instead of reading GDDR6X at 1008 GB/s), tensor parallelism across GPUs (often impractical on Ada without NVLink; see the PCIe piece), or a smaller model. None of those preserve product economics. So the relevant unit cost for inference infrastructure is gigabytes of fast VRAM per pound, weighted by the bandwidth and tensor TFLOPS attached to each gigabyte.

The “fast” qualifier matters. A gigabyte of GDDR6X at 84 GB/s per chip is not the same product as a gigabyte of HBM3 at 800 GB/s per stack. For LLM decode, where every weight is streamed once per token, the bandwidth attached to the gigabyte determines tokens per second. For diffusion, where the U-Net is more compute-bound, the FP8 TFLOPS attached to the gigabyte matters more. Cost-per-GB analysis without these adjustments is a misleading number.
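The decode-side point can be made concrete with one line of arithmetic: single-stream decode streams every weight byte once per token, so memory bandwidth divided by model size gives a hard ceiling on tokens per second. A minimal sketch, using the bandwidth figures quoted in this piece (the ceiling ignores KV-cache reads and kernel overhead, so real throughput lands well below it):

```python
# Bandwidth-bound decode ceiling: every weight byte is streamed once per
# generated token, so bandwidth / model_size bounds tokens/sec.
def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

# Llama 3.1 8B at FP8 is ~8 GB of weights; bandwidths as quoted in this piece.
for card, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1792)]:
    print(f"{card}: ceiling ~{decode_ceiling_tok_s(8, bw):.0f} tok/s")
```

The same 24 GB on a 3090 and a 4090 therefore buys slightly different decode ceilings (117 vs 126 tok/s on this model), which is why the adjusted index later in this piece divides price by bandwidth.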

Hardware list price per GB

Approximate UK street prices, ex-VAT, as of mid-2026. Used market in brackets where relevant.

| GPU | VRAM | Bandwidth | FP8 native | List (GBP, ex-VAT) | £/GB |
|---|---|---|---|---|---|
| RTX 3090 24GB | 24 GB GDDR6X | 936 GB/s | No | £700 (used) | £29 |
| RTX 4090 24GB | 24 GB GDDR6X | 1008 GB/s | Yes | £1,750 | £73 |
| RTX 4090 D 48GB (modded) | 48 GB GDDR6X | 1008 GB/s | Yes | £3,400 (grey) | £71 |
| RTX 5090 32GB | 32 GB GDDR7 | 1792 GB/s | Yes (FP4 too) | £2,400 | £75 |
| RTX 6000 Ada 48GB | 48 GB GDDR6 ECC | 960 GB/s | Yes | £7,200 | £150 |
| L40S 48GB | 48 GB GDDR6 ECC | 864 GB/s | Yes | £8,400 | £175 |
| RTX 6000 Pro 96GB | 96 GB GDDR7 ECC | 1792 GB/s | Yes (FP4 too) | £8,500 | £89 |
| A100 80GB SXM | 80 GB HBM2e | 2039 GB/s | No | £15,500 | £194 |
| H100 SXM 80GB | 80 GB HBM3 | 3350 GB/s | Yes | £28,000 | £350 |

The 4090 lands at £73/GB, sandwiched between the cheap 3090 and the more expensive pro Ada cards. Pure VRAM/£ points at a used 3090, but you give up roughly 70 percent of the tensor throughput, native FP8 entirely, and the 12x larger L2 cache. The 4090 D 48GB is the grey-market mod that fits 48 GB of GDDR6X on a 4090 die; it is the cheapest path to 48 GB of FP8-capable VRAM, but the cards are not officially supported and carry no warranty.

Hosted price per GB-month

Hardware list price omits power, cooling, networking, and depreciation, all of which you have to amortise yourself when buying outright. Hosted price per GB-month is closer to the real economic measure for production workloads where someone else handles the infrastructure.

| GPU | VRAM (GB) | Indicative £/month | £/GB-month | Notes |
|---|---|---|---|---|
| RTX 3090 24GB | 24 | £199 | £8.30 | No FP8, NVLink available |
| RTX 4090 24GB | 24 | £329 | £13.70 | FP8, AV1 NVENC, single best £/GB-FP8 |
| RTX 5090 32GB | 32 | £499 | £15.60 | FP4, GDDR7, 1.78x bandwidth |
| RTX 6000 Ada 48GB | 48 | £899 | £18.70 | ECC, fits 70B FP8 |
| RTX 6000 Pro 96GB | 96 | £1,199 | £12.50 | FP4, fits 70B FP16, best £/GB at the top |
| A100 80GB | 80 | £1,599 | £20.00 | HBM2e, no FP8 |
| H100 80GB | 80 | £2,399 | £30.00 | HBM3, FP8, NVLink |

The 4090 24GB at £13.70/GB-month is the cheapest FP8-capable gigabyte below the 96 GB tier, undercut only by the FP8-less 3090 and by the RTX 6000 Pro 96GB. The Pro's £12.50/GB-month beats the 4090 once you actually need its capacity, because the larger card amortises the chassis and CPU cost over more memory. The pivot point is roughly 70 GB of working set: below that, the 4090 wins on absolute monthly spend; above it, the RTX 6000 Pro wins.

Adjusted for bandwidth, FP8 and TFLOPS

VRAM alone is misleading because the gigabytes are not interchangeable. A bandwidth-adjusted index (£ per GB-month divided by bandwidth in TB/s) tells you what a gigabyte of fast VRAM actually costs.

| GPU | £/GB-month | BW (TB/s) | FP8 dense (TFLOPS) | £/(GB-month × TB/s), lower better | FP8-effective £/GB-month |
|---|---|---|---|---|---|
| RTX 3090 24GB | £8.30 | 0.94 | n/a | £8.83 | n/a (no FP8) |
| RTX 4090 24GB | £13.70 | 1.01 | 660 | £13.56 | £6.85 (FP8 halves model bytes) |
| RTX 5090 32GB | £15.60 | 1.79 | 838 | £8.71 | £7.80 |
| RTX 6000 Pro 96GB | £12.50 | 1.79 | 838 | £6.98 | £6.25 |
| A100 80GB | £20.00 | 2.04 | n/a | £9.80 | n/a |
| H100 80GB | £30.00 | 3.35 | 1979 | £8.96 | £15.00 |

The 4090 looks middling on the raw bandwidth-adjusted index but has one special property: it is the cheapest card with native FP8. FP8 cuts model bytes in half, so effective £/GB drops to £6.85, beating every card in the table except the RTX 6000 Pro 96GB. For a typical production LLM workload running FP8, the 4090 is the lowest-cost-per-FP8-gigabyte option in the consumer tier. For a 200-MAU SaaS RAG that needs to fit a 12B model plus KV cache in 24 GB, the 4090 delivers the gigabyte at less than half the effective price of an H100 (£6.85 vs £15.00).
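Both adjusted columns in the table are reproducible in a couple of lines; the FP8 halving assumption (weights drop from 2 bytes to 1 byte per parameter) is the one used throughout this piece:

```python
def bw_adjusted_index(gbp_per_gb_month: float, bw_tb_s: float) -> float:
    """£ per GB-month divided by bandwidth in TB/s (lower is better)."""
    return gbp_per_gb_month / bw_tb_s

def fp8_effective_gbp(gbp_per_gb_month: float, has_fp8: bool):
    """FP8 halves weight bytes (2 -> 1 byte/param), so each physical GB
    holds twice the model; effective £/GB-month halves on FP8 cards."""
    return gbp_per_gb_month / 2 if has_fp8 else None

# 4090 row from the table above
print(round(bw_adjusted_index(13.70, 1.01), 2))   # bandwidth-adjusted index
print(fp8_effective_gbp(13.70, True))             # FP8-effective £/GB-month
```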

Models that fit on 24 GB

| Model | Format | Weights | KV @ 16k FP8 | Total @ 16k | Headroom |
|---|---|---|---|---|---|
| Llama 3.1 8B | FP16 | 16 GB | 1 GB | 17.5 GB | 6.5 GB (8x batch) |
| Llama 3.1 8B | FP8 | 8 GB | 0.5 GB | 9.0 GB | 15 GB (32x batch) |
| Mistral 7B v0.3 | FP8 | 7.25 GB | 0.4 GB (sliding) | 8.4 GB | 15.6 GB |
| Mistral Nemo 12B | FP8 | 12.2 GB | 0.75 GB | 14.0 GB | 10 GB |
| Mistral Small 3 24B | AWQ INT4 | 13 GB | 1.5 GB | 15.5 GB | 8.5 GB |
| Llama 3.1 70B | AWQ INT4 | 17 GB | 2.6 GB | 21 GB | 3 GB (4x batch) |
| Qwen 2.5 14B | AWQ INT4 | 9 GB | 1.5 GB | 11.5 GB | 12.5 GB |
| Qwen 2.5 32B | AWQ INT4 | 17 GB | 2 GB | 20 GB | 4 GB |
| Mixtral 8x7B | AWQ INT4 | 25 GB | n/a | OOM | — |
| FLUX.1-dev | FP16 | 23 GB | n/a | 23 GB | 1 GB (single image) |

The 24 GB envelope is enough for every flagship open model up to 70B if you accept INT4 weights and FP8 KV. Mixtral 8x7B is the only commonly-deployed model that does not fit; everything else is a question of how aggressive your quantisation can be. For a 12-engineer coding team running Qwen 2.5 14B AWQ on a single 4090, the 12 GB headroom supports 16 concurrent active sessions at 32k context, which is more than the team will sustain even at peak.
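The totals in the fit table are weights plus KV cache plus a little runtime overhead, and you can sanity-check any new model with a back-of-envelope estimator. A sketch, assuming Llama 3.1 8B's architecture shapes (32 layers, 8 KV heads, head dim 128 — check your model's config.json); production servers add paged-attention block rounding, so treat the output as an estimate:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 1) -> float:
    """KV cache size per stream: K and V tensors (hence the 2x) per layer.
    bytes_per_elem=1 for FP8 KV, 2 for FP16 KV."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

def fits(vram_gb: float, weights_gb: float, kv_gb: float,
         overhead_gb: float = 0.7) -> bool:
    """True if weights + KV + runtime overhead fit in the card's VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb

# Llama 3.1 8B, FP8 weights (~8 GB), FP8 KV at 16k context, single stream
kv = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, seq_len=16_384)
print(f"KV ~= {kv:.2f} GB per stream; fits 24 GB: {fits(24, 8.0, kv)}")
```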

Cost-per-GB by real workload

| Workload | Working set | Best card | £/month | Effective £/GB-month-used |
|---|---|---|---|---|
| Llama 3.1 8B FP8 chat | 10 GB | 4090 24GB | £329 | £32.90 |
| Mistral Nemo 12B FP8 RAG | 14 GB | 4090 24GB | £329 | £23.50 |
| Llama 3.1 70B AWQ chat | 21 GB | 4090 24GB | £329 | £15.66 |
| Llama 3.1 70B FP8 (full quality) | 76 GB | RTX 6000 Pro 96GB | £1,199 | £15.78 |
| Mixtral 8x7B AWQ | 25 GB | 5090 32GB | £499 | £19.96 |
| FLUX.1-dev FP16 | 23 GB | 4090 24GB | £329 | £14.30 |

The 4090 wins on £/GB-of-actual-working-set for every workload that fits the 24 GB envelope. The 70B AWQ figure of £15.66/GB-used is particularly striking: it is within 1 percent of the £15.78 the same model costs at full FP8 quality on the RTX 6000 Pro 96GB. The choice between them collapses to whether you can tolerate the roughly 1.5-point MMLU drop of AWQ INT4 relative to FP8.
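"Effective £/GB-month-used" is simply sticker price divided by the gigabytes the workload actually occupies; a trivial helper, for completeness:

```python
def effective_gbp_per_gb_used(monthly_gbp: float, working_set_gb: float) -> float:
    """£ per GB-month of VRAM the workload actually occupies,
    not per GB the card ships with."""
    return monthly_gbp / working_set_gb

print(round(effective_gbp_per_gb_used(329, 10), 2))   # 8B FP8 chat on a 4090 -> 32.9
print(round(effective_gbp_per_gb_used(1199, 76), 2))  # 70B FP8 on 6000 Pro  -> 15.78
```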

Production gotchas

  1. Headroom for batched serving is not optional. A model that fits at 23/24 GB single-stream cannot serve batched traffic. Plan for at least 3-4 GB of free VRAM for KV cache growth at concurrent users.
  2. FP8 KV calibration is checkpoint-dependent. Some FP8 weight checkpoints behave badly with E5M2 KV at long context. Validate before committing your sizing.
  3. Driver overhead is non-zero. Expect 0.5-0.8 GB of CUDA + driver overhead on a fresh 4090 before any model loads. Account for it.
  4. Two models on one card compete for VRAM. Running Whisper alongside Llama 3.1 8B on the same 4090 is feasible (Whisper Turbo INT8 is ~2 GB) but tight. Plan VRAM as an exclusive resource per workload class.
  5. The 4090 D 48GB grey-market mod looks tempting. It is the cheapest 48 GB card on the market but lacks warranty, fails some firmware checks, and may not pass driver updates beyond CUDA 12.x. Use only with eyes open.
  6. Hosted £/month is sticker, not all-in. Egress, NVMe storage, and bandwidth overage can add 10-20 percent. Read the contract.
  7. VRAM market price is volatile. The 3090 used market spikes when consumer 4090 supply tightens. Hosted £/GB-month is more stable than hardware list.
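Gotchas 1 and 3 combine into a simple capacity rule of thumb: budget weights, per-stream KV and driver overhead against the card's VRAM, and stop adding users before headroom drops below the 3-4 GB margin. A sketch under those assumptions (0.8 GB overhead and a 3 GB floor are the figures above; per-user KV from the fit table):

```python
def headroom_gb(vram_gb: float, weights_gb: float, kv_per_user_gb: float,
                users: int, driver_overhead_gb: float = 0.8) -> float:
    """Free VRAM after weights, per-user KV cache and CUDA/driver overhead."""
    return vram_gb - weights_gb - driver_overhead_gb - kv_per_user_gb * users

def max_users(vram_gb: float, weights_gb: float, kv_per_user_gb: float,
              floor_gb: float = 3.0) -> int:
    """Largest user count that keeps headroom at or above the safety floor."""
    users = 0
    while headroom_gb(vram_gb, weights_gb, kv_per_user_gb, users + 1) >= floor_gb:
        users += 1
    return users

# Mistral Nemo 12B FP8 (12.2 GB weights, 0.75 GB KV per 16k stream) on a 4090
print(max_users(24, 12.2, 0.75))
```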

Verdict and when 24 GB is the right answer

Pick the 4090 24GB on £/GB grounds when: your model fits 24 GB at FP8 or AWQ INT4 (everything up to Llama 70B INT4, Mistral Small 3 24B INT4, Qwen 2.5 32B AWQ, FLUX.1-dev); you need native FP8 (the cheapest card on the market for it); you serve 1-32 concurrent users at the per-stream rates the 4090 delivers; you want UK-hosted dedicated metal at a known monthly cost rather than per-second cloud billing surprises.

Skip the 4090 if your working set exceeds 23 GB even at INT4 (move to RTX 6000 Pro 96GB which has the best £/GB at the top end), if you need ECC for compliance reasons (RTX 6000 Ada or L40S), or if your batch is consistently above 64 with very long context (H100 HBM3 pulls decisively ahead at that scale). For the migration paths see the 3090 comparison, 5090 comparison, H100 comparison, and 4090 or 3090 decision.


See also: monthly hosting cost, 4090 vs 3090, 4090 vs 5090, 4090 or 3090?, 4090 or 5090?, 70B monthly cost, tier positioning 2026.
