NVIDIA RTX 4090 Hosting — The 24 GB Workhorse
The 24 GB sweet-spot for 13B-class models. Ada Lovelace at £289/mo, with hardware FP8, 1 TB/s bandwidth, and enough VRAM to run Llama 3.1 8B at full FP16 with 32K context — or a 13B at FP8 with KV cache to spare. The dependable production workhorse.
RTX 4090 Server Specs
The hardware you actually rent.
| GPU model | NVIDIA GeForce RTX 4090 (Ada Lovelace, AD102) |
|---|---|
| Architecture | Ada Lovelace — 4th gen Tensor Cores |
| VRAM | 24 GB GDDR6X @ 1,008 GB/s |
| CUDA cores | 16,384 |
| FP16 compute | ~82.6 TFLOPS |
| FP8 compute | ~660 TOPS (no hardware FP4 on Ada) |
| TDP | 450 W |
| Host CPU | AMD Ryzen 7 / 9 |
| Host RAM | Up to 64 GB DDR5 |
| Storage | 1 TB NVMe + 4 TB SATA SSD |
| Network | 1 Gbps unmetered |
| Location | London, United Kingdom |
What Fits on a Single RTX 4090
24 GB is the practical sweet spot for production. Comfortable for 7B–8B at full FP16 with long context, and capable of 13B–14B at FP8 or AWQ-INT4 with full 8K–32K context windows.
| Model | Params | FP16 | FP8 / INT4 | Notes |
|---|---|---|---|---|
| Mistral 7B Instruct | 7B | 14 GB FP16 | 5 GB INT4 | FP16 fits comfortably with 32K context |
| Llama 3.1 8B | 8B | 16 GB FP16 | 8 GB FP8 | FP16 fits 32K, FP8 fits 128K context |
| Llama 2 13B | 13B | 26 GB FP16 | 13 GB FP8 | FP16 won't fit; FP8 is comfortable |
| Qwen 2.5 14B | 14B | 28 GB FP16 | 9 GB INT4 | FP16 won't fit; FP8 (~14 GB) fits with 8K context, AWQ-INT4 leaves the most headroom |
| Codestral 22B | 22B | 44 GB FP16 | 12 GB INT4 | AWQ-INT4 only, tight KV budget |
| DeepSeek-Coder V2 Lite | 16B MoE (~2.4B active) | ~31 GB FP16 | ~16 GB FP8 | FP16 won't fit; FP8 does, and the 4090 is its cheapest comfortable home |
| FLUX.1 dev | 12B | 24 GB FP16 | 12 GB FP8 | FP16 borderline — FP8 fits comfortably |
| SDXL 1.0 | 3.5B | 8 GB FP16 | 4 GB FP8 | FP16 fits with headroom for ControlNets |
| Whisper Large-v3 | 1.5B | 6 GB | n/a | Plenty of room left for an LLM alongside |
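The arithmetic behind this table is simple enough to script. Here is our back-of-envelope rule of thumb (a sketch, not a vendor tool): weight memory scales as params × bytes-per-param, and real checkpoints run slightly larger once embeddings and quantisation scales are counted.

```python
# Back-of-envelope weight-memory estimate behind the fit table above.
# Rule of thumb only: real checkpoints run slightly larger (embeddings,
# quantisation scales) and runtimes add 1-2 GB of CUDA/allocator overhead.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}
CARD_GB = 24  # the 4090's VRAM envelope

def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB for a dense model."""
    return params_billion * BYTES_PER_PARAM[precision]

for name, params in [("Mistral 7B", 7), ("Llama 3.1 8B", 8),
                     ("Llama 2 13B", 13), ("Codestral 22B", 22)]:
    for prec in BYTES_PER_PARAM:
        gb = weight_gb(params, prec)
        # keep ~2 GB aside for KV cache and runtime overhead
        print(f"{name:14s} {prec:5s} {gb:5.1f} GB  "
              f"{'fits' if gb < CARD_GB - 2 else 'too big'}")
```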
When the RTX 4090 Is the Right Card
Real customer workloads we run on this hardware every day.
13B / 14B class production
The 4090 is the cheapest GPU we host that comfortably runs 13B at FP8 with full context. If your eval shows a 13B beating an 8B, this is the card to deploy on.
Code assistants
Code Llama 13B and DeepSeek-Coder V2 Lite both fit at FP8 with KV cache to spare. The 4090 is the cheapest comfortable host for either model in production.
FLUX.1 dev image generation
FP8 FLUX.1 dev produces a 1024×1024 image in under 10 seconds with full LoRA support. The 4090 is the price/quality pick for FLUX in production.
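For reference, a minimal diffusers sketch of the simplest path onto this card: the BF16 baseline with CPU offload. The FP8-quantised transformer variants most production users run here shrink the footprint further; prompt and sampler settings below are illustrative, not a tuned recipe.

```python
# Sketch: FLUX.1 dev on a single 24 GB card via diffusers.
# BF16 baseline with CPU offload to stay inside the VRAM envelope;
# FP8-quantised transformer variants reduce the footprint further.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keeps the borderline BF16 pipeline in bounds

image = pipe("a lighthouse at dusk, volumetric light",
             height=1024, width=1024,
             num_inference_steps=28, guidance_scale=3.5).images[0]
image.save("flux_out.png")
```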
Full-FP16 7B/8B with long context
Llama 3.1 8B at full FP16 with a 32K context window — no quantisation tradeoff, no quality concerns. The 24 GB envelope is exactly right for “the best 8B you can serve.”
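Concretely, a minimal vLLM sketch of that exact configuration (illustrative settings; the model ID assumes Hugging Face access approval, and gpu_memory_utilization is a tuning knob, not gospel):

```python
# Sketch: Llama 3.1 8B at full FP16 with a 32K context window on 24 GB.
# Illustrative settings; tune gpu_memory_utilization for your own stack.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="float16",              # no quantisation: ~16 GB of weights
    max_model_len=32768,          # the 32K window the 24 GB envelope affords
    gpu_memory_utilization=0.92,  # leave slack for the CUDA context
)

outputs = llm.generate(["Summarise the trade-offs of FP8 inference."],
                       SamplingParams(max_tokens=256, temperature=0.7))
print(outputs[0].outputs[0].text)
```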
Voice agent backend
Whisper Large-v3 + a 13B LLM at FP8 + a TTS model all fit on one card with room left for KV cache. Roughly 6–8 concurrent voice sessions per server.
Mid-tier production deployment
Headroom matters. The 4090’s 24 GB lets you load a primary model plus an embedding model plus a reranker without juggling. The dependable workhorse for serious production.
RTX 4090 vs Other Cards
How this card stacks up against the rest of the GigaGPU catalogue for the workloads we benchmark.
| GPU | VRAM | Throughput / Notes | 13B FP8 fits? | Price |
|---|---|---|---|---|
| RTX 4090 | 24 GB GDDR6X | ~110 tok/s (Llama 3.1 8B FP8 single-stream) | Yes, with full 32K context | from £289 |
| RTX 5080 | 16 GB GDDR7 | Faster per-token on FP8 (Blackwell), a third less VRAM | FP8 only with short context | from £189 |
| RTX 5090 | 32 GB GDDR7 | Hardware FP4 + 32 GB envelope, 38% more expensive | Yes, and sub-4-bit 70B quants come within reach | from £399 |
| RTX 3090 | 24 GB GDDR6X | ~30% slower, no FP8 hardware path, 45% cheaper | Yes (FP16/INT4 only — no native FP8) | from £159 |
| RTX 6000 PRO | 96 GB GDDR7 | 4× the VRAM, ECC, only single-card 70B FP8 option | Yes — and 70B FP8 fits | from £899 |
Deep Dive
"The 24 GB sweet-spot" — what we mean
VRAM tiers in production GPU hosting cluster around three points: 16 GB (one 8B model, tight), 24 GB (one 13B model comfortably, or an 8B with a long context and a friend on the side), and 32 GB+ (room for genuine multi-model stacks). The 4090's 24 GB lands exactly on the middle tier, and at £289/mo it's the cheapest dependable way to occupy it.
If your eval matrix has 13B models beating 8B models on the metrics that matter, the 4090 is almost certainly the right deployment target. The 5080 forces FP8 with short context. The 5090 costs 38% more. The 6000 PRO costs roughly 3× as much for VRAM you won't use unless you're serving 70B.
Why we still recommend the 5080 over the 4090 for some teams
The honest answer: if your model fits in 16 GB and you care about latency over headroom, the 5080 wins. Blackwell tensor cores are faster per-token than Ada, hardware FP4 saves another 2× on memory pressure, and you save £100/mo. For single-model serving of a 7B at FP8, the 5080 is the sharper tool.
The 4090 wins the moment you need: (a) FP16 quality on an 8B, (b) any 13B/14B model, (c) FLUX.1 dev, or (d) multiple models loaded at once. That covers most production stacks we see — which is why the 4090 stays our most-deployed mid-tier card.
FP8 path matters — but FP4 is missing
The 4090's Ada tensor cores have hardware FP8 (~660 TOPS) but no hardware FP4. That last detail matters less than the marketing makes it sound. FP4 gets you another 2× memory headroom and ~1.5× speed on the 5090, useful for squeezing a sub-4-bit 70B build onto 32 GB. On a 24 GB 4090 running 13B-and-below, FP8 is already the right precision tier:
- Llama 2 13B at FP16 → 26 GB. Won't fit.
- Llama 2 13B at FP8 → 13 GB. Comfortable, with headroom for long context (see the sketch below).
- Llama 2 13B at INT4/AWQ → 7 GB. Plenty of room for a second model.
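To put numbers on "headroom for long context", the KV-cache arithmetic looks like this (a sketch assuming Llama 2 13B's published shape: 40 layers, 40 KV heads, head dim 128, full multi-head attention, and ~2 GB reserved for runtime overhead):

```python
# How much context fits beside 13 GB of FP8 weights on a 24 GB card?
# Sketch arithmetic; assumes Llama 2 13B dims (40 layers, 40 KV heads,
# head_dim 128) and ~2 GB reserved for runtime/activation overhead.
CARD_GB, WEIGHTS_GB, OVERHEAD_GB = 24, 13, 2

def kv_bytes_per_token(layers=40, kv_heads=40, head_dim=128, kv_byte=2):
    return 2 * layers * kv_heads * head_dim * kv_byte  # K and V per token

budget = (CARD_GB - WEIGHTS_GB - OVERHEAD_GB) * 1e9
print(f"FP16 KV cache: ~{budget / kv_bytes_per_token():,.0f} tokens")
print(f"FP8  KV cache: ~{budget / kv_bytes_per_token(kv_byte=1):,.0f} tokens")
```

That pencils out to roughly 11K tokens of FP16 KV cache per sequence, or about double with an FP8 KV cache, which is where the 8K-and-up context claims come from; pushing toward 32K means quantising the KV cache too.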
Most production deployments on a 4090 land at FP8 — best balance of quality, memory, and speed. If you genuinely need FP4 (you’re trying to fit a 70B on a single card), the 5090 or 6000 PRO is the right call.
Frequently Asked Questions
The questions buyers actually ask before committing to a GPU server.
Can I run a 13B model (say, Llama 2 13B) on a single 4090?
Yes at FP8 (13 GB weights + KV cache, comfortable with 8K–32K context). Not at FP16, which needs 26 GB against the 4090's 24 GB. Most production teams run 13B at FP8 on the 4090 and don't notice the quality difference vs FP16.
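A hedged sketch of that FP8 path (assumes a recent vLLM build with FP8 weight quantisation on Ada; the model ID is illustrative):

```python
# Sketch: serve a 13B at FP8 on the 4090. Relies on vLLM's on-the-fly
# FP8 weight quantisation (supported on Ada, compute capability 8.9).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # illustrative 13B checkpoint
    quantization="fp8",                      # ~13 GB of weights
    max_model_len=8192,
)
```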
4090 vs 5080 — which should I pick?
5080 if your model fits in 16 GB and you want the lowest latency. 4090 if you need 24 GB headroom — for any 13B model, FLUX.1 dev, or multi-model serving. The 4090 costs £100 more but saves you the "does it fit?" gymnastics.
Is the 4090 enough for fine-tuning?
QLoRA on 7B–13B models works well. Full SFT on 13B+ does not — go to a 5090, two 4090s, or a 6000 PRO for that.
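For reference, a minimal QLoRA setup of the kind that fits this card (a sketch on the peft + bitsandbytes stack; model ID and hyperparameters are illustrative, not a tuned recipe):

```python
# QLoRA sketch for a 13B on 24 GB: 4-bit NF4 base weights, LoRA adapters.
# Illustrative hyperparameters; not a tuned recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # base model in 4-bit NF4 (~7 GB)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",            # illustrative 13B base
    quantization_config=bnb, device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # only the adapters train
```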
How does it compare to the 3090?
Same 24 GB. The 4090 is ~30% faster per token, has hardware FP8 (3090 has no FP8 path), and pulls 450 W vs 350 W. The 3090 is 45% cheaper at £159/mo. If you’re cost-sensitive and don’t need FP8, the 3090 is still a solid pick.
Can I run two 4090s in one server?
Yes, via PCIe; the 4090 has no NVLink. 2× 4090 = 48 GB combined, which lets you run a 70B at INT4 with tensor parallelism, or a 30B at FP8. Talk to sales for dual-GPU pricing.
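In vLLM that looks like the following (a sketch; the AWQ checkpoint ID is illustrative, and any INT4 export of a 70B works the same way):

```python
# Sketch: a 70B at INT4 (AWQ) sharded across two 4090s over PCIe.
# Illustrative checkpoint ID; tensor parallelism splits each layer's
# weights across both GPUs, so 48 GB combined covers ~35 GB of weights.
from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    quantization="awq",
    tensor_parallel_size=2,
)
```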
Will FLUX.1 dev fit?
FP16 is borderline (24 GB transformer + VAE + text encoders is right at the edge). FP8 fits with comfortable headroom for LoRAs and ControlNets. Most ComfyUI users run FP8 on the 4090 in production.
Power draw at 100% load?
450 W. We deploy it in a 4U chassis with dedicated cooling and a 1,000 W PSU for headroom. Stable at sustained load.
Same-day deployment?
Yes for in-stock SKUs. Out-of-stock 4090 lead time is 2–3 working days.
Related Pages
Pages our visitors typically read next.
13B in production? The 4090 is your card.
24 GB GDDR6X, hardware FP8, the cheapest comfortable home for any 13B–14B model. From £289/mo with same-day deployment.