RTX 4090 - Order Now
Ada Lovelace · 24 GB · 13B Sweet Spot

NVIDIA RTX 4090 Hosting — The 24 GB Workhorse

The 24 GB sweet-spot for 13B-class models. Ada Lovelace at £289/mo, with hardware FP8, 1 TB/s bandwidth, and enough VRAM to run Llama 3.1 8B at full FP16 with 32K context — or a 13B at FP8 with KV cache to spare. The dependable production workhorse.

24 GB GDDR6X VRAM · 16,384 CUDA cores · 1,008 GB/s memory bandwidth · Hardware FP8 (660 TOPS) · From £289/mo

RTX 4090 Server Specs

The hardware you actually rent.

GPU model: NVIDIA GeForce RTX 4090 (Ada Lovelace, AD102)
Architecture: Ada Lovelace, 4th gen Tensor Cores
VRAM: 24 GB GDDR6X @ 1,008 GB/s
CUDA cores: 16,384
FP16 compute: ~82.6 TFLOPS
FP8 compute: ~660 TOPS (no hardware FP4 on Ada)
TDP: 450 W
Host CPU: AMD Ryzen 7 / 9
Host RAM: Up to 64 GB DDR5
Storage: 1 TB NVMe + 4 TB SATA SSD
Network: 1 Gbps unmetered
Location: London, United Kingdom

What Fits on a Single RTX 4090

24 GB is the practical sweet spot for production. Comfortable for 7B–8B at full FP16 with long context, and capable of 13B–14B at FP8 or AWQ-INT4 with full 8K–32K context windows.

Model | Params | FP16 | FP8 / INT4 | Notes
Mistral 7B Instruct | 7B | 14 GB FP16 | 5 GB INT4 | FP16 fits comfortably with 32K context
Llama 3.1 8B | 8B | 16 GB FP16 | 8 GB FP8 | FP16 fits 32K, FP8 fits 128K context
Llama 3.1 13B | 13B | 26 GB FP16 | 13 GB FP8 | FP16 won’t fit; FP8 is comfortable
Qwen 2.5 14B | 14B | 28 GB FP16 | 9 GB INT4 | FP16 won’t fit; AWQ-INT4 is comfortable, FP8 fits only with 8K context
Codestral 22B | 22B | 44 GB FP16 | 12 GB INT4 | AWQ-INT4 only, tight KV budget
DeepSeek-Coder V2 Lite | 16B MoE (~3B active) | ~17 GB FP16 | ~9 GB FP8 | Fits at FP16; the cheapest comfortable home for it
FLUX.1 dev | 12B | 24 GB FP16 | 12 GB FP8 | FP16 borderline; FP8 fits comfortably
SDXL 1.0 | 3.5B | 8 GB FP16 | 4 GB FP8 | FP16 fits with headroom for ControlNets
Whisper Large-v3 | 1.5B | 6 GB | n/a | Plenty of room left for an LLM alongside
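If you want to sanity-check these figures for a model that isn't listed, the arithmetic is simple: weights cost roughly parameter count times bytes per parameter, and the KV cache grows with layers × KV heads × head dim × context length. Below is a minimal back-of-the-envelope sketch in plain Python; the Llama 3.1 8B shape used in the example (32 layers, 8 KV heads, head dim 128) is its published config, and runtime overheads such as activations and the CUDA context are not modelled.

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint: 1B params at 1 byte/param is ~1 GB."""
    return params_billions * bytes_per_param

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """K and V caches for a single sequence at the given context length."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 1e9

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128, FP16 weights, 32K context
weights = weight_gb(8, 2.0)                # ~16 GB
kv      = kv_cache_gb(32, 8, 128, 32_768)  # ~4.3 GB
print(f"~{weights + kv:.1f} GB of 24 GB")  # ~20 GB, fits with headroom
```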

When the RTX 4090 Is the Right Card

Real customer workloads we run on this hardware every day.

13B / 14B class production

The 4090 is the cheapest GPU we host that comfortably runs 13B at FP8 with full context. If your eval shows a 13B beating an 8B, this is the card to deploy on.

Llama 3.1 13B · Qwen 2.5 14B · FP8 inference

Code assistants

Code Llama 13B and DeepSeek-Coder V2 Lite both fit at FP8 with KV cache to spare. The 4090 is the cheapest comfortable host for either model in production.

Code Llama 13B · DeepSeek-Coder V2 Lite · Codestral AWQ

FLUX.1 dev image generation

FP8 FLUX.1 dev produces a 1024×1024 image in under 10 seconds with full LoRA support. The 4090 is the price/quality pick for FLUX in production.

FLUX.1 dev FP8 · SDXL + ControlNet · ComfyUI

Full-FP16 7B/8B with long context

Llama 3.1 8B at full FP16 with a 32K context window — no quantisation tradeoff, no quality concerns. The 24 GB envelope is exactly right for “the best 8B you can serve.”

Llama 3.1 8B FP16 · 32K context · RAG backends

Voice agent backend

Whisper Large-v3 + a 13B LLM at FP8 + a TTS model all fit on one card with room left for KV cache. Roughly 6–8 concurrent voice sessions per server.

Whisper · 13B LLM · TTS
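As one illustration of the speech-to-text half of that stack, faster-whisper loads Large-v3 in FP16 well inside the ~6 GB shown in the table above, leaving most of the card for the LLM and TTS. A hedged sketch (the audio path is a placeholder):

```python
from faster_whisper import WhisperModel

# Whisper Large-v3 in FP16, sharing the GPU with the LLM and TTS models.
stt = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Placeholder audio file; in a real voice agent this would be a streamed chunk.
segments, info = stt.transcribe("caller_audio.wav", language="en")
print(" ".join(segment.text for segment in segments))
```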

Mid-tier production deployment

Headroom matters. The 4090’s 24 GB lets you load a primary model plus an embedding model plus a reranker without juggling. The dependable workhorse for serious production.

Multi-model serving · Embeddings + reranker · Production stacks

RTX 4090 vs Other Cards

How this card stacks up against the rest of the GigaGPU catalogue for the workloads we benchmark.

GPU | VRAM | Throughput / Notes | 13B FP8 fits? | Price
RTX 4090 | 24 GB GDDR6X | ~110 tok/s (Llama 3.1 8B FP8, single-stream) | Yes, with full 32K context | from £289
RTX 5080 | 16 GB GDDR7 | Faster per-token on FP8 (Blackwell), 33% less VRAM | Only with short context | from £189
RTX 5090 | 32 GB GDDR7 | Hardware FP4 + 32 GB envelope, 38% more expensive | Yes, and 70B INT4 fits too | from £399
RTX 3090 | 24 GB GDDR6X | ~30% slower, no FP8 hardware path, 45% cheaper | No native FP8; 13B fits at INT4 | from £159
RTX 6000 PRO | 96 GB GDDR7 | 4× the VRAM, ECC, the only single-card 70B FP8 option | Yes, and 70B FP8 fits | from £899

Deep Dive

"The 24 GB sweet-spot" — what we mean

VRAM tiers in production GPU hosting cluster around three points: 16 GB (one 8B model, tight), 24 GB (one 13B model comfortably, or an 8B with a long context and a friend on the side), and 32 GB+ (room for genuine multi-model stacks). The 4090’s 24 GB lands exactly on the middle tier, and at £289/mo it’s a dependable place to stand.

If your eval matrix has 13B models beating 8B models on the metrics that matter, the 4090 is almost certainly the right deployment target. The 5080 forces FP8 with short context. The 5090 costs 38% more. The 6000 PRO costs 3× more for VRAM you won’t use unless you’re serving 70B.

Why we still recommend the 5080 over the 4090 for some teams

The honest answer: if your model fits in 16 GB and you care about latency over headroom, the 5080 wins. Blackwell tensor cores are faster per-token than Ada, hardware FP4 saves another 2× on memory pressure, and you save £100/mo. For single-model serving of a 7B at FP8, the 5080 is the sharper tool.

The 4090 wins the moment you need: (a) FP16 quality on an 8B, (b) any 13B/14B model, (c) FLUX.1 dev, or (d) multiple models loaded at once. That covers most production stacks we see — which is why the 4090 stays our most-deployed mid-tier card.

FP8 path matters — but FP4 is missing

The 4090’s Ada tensor cores have hardware FP8 (~660 TOPS) but no hardware FP4. That last detail matters less than the marketing makes it sound. FP4 gets you another 2× memory headroom and ~1.5× speed on the 5090 — useful for squeezing 70B onto 32 GB. On a 24 GB 4090 running 13B-and-below, FP8 is already the right precision tier:

  • Llama 3.1 13B at FP16 → 26 GB. Won’t fit.
  • Llama 3.1 13B at FP8 → 13 GB. Comfortable, with headroom for long context.
  • Llama 3.1 13B at INT4/AWQ → 7 GB. Plenty of room for a second model.

Most production deployments on a 4090 land at FP8 — best balance of quality, memory, and speed. If you genuinely need FP4 (you’re trying to fit a 70B on a single card), the 5090 or 6000 PRO is the right call.
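As a hedged illustration of what an FP8 deployment on this card can look like with vLLM's offline Python API (flag names reflect recent vLLM releases and may vary by version; the model id is illustrative, so swap in your own 13B checkpoint):

```python
from vllm import LLM, SamplingParams

# Quantise weights to FP8 at load time, and cap the context window and
# VRAM fraction so everything stays inside the 4090's 24 GB.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative; use your 13B here
    quantization="fp8",
    max_model_len=32_768,
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Summarise why FP8 is the right precision tier for a 13B on 24 GB."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```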

Frequently Asked Questions

The questions buyers actually ask before committing to a GPU server.

Can I run Llama 3.1 13B on a single 4090?

Yes, at FP8 (13 GB of weights plus KV cache, comfortable with 8K–32K context). Not at FP16: that needs 26 GB and the 4090 has 24 GB. Most production teams run 13B at FP8 on the 4090 and don’t notice the quality difference vs FP16.

4090 vs 5080 — which should I pick?

5080 if your model fits in 16 GB and you want the lowest latency. 4090 if you need 24 GB headroom — for any 13B model, FLUX.1 dev, or multi-model serving. The 4090 costs £100 more but saves you the "does it fit?" gymnastics.

Is the 4090 enough for fine-tuning?

QLoRA on 7B–13B models works well. Full SFT on 13B+ does not — go to a 5090, two 4090s, or a 6000 PRO for that.
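For reference, a QLoRA setup on this card typically looks something like the sketch below, using transformers, bitsandbytes, and peft; the base model and LoRA hyperparameters are illustrative rather than a tuned recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base weights keep a 7B-13B model well inside 24 GB,
# while the LoRA adapters themselves train in bf16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # illustrative; use your base model
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a fraction of a percent of the base model
```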

How does it compare to the 3090?

Same 24 GB. The 4090 is ~30% faster per token, has hardware FP8 (3090 has no FP8 path), and pulls 450 W vs 350 W. The 3090 is 45% cheaper at £159/mo. If you’re cost-sensitive and don’t need FP8, the 3090 is still a solid pick.

Can I run two 4090s in one server?

Yes, via PCIe (they don’t have NVLink). 2× 4090 gives 48 GB combined, which lets you run a 70B at INT4 with tensor parallelism, or a 30B at FP8. Talk to sales for dual-GPU pricing.
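A minimal sketch of that dual-card layout with vLLM tensor parallelism; the AWQ checkpoint name is a placeholder, and over PCIe you trade some inter-GPU bandwidth, not capacity:

```python
from vllm import LLM

# Shard an AWQ-INT4 70B across two 24 GB 4090s over PCIe.
llm = LLM(
    model="your-org/llama-70b-instruct-awq",  # placeholder AWQ-INT4 checkpoint
    quantization="awq",
    tensor_parallel_size=2,
    max_model_len=8_192,
)
```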

Will FLUX.1 dev fit?

FP16 is borderline (the 12B transformer alone is ~24 GB at FP16, before the VAE and text encoders). FP8 fits with comfortable headroom for LoRAs and ControlNets. Most ComfyUI users run FP8 on the 4090 in production.

Power draw at 100% load?

450 W. We pair it with a 4U cooler and a 1,000 W PSU for headroom, and it’s stable under sustained load.

Same-day deployment?

Yes for in-stock SKUs. Out-of-stock 4090 lead time is 2–3 working days.

13B in production? The 4090 is your card.

24 GB GDDR6X, hardware FP8, the cheapest comfortable home for any 13B–14B model. From £289/mo with same-day deployment.

Have a question? Need help?