
How Much VRAM Do You Actually Need? (Cost Optimisation Guide)

Stop overpaying for GPU memory. This guide shows exactly how much VRAM each AI model needs and how to choose the cheapest GPU configuration that handles your workload.

VRAM Requirements: The Basics

The single biggest factor in GPU hosting cost is VRAM. Choose too much and you overpay. Choose too little and your model will not fit. This guide helps you find the sweet spot: the minimum VRAM that runs your workload efficiently, so you can pick the cheapest dedicated GPU server that gets the job done.

The rule of thumb: a model needs approximately 2x its parameter count in GB at FP16 precision, plus overhead for KV cache and inference. A 7B model needs ~14GB, a 70B model needs ~140GB. But with quantisation, you can slash these requirements dramatically.
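If you want to sanity-check a model against a GPU before ordering, the rule of thumb is easy to script. A minimal Python sketch, where the 1.2 overhead factor is our assumption for KV cache and runtime buffers (the tables below list weights-only figures):

```python
def estimate_vram_gb(params_billions: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights (params x bytes/param) plus ~20% overhead.

    overhead=1.2 is an assumed fudge factor for KV cache and runtime buffers;
    set overhead=1.0 to reproduce the weights-only figures in the tables below.
    """
    return params_billions * (bits / 8) * overhead

for params, bits in [(7, 16), (7, 8), (70, 8), (70, 4)]:
    label = "FP16" if bits == 16 else f"INT{bits}"
    print(f"{params}B @ {label} -> ~{estimate_vram_gb(params, bits):.0f}GB")
```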

VRAM by Model Size

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM | Cheapest GPU Option | Monthly Cost |
|---|---|---|---|---|---|
| 1-3B (Phi-3 Mini) | ~6GB | ~3GB | ~2GB | RTX 3090 24GB | $99 |
| 7B (Mistral 7B) | ~14GB | ~7GB | ~4GB | RTX 3090 24GB | $99 |
| 13B | ~26GB | ~13GB | ~7GB | RTX 5090 32GB (FP16) | $149 |
| 30-34B (Qwen 32B) | ~64GB | ~32GB | ~17GB | RTX 6000 Pro 96GB (INT8) | $299 |
| 70B (LLaMA 3 70B) | ~140GB | ~70GB | ~35GB | 1x RTX 5090 (INT4) | $149 |
| 70B (quality) | ~140GB | ~70GB | — | 2x RTX 6000 Pro 96GB (FP16) | $599 |
| 120-130B | ~260GB | ~130GB | ~65GB | 2x RTX 6000 Pro 96GB (INT8) | $599 |
| 200B+ (MoE) | ~400GB+ | ~200GB | ~100GB | 4x RTX 6000 Pro 96GB | $899 |

Notice the cost jump between model sizes. Going from 7B to 70B raises your hosting cost from $99/month to $149-$599/month, depending on precision. That is why right-sizing your model is the single most impactful cost optimisation you can make. Use our cost per million tokens calculator to compare.
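The arithmetic behind that comparison is simple enough to sketch. A minimal example, where the throughput and utilisation figures are illustrative assumptions rather than benchmark results:

```python
def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_second: float,
                            utilisation: float = 0.5) -> float:
    """USD per 1M generated tokens on a flat-rate server.

    utilisation is the assumed fraction of the month the GPU is actually busy.
    """
    tokens_per_month = tokens_per_second * 30 * 24 * 3600 * utilisation
    return monthly_cost_usd / tokens_per_month * 1_000_000

# Illustrative throughputs, not benchmarks:
print(f"${cost_per_million_tokens(149, 60):.2f}/1M tokens")  # ~$1.92 (70B INT4 tier)
print(f"${cost_per_million_tokens(599, 25):.2f}/1M tokens")  # ~$18.49 (70B FP16 tier)
```

A flat-rate GPU only gets cheaper per token as utilisation rises, which is the core of the self-hosting economics.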

Quantisation: Cut VRAM Costs in Half

Quantisation reduces model precision from FP16 (16-bit) to INT8 (8-bit) or INT4 (4-bit). This directly translates to lower VRAM requirements and therefore cheaper GPU servers:

| 70B Model | Precision | VRAM Needed | Cheapest GPU | Monthly Cost | Quality Impact |
|---|---|---|---|---|---|
| LLaMA 3 70B | FP16 | ~140GB | 2x RTX 6000 Pro 96GB | $599 | Baseline (best) |
| LLaMA 3 70B | INT8 (GPTQ) | ~70GB | 1x RTX 6000 Pro 96GB | $299 | ~1% quality loss |
| LLaMA 3 70B | INT4 (GPTQ) | ~35GB | 1x RTX 5090 | $149 | ~3-5% quality loss |

INT8 quantisation saves $300/month (a 50% cost reduction) with negligible quality impact. INT4 saves $450/month (a 75% reduction) with a minor quality loss that is acceptable for many production use cases.

For most workloads, INT8 is the sweet spot. Reserve FP16 for tasks requiring maximum accuracy (medical, legal, financial). See detailed quantisation benchmarks in our best GPU for inference guide.
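For reference, here is a minimal sketch of INT8 loading with Hugging Face transformers and bitsandbytes, one common route (the GPTQ figures in the table above come from pre-quantised checkpoints instead). The model ID is illustrative and gated behind Meta's licence:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # illustrative; requires licence acceptance

# load_in_8bit quantises weights to INT8 at load time, roughly halving VRAM
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spreads layers across GPUs if one card is not enough
)
```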


KV Cache: The Hidden VRAM Cost

Model weights are only part of the VRAM equation. The KV (key-value) cache stores attention state for each active request and grows with:

  • Sequence length: longer conversations or documents use more KV cache
  • Concurrent users: each simultaneous request needs its own KV cache
  • Model architecture: models with more attention heads use more KV cache

Rule of thumb: reserve 20-40% of VRAM beyond model weights for KV cache and overhead. A 70B INT8 model uses ~70GB for weights but needs ~85-90GB total for comfortable production operation.
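The standard per-token formula makes this concrete: KV cache = 2 (K and V) × layers × KV heads × head dim × bytes per element. A minimal sketch using LLaMA 3 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dim 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, concurrent: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim bytes per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * seq_len * concurrent / 1e9

# LLaMA 3 70B with an FP16 cache: ~0.33MB per token, so eight concurrent
# 8K-token contexts need ~21GB on top of the model weights
print(f"{kv_cache_gb(80, 8, 128, seq_len=8192, concurrent=8):.1f} GB")
```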

This is why the RTX 6000 Pro 96GB is recommended for 70B INT8 rather than a smaller card: the ~26GB of headroom above the ~70GB of weights leaves ample KV cache room for concurrent users. vLLM's PagedAttention optimises KV cache memory, maximising the number of concurrent requests your GPU can handle.
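A minimal vLLM sketch showing the knobs that matter here; the GPTQ checkpoint name is hypothetical, and gpu_memory_utilization controls how much of the card vLLM pre-allocates for weights plus the paged KV cache:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-3-70B-GPTQ",  # hypothetical INT4 GPTQ checkpoint
    quantization="gptq",
    max_model_len=8192,                 # caps per-request KV cache
    gpu_memory_utilization=0.90,        # fraction of VRAM vLLM may claim
)

outputs = llm.generate(
    ["Explain the KV cache in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```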

GPU Cost Tiers: What You Get at Each Price

| Monthly Cost | GPU | VRAM | Max Model | Best For |
|---|---|---|---|---|
| $99 | RTX 3090 | 24GB | 7B FP16 / 13B INT8 | Small models, embeddings, Phi-3 |
| $149 | RTX 5090 | 32GB | 13B FP16 / 70B INT4 | Small-medium models, coding |
| $299 | RTX 6000 Pro | 96GB | 30B FP16 / 70B INT8 | Medium models, production workloads |
| $599 | 2x RTX 6000 Pro | 192GB | 70B FP16 / 130B INT8 | Large models, high quality |
| $899 | 4x RTX 6000 Pro | 384GB | 200B+ FP16 | High throughput, massive models |
| $1,599 | 8x RTX 6000 Pro | 768GB | 400B+ FP16 | Enterprise, multi-model clusters |

See our cheapest GPU for AI inference guide and RTX 3090 vs RTX 5090 comparison for detailed hardware analysis.

Match Your Workload to the Right GPU

| Workload | Recommended Model | VRAM Needed | Cheapest GPU | Monthly Cost |
|---|---|---|---|---|
| Customer chatbot | Mistral 7B or LLaMA 3 8B | 16-20GB | RTX 5090 | $149 |
| RAG / document QA | Qwen 32B + embeddings | 40-60GB | RTX 6000 Pro 96GB | $299 |
| Premium chatbot | LLaMA 3 70B | 80-140GB | RTX 6000 Pro 96GB (INT8) | $299 |
| Coding assistant | DeepSeek Coder 6.7B | 14-18GB | RTX 5090 | $149 |
| Video generation | Stable Video Diffusion | 24-80GB | RTX 6000 Pro 96GB | $299 |
| Image generation | SDXL / Flux | 12-24GB | RTX 5090 | $149 |
| Speech / TTS | Whisper + XTTS | 8-16GB | RTX 3090 | $99 |

Cost Optimisation Tips

  1. Start with the smallest model that meets your quality bar. A fine-tuned 7B model often outperforms a generic 70B model on specific tasks.
  2. Use INT8 quantisation by default. The quality loss is negligible for most applications and it halves your VRAM (and cost).
  3. Run multiple small models on one GPU. A 24GB GPU can host a 7B chat model AND an embedding model simultaneously (see the sketch after this list).
  4. Use vLLM for production. Its PagedAttention mechanism maximises concurrent users per GB of VRAM.
  5. Consider MoE models. DeepSeek-V2 has 236B parameters but only activates 21B, giving large-model quality at small-model VRAM usage.
  6. Benchmark before committing. Use our tokens per second benchmark to verify throughput.
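Expanding on tip 3, here is a minimal sketch of co-locating an INT8 chat model and an embedding model on a single 24GB card; the model IDs are illustrative and the memory figures approximate:

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

chat_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative; ~7GB at INT8
embed_id = "BAAI/bge-large-en-v1.5"             # illustrative; ~1.3GB at FP16

tokenizer = AutoTokenizer.from_pretrained(chat_id)
chat_model = AutoModelForCausalLM.from_pretrained(
    chat_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map={"": 0},  # pin all layers to GPU 0
)
embedder = SentenceTransformer(embed_id, device="cuda:0")

# Both models share one 24GB GPU, leaving roughly 15GB for KV cache and batching
vectors = embedder.encode(["How much VRAM do I need?"])
```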

For per-model cost breakdowns, see our guides for LLaMA 3, DeepSeek, Mistral, Qwen, and Phi-3. For the complete self-hosting economics, read our complete cost guide and ROI analysis.

Get the Right GPU for Your Budget

From $99/month for 24GB to $1,599/month for 768GB. Find your optimal configuration.

Browse GPU Servers
