
How Much VRAM Do You Actually Need? (Cost Optimisation Guide)

Stop overpaying for GPU memory. This guide shows exactly how much VRAM each AI model needs and how to choose the cheapest GPU configuration that handles your workload.

VRAM Requirements: The Basics

The single biggest factor in GPU hosting cost is VRAM. Choose too much and you overpay. Choose too little and your model will not fit. This guide helps you find the sweet spot: the minimum VRAM that runs your workload efficiently, so you can pick the cheapest dedicated GPU server that gets the job done.

The rule of thumb: a model needs approximately 2x its parameter count in GB at FP16 precision, plus overhead for KV cache and inference. A 7B model needs ~14GB, a 70B model needs ~140GB. But with quantisation, you can slash these requirements dramatically.
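If you want to sanity-check a model against a GPU before ordering, the rule of thumb is easy to script. A minimal Python sketch, where the 1.2 overhead factor is our assumption for KV cache and runtime buffers (the tables below list weights-only figures):

```python
def estimate_vram_gb(params_billions: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights (params x bytes/param) plus ~20% overhead.

    overhead=1.2 is an assumed fudge factor for KV cache and runtime buffers;
    set overhead=1.0 to reproduce the weights-only figures in the tables below.
    """
    return params_billions * (bits / 8) * overhead

for params, bits in [(7, 16), (7, 8), (70, 8), (70, 4)]:
    label = "FP16" if bits == 16 else f"INT{bits}"
    print(f"{params}B @ {label} -> ~{estimate_vram_gb(params, bits):.0f}GB")
```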

VRAM by Model Size

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM | Cheapest GPU Option | Monthly Cost |
|---|---|---|---|---|---|
| 1-3B (Phi-3 Mini) | ~6GB | ~3GB | ~2GB | RTX 3090 24GB | $99 |
| 7B (Mistral 7B) | ~14GB | ~7GB | ~4GB | RTX 3090 24GB | $99 |
| 13B | ~26GB | ~13GB | ~7GB | RTX 5090 32GB (FP16) | $149 |
| 30-34B (Qwen 32B) | ~64GB | ~32GB | ~17GB | RTX 6000 Pro 96GB (INT8) | $299 |
| 70B (LLaMA 3 70B) | ~140GB | ~70GB | ~35GB | 1x RTX 5090 (INT4) | $149 |
| 70B (quality) | ~140GB | ~70GB | — | 2x RTX 6000 Pro 96GB (FP16) | $599 |
| 120-130B | ~260GB | ~130GB | ~65GB | 2x RTX 6000 Pro 96GB (INT8) | $599 |
| 200B+ (MoE) | ~400GB+ | ~200GB | ~100GB | 4x RTX 6000 Pro 96GB | $899 |

Notice the cost jump between model sizes. Going from 7B to 70B raises your hosting cost from $99/month to $149-$599/month, depending on precision. That is why right-sizing your model is the single most impactful cost optimisation you can make. Use our cost per million tokens calculator to compare.
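The arithmetic behind that comparison is simple enough to sketch. A minimal example, where the throughput and utilisation figures are illustrative assumptions rather than benchmark results:

```python
def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_second: float,
                            utilisation: float = 0.5) -> float:
    """USD per 1M generated tokens on a flat-rate server.

    utilisation is the assumed fraction of the month the GPU is actually busy.
    """
    tokens_per_month = tokens_per_second * 30 * 24 * 3600 * utilisation
    return monthly_cost_usd / tokens_per_month * 1_000_000

# Illustrative throughputs, not benchmarks:
print(f"${cost_per_million_tokens(149, 60):.2f}/1M tokens")  # ~$1.92 (70B INT4 tier)
print(f"${cost_per_million_tokens(599, 25):.2f}/1M tokens")  # ~$18.49 (70B FP16 tier)
```

A flat-rate GPU only gets cheaper per token as utilisation rises, which is the core of the self-hosting economics.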

Quantisation: Cut VRAM Costs in Half

Quantisation reduces model precision from FP16 (16-bit) to INT8 (8-bit) or INT4 (4-bit). This directly translates to lower VRAM requirements and therefore cheaper GPU servers:

| 70B Model | Precision | VRAM Needed | Cheapest GPU | Monthly Cost | Quality Impact |
|---|---|---|---|---|---|
| LLaMA 3 70B | FP16 | ~140GB | 2x RTX 6000 Pro 96GB | $599 | Baseline (best) |
| LLaMA 3 70B | INT8 (GPTQ) | ~70GB | 1x RTX 6000 Pro 96GB | $299 | ~1% quality loss |
| LLaMA 3 70B | INT4 (GPTQ) | ~35GB | 1x RTX 5090 | $149 | ~3-5% quality loss |

INT8 quantisation saves $300/month (a 50% cost reduction) with negligible quality impact. INT4 saves $450/month (a 75% reduction) with a minor quality loss that is acceptable for many production use cases.

For most workloads, INT8 is the sweet spot. Reserve FP16 for tasks requiring maximum accuracy (medical, legal, financial). See detailed quantisation benchmarks in our best GPU for inference guide.
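For reference, here is a minimal sketch of INT8 loading with Hugging Face transformers and bitsandbytes, one common route (the GPTQ figures in the table above come from pre-quantised checkpoints instead). The model ID is illustrative and gated behind Meta's licence:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # illustrative; requires licence acceptance

# load_in_8bit quantises weights to INT8 at load time, roughly halving VRAM
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spreads layers across GPUs if one card is not enough
)
```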


KV Cache: The Hidden VRAM Cost

Model weights are only part of the VRAM equation. The KV (key-value) cache stores attention state for each active request and grows with:

  • Sequence length: longer conversations or documents use more KV cache
  • Concurrent users: each simultaneous request needs its own KV cache
  • Model architecture: models with more attention heads use more KV cache

Rule of thumb: reserve 20-40% of VRAM beyond model weights for KV cache and overhead. A 70B INT8 model uses ~70GB for weights but needs ~85-90GB total for comfortable production operation.
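The standard per-token formula makes this concrete: KV cache = 2 (K and V) × layers × KV heads × head dim × bytes per element. A minimal sketch using LLaMA 3 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dim 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, concurrent: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim bytes per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * seq_len * concurrent / 1e9

# LLaMA 3 70B with an FP16 cache: ~0.33MB per token, so eight concurrent
# 8K-token contexts need ~21GB on top of the model weights
print(f"{kv_cache_gb(80, 8, 128, seq_len=8192, concurrent=8):.1f} GB")
```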

This is why the RTX 6000 Pro 96GB is recommended for 70B INT8 rather than a smaller card: the ~26GB of headroom above the ~70GB of weights leaves ample KV cache room for concurrent users. vLLM's PagedAttention optimises KV cache memory, maximising the number of concurrent requests your GPU can handle.
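A minimal vLLM sketch showing the knobs that matter here; the GPTQ checkpoint name is hypothetical, and gpu_memory_utilization controls how much of the card vLLM pre-allocates for weights plus the paged KV cache:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-3-70B-GPTQ",  # hypothetical INT4 GPTQ checkpoint
    quantization="gptq",
    max_model_len=8192,                 # caps per-request KV cache
    gpu_memory_utilization=0.90,        # fraction of VRAM vLLM may claim
)

outputs = llm.generate(
    ["Explain the KV cache in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```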

GPU Cost Tiers: What You Get at Each Price

| Monthly Cost | GPU | VRAM | Max Model | Best For |
|---|---|---|---|---|
| $99 | RTX 3090 | 24GB | 7B FP16 / 13B INT8 | Small models, embeddings, Phi-3 |
| $149 | RTX 5090 | 32GB | 13B FP16 / 70B INT4 | Small-medium models, coding |
| $299 | RTX 6000 Pro | 96GB | 30B FP16 / 70B INT8 | Medium models, production workloads |
| $599 | 2x RTX 6000 Pro | 192GB | 70B FP16 / 130B INT8 | Large models, high quality |
| $899 | 4x RTX 6000 Pro | 384GB | 200B+ FP16 | High throughput, massive models |
| $1,599 | 8x RTX 6000 Pro | 768GB | 400B+ FP16 | Enterprise, multi-model clusters |

See our cheapest GPU for AI inference guide and RTX 3090 vs RTX 5090 comparison for detailed hardware analysis.

Match Your Workload to the Right GPU

| Workload | Recommended Model | VRAM Needed | Cheapest GPU | Monthly Cost |
|---|---|---|---|---|
| Customer chatbot | Mistral 7B or LLaMA 3 8B | 16-20GB | RTX 5090 | $149 |
| RAG / document QA | Qwen 32B + embeddings | 40-60GB | RTX 6000 Pro 96GB | $299 |
| Premium chatbot | LLaMA 3 70B | 80-140GB | RTX 6000 Pro 96GB (INT8) | $299 |
| Coding assistant | DeepSeek Coder 6.7B | 14-18GB | RTX 5090 | $149 |
| Video generation | Stable Video Diffusion | 24-80GB | RTX 6000 Pro 96GB | $299 |
| Image generation | SDXL / Flux | 12-24GB | RTX 5090 | $149 |
| Speech / TTS | Whisper + XTTS | 8-16GB | RTX 3090 | $99 |

Cost Optimisation Tips

  1. Start with the smallest model that meets your quality bar. A fine-tuned 7B model often outperforms a generic 70B model on specific tasks.
  2. Use INT8 quantisation by default. The quality loss is negligible for most applications and it halves your VRAM (and cost).
  3. Run multiple small models on one GPU. A 24GB GPU can host a 7B chat model AND an embedding model simultaneously (see the sketch after this list).
  4. Use vLLM for production. Its PagedAttention mechanism maximises concurrent users per GB of VRAM.
  5. Consider MoE models. DeepSeek-V2 has 236B parameters but only activates 21B, giving large-model quality at small-model VRAM usage.
  6. Benchmark before committing. Use our tokens per second benchmark to verify throughput.
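Expanding on tip 3, here is a minimal sketch of co-locating an INT8 chat model and an embedding model on a single 24GB card; the model IDs are illustrative and the memory figures approximate:

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

chat_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative; ~7GB at INT8
embed_id = "BAAI/bge-large-en-v1.5"             # illustrative; ~1.3GB at FP16

tokenizer = AutoTokenizer.from_pretrained(chat_id)
chat_model = AutoModelForCausalLM.from_pretrained(
    chat_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map={"": 0},  # pin all layers to GPU 0
)
embedder = SentenceTransformer(embed_id, device="cuda:0")

# Both models share one 24GB GPU, leaving roughly 15GB for KV cache and batching
vectors = embedder.encode(["How much VRAM do I need?"])
```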

For per-model cost breakdowns, see our guides for LLaMA 3, DeepSeek, Mistral, Qwen, and Phi-3. For the complete self-hosting economics, read our complete cost guide and ROI analysis.

Get the Right GPU for Your Budget

From $99/month for 24GB to $1,599/month for 768GB. Find your optimal configuration.

Browse GPU Servers
