VRAM Requirements: The Basics
The single biggest factor in GPU hosting cost is VRAM. Choose too much and you overpay. Choose too little and your model will not fit. This guide helps you find the sweet spot: the minimum VRAM that runs your workload efficiently, so you can pick the cheapest dedicated GPU server that gets the job done.
The rule of thumb: at FP16 precision (2 bytes per parameter), a model's weights need roughly 2GB per billion parameters, plus overhead for the KV cache and inference runtime. A 7B model needs ~14GB, a 70B model ~140GB. With quantisation, you can slash these requirements dramatically.
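This arithmetic is easy to sanity-check yourself. The sketch below (plain Python, with an assumed ~30% overhead factor for the KV cache and runtime, covered later in this guide) reproduces the figures in the table that follows; treat it as an estimate, not a guarantee that a given model fits.

```python
# Rough VRAM estimate from parameter count and precision. The overhead
# factor is an assumption (~30%) covering KV cache and runtime buffers.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str = "fp16") -> float:
    """VRAM for the model weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[precision]

def total_vram_gb(params_billion: float, precision: str = "fp16",
                  overhead: float = 0.30) -> float:
    """Weights plus headroom for KV cache, activations and runtime."""
    return weight_vram_gb(params_billion, precision) * (1 + overhead)

print(round(total_vram_gb(7), 1))           # ~18.2 GB: fits a 24GB card
print(round(total_vram_gb(70, "int8"), 1))  # ~91.0 GB: wants a 96GB card
print(round(total_vram_gb(70), 1))          # ~182.0 GB: multi-GPU at FP16
```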
VRAM by Model Size
| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM | Cheapest GPU Option | Monthly Cost |
|---|---|---|---|---|---|
| 1-4B (Phi-3 Mini) | ~2-8GB | ~1-4GB | ~1-2GB | RTX 3090 24GB | $99 |
| 7B (Mistral 7B) | ~14GB | ~7GB | ~4GB | RTX 3090 24GB | $99 |
| 13B | ~26GB | ~13GB | ~7GB | RTX 5090 32 GB (INT4) | $149 |
| 30-34B (Qwen 32B) | ~64GB | ~32GB | ~17GB | RTX 6000 Pro 96 GB (INT8) | $299 |
| 70B (LLaMA 3 70B) | ~140GB | ~70GB | ~35GB | 1x RTX 5090 (INT4) | $149 |
| 70B (quality) | ~140GB | ~70GB | — | 2x RTX 6000 Pro 96 GB (FP16) | $599 |
| 120-130B | ~260GB | ~130GB | ~65GB | 2x RTX 6000 Pro 96 GB (INT8) | $599 |
| 200B+ (MoE) | ~400GB+ | ~200GB | ~100GB | 4x RTX 6000 Pro 96 GB | $899 |
Notice the cost jump between model sizes. Going from 7B to 70B increases your hosting cost from $99 to $149-$599. That is why right-sizing your model is the single most impactful cost optimisation you can make. Use our cost per million tokens calculator to compare.
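If you prefer to run the numbers yourself before reaching for the calculator, the comparison boils down to monthly price divided by monthly token throughput. A minimal sketch, where the throughput and utilisation figures are illustrative assumptions rather than benchmarks:

```python
# Back-of-envelope cost per million generated tokens. The throughput and
# utilisation values are assumptions for illustration; measure your own.
SECONDS_PER_MONTH = 30 * 24 * 3600

def cost_per_million_tokens(monthly_cost_usd: float,
                            tokens_per_sec: float,
                            utilisation: float = 0.5) -> float:
    tokens_per_month = tokens_per_sec * SECONDS_PER_MONTH * utilisation
    return monthly_cost_usd / tokens_per_month * 1_000_000

# e.g. a $149/month GPU sustaining ~50 tokens/sec at 50% utilisation
print(round(cost_per_million_tokens(149, 50), 2))  # ~$2.30 per million tokens
```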
Quantisation: Cut VRAM Costs in Half
Quantisation reduces model precision from FP16 (16-bit) to INT8 (8-bit) or INT4 (4-bit). This directly translates to lower VRAM requirements and therefore cheaper GPU servers:
| 70B Model | Precision | VRAM Needed | Cheapest GPU | Monthly Cost | Quality Impact |
|---|---|---|---|---|---|
| LLaMA 3 70B | FP16 | ~140GB | 2x RTX 6000 Pro 96 GB | $599 | Baseline (best) |
| LLaMA 3 70B | INT8 (GPTQ) | ~70GB | 1x RTX 6000 Pro 96 GB | $299 | ~1% quality loss |
| LLaMA 3 70B | INT4 (GPTQ) | ~35GB | 1x RTX 5090 | $149 | ~3-5% quality loss |
INT8 quantisation saves $300/month (50% cost reduction) with negligible quality impact. INT4 saves $450/month (75% reduction) with minor quality loss acceptable for many production use cases.
For most workloads, INT8 is the sweet spot. Reserve FP16 for tasks requiring maximum accuracy (medical, legal, financial). See detailed quantisation benchmarks in our best GPU for inference guide.
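The table above assumes pre-quantised GPTQ checkpoints. If you want to experiment with the trade-off before committing, an alternative is on-the-fly 8-bit or 4-bit loading via Hugging Face transformers with bitsandbytes; a minimal sketch, with the model ID as an illustrative assumption:

```python
# Minimal sketch: loading a model in INT8 or INT4 with bitsandbytes.
# Quality and VRAM figures differ slightly from GPTQ, but the cost
# trade-off is the same. The model ID is an assumed example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

int8_config = BitsAndBytesConfig(load_in_8bit=True)   # ~half of FP16 VRAM
int4_config = BitsAndBytesConfig(                      # ~quarter of FP16 VRAM
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=int8_config,  # swap in int4_config to go smaller
    device_map="auto",                # spread layers across available GPUs
)
```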
KV Cache: The Hidden VRAM Cost
Model weights are only part of the VRAM equation. The KV (key-value) cache stores attention state for each active request and grows with:
- Sequence length: longer conversations or documents use more KV cache
- Concurrent users: each simultaneous request needs its own KV cache
- Model architecture: models with more attention heads use more KV cache
Rule of thumb: reserve 20-40% of VRAM beyond model weights for KV cache and overhead. A 70B INT8 model uses ~70GB for weights but needs ~85-90GB total for comfortable production operation.
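The headroom figure comes from a simple calculation: for every cached token, the KV cache stores a key and a value vector for each layer and each KV head. A sketch of that arithmetic, using LLaMA 3 70B's published configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and ignoring fragmentation and runtime buffers:

```python
# KV cache size estimate: keys + values for every layer, KV head and
# cached token. Real usage is higher due to fragmentation and buffers.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                tokens_in_flight: int, bytes_per_elem: int = 2) -> float:
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * tokens_in_flight / 1e9

# LLaMA 3 70B: 10 concurrent users with 8K-token contexts, FP16 cache
print(round(kv_cache_gb(80, 8, 128, 10 * 8192), 1))  # ~26.8 GB of KV cache
```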
This is why a 96 GB card like the RTX 6000 Pro is recommended for 70B INT8: the ~26GB of headroom above the ~70GB of weights leaves ample KV cache room for concurrent users. vLLM’s PagedAttention optimises KV cache memory, maximising the number of concurrent requests your GPU can handle.
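A minimal vLLM sketch for this kind of setup is shown below; the checkpoint name is a placeholder for whichever quantised model you deploy, and the memory settings are starting points to tune, not recommendations.

```python
# Minimal vLLM configuration sketch for a quantised 70B checkpoint.
# The checkpoint name is a placeholder; tune memory settings per card.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-70b-gptq",   # placeholder GPTQ checkpoint
    quantization="gptq",
    gpu_memory_utilization=0.90,         # fraction of VRAM vLLM may claim
    max_model_len=8192,                  # cap context to bound the KV cache
)

outputs = llm.generate(
    ["Explain PagedAttention in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```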
GPU Cost Tiers: What You Get at Each Price
| Monthly Cost | GPU | VRAM | Max Model | Best For |
|---|---|---|---|---|
| $99 | RTX 3090 | 24GB | 7B FP16 / 13B INT8 | Small models, embeddings, Phi-3 |
| $149 | RTX 5090 | 32GB | 13B FP16 / 70B INT4 | Small-medium models, coding |
| $299 | RTX 6000 Pro 96 GB | 96GB | 30B FP16 / 70B INT8 | Medium models, production workloads |
| $599 | 2x RTX 6000 Pro 96 GB | 192GB | 70B FP16 / 130B INT8 | Large models, high quality |
| $899 | 4x RTX 6000 Pro 96 GB | 384GB | 200B+ MoE / INT8 | High throughput, massive models |
| $1,599 | 8x RTX 6000 Pro 96 GB | 768GB | 400B+ MoE / INT8 | Enterprise, multi-model clusters |
See our cheapest GPU for AI inference guide and RTX 3090 vs RTX 5090 comparison for detailed hardware analysis.
Match Your Workload to the Right GPU
| Workload | Recommended Model | VRAM Needed | Cheapest GPU | Monthly Cost |
|---|---|---|---|---|
| Customer chatbot | Mistral 7B or LLaMA 3 8B | 16-20GB | RTX 5090 | $149 |
| RAG / document QA | Qwen 32B + embeddings | 40-60GB | RTX 6000 Pro 96 GB | $299 |
| Premium chatbot | LLaMA 3 70B | 80-140GB | RTX 6000 Pro 96 GB (INT8) | $299 |
| Coding assistant | DeepSeek Coder 6.7B | 14-18GB | RTX 5090 | $149 |
| Video generation | Stable Video Diffusion | 24-80GB | RTX 6000 Pro 96 GB | $299 |
| Image generation | SDXL / Flux | 12-24GB | RTX 5090 | $149 |
| Speech / TTS | Whisper + XTTS | 8-16GB | RTX 3090 | $99 |
Cost Optimisation Tips
- Start with the smallest model that meets your quality bar. A fine-tuned 7B model often outperforms a generic 70B model on specific tasks.
- Use INT8 quantisation by default. The quality loss is negligible for most applications and it halves your VRAM (and cost).
- Run multiple small models on one GPU. A 24GB GPU can host a 7B chat model AND an embedding model simultaneously.
- Use vLLM for production. Its PagedAttention mechanism maximises concurrent users per GB of VRAM.
- Consider MoE models. DeepSeek-V2 has 236B parameters but only activates 21B, giving large-model quality at small-model VRAM usage.
- Benchmark before committing. Use our tokens per second benchmark to verify throughput.
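A quick way to get that throughput number on your own hardware is to time a batch of representative prompts; the sketch below uses vLLM, with the model ID and prompts as illustrative assumptions.

```python
# Rough tokens-per-second check with vLLM. Model ID and prompts are
# illustrative assumptions; use your real prompts and concurrency.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # assumed 7B model
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Write a short product description for a coffee grinder."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/sec across {len(prompts)} requests")
```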
For per-model cost breakdowns, see our guides for LLaMA 3, DeepSeek, Mistral, Qwen, and Phi-3. For the complete self-hosting economics, read our complete cost guide and ROI analysis.
Get the Right GPU for Your Budget
From $99/month for 24GB to $1,599/month for 768GB. Find your optimal configuration.
Browse GPU Servers