
Fine-Tuning VRAM Calculator: How Much Do You Need?

A practical VRAM calculator for LLM fine-tuning covering LoRA, QLoRA, and full fine-tuning across model sizes from 7B to 70B, with GPU recommendations.

Why VRAM Calculation Matters

Running out of VRAM mid-training wastes hours and money. Before spinning up a dedicated GPU server for fine-tuning, you need to know exactly how much memory your training run will require. The VRAM footprint depends on the model size, fine-tuning method, sequence length, batch size, and precision — and the interactions between these variables are not always intuitive.

This guide provides formulas and reference tables so you can calculate your VRAM needs before committing to hardware. For method-specific details, see our LoRA vs QLoRA vs full fine-tuning comparison.

The VRAM Formula

Total training VRAM has four components:

Total VRAM = Model Weights + Optimiser States + Gradients + Activations

| Component | Full FT (FP16) | LoRA (FP16) | QLoRA (INT4) |
|---|---|---|---|
| Model weights | 2 bytes × params | 2 bytes × params | 0.5 bytes × params |
| Optimiser states | 8 bytes × params | 8 bytes × trainable | 8 bytes × trainable |
| Gradients | 2 bytes × params | 2 bytes × trainable | 2 bytes × trainable |
| Activations | ~2-4 GB (varies) | ~1-2 GB | ~1-2 GB |

For full fine-tuning, the optimiser (Adam) stores first and second moment estimates (8 bytes per parameter), making it the dominant cost. LoRA and QLoRA only compute optimiser states for the trainable adapter parameters (typically 0.1-1% of total), which is why they use dramatically less VRAM.

Example calculation for LLaMA 3 8B full fine-tuning:

  • Model weights: 8B × 2 bytes = 16 GB
  • Optimiser states: 8B × 8 bytes = 64 GB
  • Gradients: 8B × 2 bytes = 16 GB
  • Activations: ~4 GB (assuming gradient checkpointing)
  • Total: ~100 GB
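The four-component formula can be sketched as a small Python function. The adapter fraction and activation figure are illustrative assumptions (they vary with rank, sequence length, and framework overhead), so treat the output as a rough estimate, not a measurement:

```python
def estimate_vram_gb(params_b, method="full", trainable_frac=0.0025,
                     activations_gb=4.0):
    """Rough training-VRAM estimate in GB using the four-component formula.

    params_b: model size in billions of parameters.
    trainable_frac: adapter share of parameters for LoRA/QLoRA (illustrative).
    activations_gb: activation memory estimate (illustrative; pass ~1-2 for
    LoRA/QLoRA, ~4 for full FT with gradient checkpointing).
    Omits CUDA context and framework overhead.
    """
    params = params_b * 1e9
    if method == "full":
        weight_bytes, trainable = 2, params            # FP16 weights, all trainable
    elif method == "lora":
        weight_bytes, trainable = 2, params * trainable_frac   # FP16 base frozen
    elif method == "qlora":
        weight_bytes, trainable = 0.5, params * trainable_frac  # INT4 base frozen
    else:
        raise ValueError(f"unknown method: {method}")
    weights = params * weight_bytes
    optimiser = trainable * 8    # Adam: two FP32 moment estimates
    gradients = trainable * 2    # FP16 gradients for trainable params only
    return (weights + optimiser + gradients) / 1e9 + activations_gb

# LLaMA 3 8B full fine-tuning: 16 + 64 + 16 + 4 = 100 GB
print(round(estimate_vram_gb(8.0, "full")))  # 100
```

Running the same function with `method="qlora"` shows why the adapter methods are so much cheaper: only the tiny trainable fraction pays the optimiser and gradient cost.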

Quick Reference Tables

Pre-calculated VRAM for common model sizes, assuming batch size 4 (via gradient accumulation) and LoRA rank 16; sequence length varies by column. Measured on GigaGPU servers with PyTorch.

QLoRA VRAM (most common method)

| Model | Params | Seq 512 | Seq 1024 | Seq 2048 |
|---|---|---|---|---|
| Mistral 7B | 7.2B | ~13 GB | ~15 GB | ~19 GB |
| LLaMA 3 8B | 8.0B | ~14 GB | ~16 GB | ~21 GB |
| Qwen 2.5 14B | 14.2B | ~22 GB | ~26 GB | ~33 GB |
| LLaMA 3 70B | 70.6B | ~52 GB | ~60 GB | ~78 GB |
| Qwen 2.5 72B | 72.7B | ~54 GB | ~62 GB | ~80 GB |

LoRA (FP16) VRAM

| Model | Params | Seq 512 | Seq 1024 | Seq 2048 |
|---|---|---|---|---|
| Mistral 7B | 7.2B | ~26 GB | ~30 GB | ~38 GB |
| LLaMA 3 8B | 8.0B | ~28 GB | ~33 GB | ~42 GB |
| Qwen 2.5 14B | 14.2B | ~45 GB | ~52 GB | ~66 GB |
| LLaMA 3 70B | 70.6B | ~180 GB | ~200 GB | ~240 GB |

Full Fine-Tuning VRAM

| Model | Params | Seq 512 (with grad ckpt) | Seq 512 (without grad ckpt) |
|---|---|---|---|
| Mistral 7B | 7.2B | ~72 GB | ~90 GB |
| LLaMA 3 8B | 8.0B | ~80 GB | ~100 GB |
| LLaMA 3 70B | 70.6B | ~700 GB | ~900 GB |

Variables That Change Your VRAM

  • Sequence length: longer sequences increase activation memory. Doubling from 512 to 1024 adds roughly 15-25% to total VRAM. Use gradient checkpointing to reduce this at the cost of ~20% slower training.
  • Batch size: each additional sample in the batch stores its own activations. Use gradient accumulation (small per-GPU batch, many accumulation steps) to simulate large batches without proportionally increasing VRAM.
  • LoRA rank: higher rank means more trainable parameters. Rank 16 uses ~0.25% of params; rank 64 uses ~1%. Doubling rank roughly doubles the adapter VRAM but not the base model VRAM.
  • Target modules: applying LoRA to MLP layers (gate, up, down projections) in addition to attention roughly doubles the trainable parameter count and VRAM for adapters.
  • Optimiser: Adam stores two FP32 moment estimates (8 bytes/param). Adafactor uses ~4 bytes/param, as does SGD with momentum. Switching from Adam to 8-bit Adam (bitsandbytes) cuts optimiser memory to roughly a quarter (~2 bytes/param).
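The LoRA rank and target-module effects above are easy to quantify: each targeted linear layer of shape (d_out, d_in) adds rank × (d_in + d_out) trainable parameters. A minimal sketch, using illustrative dimensions for an 8B-class model (hidden size 4096, grouped-query k/v projections of 1024, 32 blocks — assumptions, not any specific model's published config):

```python
def lora_trainable_params(layer_shapes, rank, num_layers):
    """Trainable parameters added by LoRA adapters.

    For each targeted linear layer of shape (d_out, d_in), LoRA adds two
    low-rank factors A (rank x d_in) and B (d_out x rank), i.e.
    rank * (d_in + d_out) parameters per layer.
    """
    per_block = sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)
    return per_block * num_layers

# Hypothetical attention-only targets: q and o at 4096x4096, k and v at
# 1024x4096 (grouped-query attention), repeated over 32 transformer blocks.
attn = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]
r16 = lora_trainable_params(attn, rank=16, num_layers=32)
r64 = lora_trainable_params(attn, rank=64, num_layers=32)
print(f"rank 16: {r16 / 1e6:.1f}M params, rank 64: {r64 / 1e6:.1f}M params")
```

Because trainable parameters scale linearly with rank, going from rank 16 to rank 64 multiplies the adapter (and its optimiser and gradient memory) by exactly four, while the frozen base model's footprint is unchanged.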

GPU Mapping Guide

| VRAM Budget | GPU Options | What You Can Fine-Tune |
|---|---|---|
| 8-10 GB | RTX 4060 | 7-8B QLoRA (r=8, seq 512) |
| 16 GB | RTX 4060 Ti, RTX 5080 | 7-8B QLoRA (r=64, seq 1024) |
| 24 GB | RTX 3090 | 7-8B LoRA (FP16), 14B QLoRA |
| 32 GB | RTX 5090 | 14B LoRA (FP16), 70B QLoRA (tight) |
| 48 GB | 2x RTX 3090 | 14B LoRA (FP16, tight), 70B QLoRA |
| 64 GB | 2x RTX 5090 | 70B QLoRA comfortably |
| 80 GB | RTX 6000 Pro | 7-8B full FT, 70B QLoRA with headroom |
| 160+ GB | Multi-GPU cluster | 70B LoRA (FP16), 70B full FT |

For specific model guides, see LLaMA 3 8B fine-tuning, Mistral 7B fine-tuning, and DeepSeek fine-tuning. Browse all tutorials in the Tutorials category.

Conclusion

VRAM planning prevents wasted time and money. QLoRA is the most accessible method, putting 7B model fine-tuning within reach of 16 GB GPUs and 70B models within 64 GB. LoRA at FP16 requires roughly double the VRAM but produces marginally higher-quality adapters. Full fine-tuning demands about 12 bytes per parameter (roughly six times the FP16 weight size) and is practical only on RTX 6000 Pro-class hardware or multi-GPU clusters. Calculate before you deploy, and you will pick the right GPU server the first time.

Get the Right GPU for Fine-Tuning

Dedicated servers from 8 GB to multi-GPU clusters. Pre-configured with PyTorch, CUDA, and PEFT libraries.

Browse GPU Servers

