Why VRAM Calculation Matters
Running out of VRAM mid-training wastes hours and money. Before spinning up a dedicated GPU server for fine-tuning, you need to know exactly how much memory your training run will require. The VRAM footprint depends on the model size, fine-tuning method, sequence length, batch size, and precision — and the interactions between these variables are not always intuitive.
This guide provides formulas and reference tables so you can calculate your VRAM needs before committing to hardware. For method-specific details, see our LoRA vs QLoRA vs full fine-tuning comparison.
The VRAM Formula
Total training VRAM has four components:
Total VRAM = Model Weights + Optimiser States + Gradients + Activations
| Component | Full FT (FP16) | LoRA (FP16) | QLoRA (INT4) |
|---|---|---|---|
| Model weights | 2 bytes x params | 2 bytes x params | 0.5 bytes x params |
| Optimiser states | 8 bytes x params | 8 bytes x trainable | 8 bytes x trainable |
| Gradients | 2 bytes x params | 2 bytes x trainable | 2 bytes x trainable |
| Activations | ~2-4 GB (varies) | ~1-2 GB | ~1-2 GB |
For full fine-tuning, the optimiser (Adam) stores first and second moment estimates (8 bytes per parameter), making it the dominant cost. LoRA and QLoRA only compute optimiser states for the trainable adapter parameters (typically 0.1-1% of total), which is why they use dramatically less VRAM.
Example calculation for LLaMA 3 8B full fine-tuning:
- Model weights: 8B x 2 bytes = 16 GB
- Optimiser states: 8B x 8 bytes = 64 GB
- Gradients: 8B x 2 bytes = 16 GB
- Activations: ~4 GB
- Total: ~100 GB (the ~4 GB activation figure already assumes gradient checkpointing)
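To sanity-check these numbers for other model sizes, the formula translates into a few lines of Python. This is a back-of-envelope sketch (the function name and defaults are illustrative, not from any library); measured usage will run higher once framework overhead and activation spikes are included.

```python
def estimate_vram_gb(params_b, method="full", trainable_fraction=1.0, activations_gb=4.0):
    """Rough training VRAM estimate in GB, following the formula above.

    params_b           -- total parameters in billions
    method             -- "full", "lora", or "qlora"
    trainable_fraction -- share of params that train (1.0 for full FT,
                          ~0.0025 for a rank-16 LoRA adapter)
    activations_gb     -- rough activation memory for your seq len / batch size
    """
    params = params_b * 1e9
    weight_bytes = {"full": 2, "lora": 2, "qlora": 0.5}[method]  # FP16 vs INT4 weights
    trainable = params if method == "full" else params * trainable_fraction
    weights = params * weight_bytes
    optimiser = trainable * 8   # Adam: two FP32 moment estimates
    gradients = trainable * 2   # FP16 gradients for trainable params only
    return (weights + optimiser + gradients) / 1e9 + activations_gb

# LLaMA 3 8B full fine-tune: 16 + 64 + 16 + 4 = ~100 GB, matching the example above.
print(estimate_vram_gb(8.0, method="full"))
```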
Quick Reference Tables
Pre-calculated VRAM for common model sizes. Assumes batch size 4 (via gradient accumulation) and LoRA rank 16; sequence length varies per column. Measured on GigaGPU servers with PyTorch.
QLoRA VRAM (most common method)
| Model | Params | Seq 512 | Seq 1024 | Seq 2048 |
|---|---|---|---|---|
| Mistral 7B | 7.2B | ~13 GB | ~15 GB | ~19 GB |
| LLaMA 3 8B | 8.0B | ~14 GB | ~16 GB | ~21 GB |
| Qwen 2.5 14B | 14.2B | ~22 GB | ~26 GB | ~33 GB |
| LLaMA 3 70B | 70.6B | ~52 GB | ~60 GB | ~78 GB |
| Qwen 2.5 72B | 72.7B | ~54 GB | ~62 GB | ~80 GB |
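The 0.5 bytes/param weight figure comes from loading the base model in 4-bit. A minimal loading sketch, assuming the transformers and bitsandbytes libraries (the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantisation keeps base weights at ~0.5 bytes/param; compute runs in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # placeholder; swap in your base model
    quantization_config=bnb_config,
    device_map="auto",
)
```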
LoRA (FP16) VRAM
| Model | Params | Seq 512 | Seq 1024 | Seq 2048 |
|---|---|---|---|---|
| Mistral 7B | 7.2B | ~26 GB | ~30 GB | ~38 GB |
| LLaMA 3 8B | 8.0B | ~28 GB | ~33 GB | ~42 GB |
| Qwen 2.5 14B | 14.2B | ~45 GB | ~52 GB | ~66 GB |
| LLaMA 3 70B | 70.6B | ~180 GB | ~200 GB | ~240 GB |
Full Fine-Tuning VRAM
| Model | Params | Seq 512 (w/ grad ckpt) | Without Grad Ckpt |
|---|---|---|---|
| Mistral 7B | 7.2B | ~72 GB | ~90 GB |
| LLaMA 3 8B | 8.0B | ~80 GB | ~100 GB |
| LLaMA 3 70B | 70.6B | ~700 GB | ~900 GB |
Variables That Change Your VRAM
- Sequence length: longer sequences increase activation memory. Doubling from 512 to 1024 adds roughly 15-25% to total VRAM. Use gradient checkpointing to reduce this at the cost of ~20% slower training.
- Batch size: each additional sample in the batch stores its own activations. Use gradient accumulation (small per-GPU batch, many accumulation steps) to simulate large batches without proportionally increasing VRAM.
- LoRA rank: higher rank means more trainable parameters. Rank 16 uses ~0.25% of params; rank 64 uses ~1%. Doubling rank roughly doubles the adapter VRAM but not the base model VRAM.
- Target modules: applying LoRA to MLP layers (gate, up, down projections) in addition to attention roughly doubles the trainable parameter count and VRAM for adapters.
- Optimiser: Adam uses 8 bytes/param. AdaFactor uses ~4 bytes/param. SGD with momentum uses ~4 bytes/param. Switching from Adam to 8-bit Adam (bitsandbytes) stores both moments in 8 bits, cutting optimiser state to roughly 2 bytes/param.
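Most of these knobs map one-to-one onto standard Hugging Face peft/transformers settings. The sketch below is illustrative only (model, rank, and hyperparameters are placeholders, assuming the peft and transformers libraries); adapt it to your own run.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Rank and target modules set the trainable adapter parameter count.
lora_config = LoraConfig(
    r=16,                        # doubling r roughly doubles adapter VRAM
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only; adding
    # the MLP projections roughly doubles the trainable parameter count
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Batch size, accumulation, checkpointing, and optimiser choice cover the rest.
training_args = TrainingArguments(
    output_dir="finetune-out",
    per_device_train_batch_size=1,    # keep the per-GPU batch small...
    gradient_accumulation_steps=16,   # ...and accumulate to an effective batch of 16
    gradient_checkpointing=True,      # lower activation memory, ~20% slower steps
    optim="paged_adamw_8bit",         # 8-bit Adam states via bitsandbytes
    bf16=True,
)
```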
GPU Mapping Guide
| VRAM Budget | GPU Options | What You Can Fine-Tune |
|---|---|---|
| 8-10 GB | RTX 4060 | 7-8B QLoRA (r=8, seq 512) |
| 16 GB | RTX 4060 Ti, RTX 5080 | 7-8B QLoRA (r=64, seq 1024) |
| 24 GB | RTX 3090 | 7-8B LoRA (FP16), 14B QLoRA |
| 32 GB | RTX 5090 | 14B LoRA (FP16), 70B QLoRA (tight) |
| 48 GB | 2x RTX 3090 | 14B full FT (tight), 70B QLoRA |
| 64 GB | 2x RTX 5090 | 70B QLoRA comfortably |
| 80 GB | RTX 6000 Pro | 7-8B full FT, 70B QLoRA with headroom |
| 160+ GB | Multi-GPU cluster | 70B LoRA (FP16), 70B full FT |
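Once you have a budget in mind, it is worth confirming what the GPU actually reports before launching a run. A quick check with PyTorch (assuming a CUDA device is visible):

```python
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {total_bytes / 1024**3:.1f} GB total, {free_bytes / 1024**3:.1f} GB free")
else:
    print("No CUDA device visible")
```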
For specific model guides, see LLaMA 3 8B fine-tuning, Mistral 7B fine-tuning, and DeepSeek fine-tuning. Browse all tutorials in the Tutorials category.
Conclusion
VRAM planning prevents wasted time and money. QLoRA is the most accessible method, putting 7B model fine-tuning within reach of 16 GB GPUs and 70B models within 64 GB. LoRA at FP16 requires roughly double the VRAM but produces marginally higher-quality adapters. Full fine-tuning demands roughly six times the FP16 weight size (on the order of 100 GB for an 8B model) and is practical only on RTX 6000 Pro-class or multi-GPU hardware. Calculate before you deploy, and you will pick the right GPU server the first time.
Get the Right GPU for Fine-Tuning
Dedicated servers from 8 GB to multi-GPU clusters. Pre-configured with PyTorch, CUDA, and PEFT libraries.
Browse GPU Servers