Why GPU Choice Is Critical for Fine-Tuning
Fine-tuning an LLM is the most VRAM-intensive task in the AI pipeline. Inference needs to hold the model weights and a KV cache. Training needs the weights, gradients, optimiser states, and activation memory. A model that infers comfortably on a GPU will often OOM during fine-tuning on the same card. Choosing the right dedicated GPU server for training saves hours of debugging memory errors.
We benchmarked LoRA (parameter-efficient) and full fine-tuning across six GPUs using PyTorch and the Hugging Face Transformers + PEFT stack. This guide covers the practical details that spec sheets leave out.
VRAM Requirements: LoRA vs Full Fine-Tuning
The VRAM difference between LoRA and full fine-tuning is dramatic. Here is what each method requires for popular model sizes.
| Model | LoRA (rank 16) VRAM | LoRA (rank 64) VRAM | Full FT (AdamW) VRAM | Full FT (bf16+DeepSpeed) VRAM |
|---|---|---|---|---|
| Phi-3 Mini 3.8B | ~8 GB | ~10 GB | ~28 GB | ~18 GB |
| Llama 3 8B | ~14 GB | ~18 GB | ~62 GB | ~38 GB |
| Mistral 7B v0.3 | ~13 GB | ~17 GB | ~56 GB | ~34 GB |
| Qwen 2.5 14B | ~22 GB | ~30 GB | ~112 GB | ~68 GB |
| Llama 3 70B | ~48 GB | ~72 GB | ~520 GB | ~280 GB |
Key takeaway: LoRA at rank 16 makes 7-8B fine-tuning possible on a single 24 GB GPU, while full fine-tuning of even a 3.8B model needs more than 24 GB without memory optimisations. For full fine-tuning of anything above 8B, multi-GPU clusters are effectively mandatory.
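The adapter weights themselves are tiny; most of LoRA's savings come from not holding gradients or optimiser states for the frozen base model. A quick sketch using Llama 3 8B's published layer shapes counts the trainable parameters LoRA adds at each rank:

```python
# Count trainable parameters LoRA adds to Llama 3 8B when adapters are
# attached to every linear projection. Each adapted matrix of shape
# (d_in, d_out) gains two low-rank factors: A (d_in x r) and B (r x d_out).

# (d_in, d_out) per decoder layer: hidden 4096, GQA with 8 KV heads of
# dim 128 (so k/v project to 1024), MLP intermediate 14336.
LAYER_SHAPES = [
    (4096, 4096),    # q_proj
    (4096, 1024),    # k_proj
    (4096, 1024),    # v_proj
    (4096, 4096),    # o_proj
    (4096, 14336),   # gate_proj
    (4096, 14336),   # up_proj
    (14336, 4096),   # down_proj
]
NUM_LAYERS = 32

def lora_trainable_params(rank: int) -> int:
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in LAYER_SHAPES)
    return per_layer * NUM_LAYERS

for r in (16, 64):
    print(f"rank {r}: {lora_trainable_params(r) / 1e6:.1f}M trainable params")
# rank 16 -> 41.9M, rank 64 -> 167.8M
```

At rank 16 that is roughly 42M trainable parameters, about 0.5% of the 8B base model, which is why the gradient and optimiser-state memory stays so small.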
LoRA Fine-Tuning Benchmarks
We fine-tuned Llama 3 8B with LoRA (rank 16, alpha 32) on the Alpaca dataset (52K samples) using bf16 mixed precision, batch size 4 with gradient accumulation of 4 (effective batch 16), and 3 epochs.
| GPU | VRAM | Peak VRAM Used | Training Time | Samples/sec | Server $/hr |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB | 14.2 GB | 38 min | 68.4 | $1.80 |
| RTX 4090 | 24 GB | 14.1 GB | 52 min | 50.0 | $1.10 |
| RTX 3090 | 24 GB | 14.3 GB | 78 min | 33.3 | $0.45 |
| RTX 5080 | 16 GB | 14.0 GB | 55 min | 47.3 | $0.85 |
| RTX 6000 Pro | 48 GB | 14.2 GB | 48 min | 54.2 | $1.30 |
| RTX A6000 | 48 GB | 14.2 GB | 68 min | 38.2 | $0.90 |
The RTX 5080 only just fits this workload at 14.0 GB peak; any higher LoRA rank, longer sequence length, or larger batch size would push it over. The 24 GB cards leave roughly 10 GB of headroom, making them safer choices. An 8 GB card such as the RTX 4060 cannot run this workload at all.
LoRA Rank 64 (Higher Quality Fine-Tune)
| GPU | Peak VRAM (Llama 3 8B, r=64) | Status |
|---|---|---|
| RTX 5090 | 18.4 GB | Runs fine |
| RTX 4090 | 18.3 GB | Runs fine |
| RTX 3090 | 18.5 GB | Runs fine (5.5 GB headroom) |
| RTX 5080 | OOM | Exceeds 16 GB |
| RTX 6000 Pro | 18.3 GB | Runs fine |
At LoRA rank 64, the RTX 5080 drops out. For higher-quality LoRA fine-tuning, you need at least 24 GB. This is another scenario where the RTX 3090’s 24 GB proves its worth despite being an older architecture.
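Whichever card you pick, it pays to verify headroom empirically with a short dry run before committing to hours of training; PyTorch tracks peak allocation directly. A small helper, guarded so it also imports on CPU-only machines:

```python
import torch

def peak_vram_gb(device: int = 0) -> float:
    """Peak VRAM allocated by this process, in GiB (0.0 if no GPU)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated(device) / 1024**3

# Typical use around a few warm-up steps:
#   torch.cuda.reset_peak_memory_stats()
#   ... run 10-20 training steps ...
#   print(f"peak: {peak_vram_gb():.1f} GiB")
```

If the peak after 10-20 steps is within a couple of GB of the card's capacity, a longer sequence in the dataset will likely OOM the full run.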
Full Fine-Tuning Benchmarks
Full parameter fine-tuning requires dramatically more VRAM. Even with bf16 and gradient checkpointing, a single GPU can only handle small models.
| Model | RTX 3090 (24 GB) | RTX 4090 (24 GB) | RTX 5090 (32 GB) | RTX 6000 Pro (48 GB) |
|---|---|---|---|---|
| Phi-3 Mini 3.8B | OOM (needs ~28 GB) | OOM | Runs (31 min/epoch) | Runs (28 min/epoch) |
| Llama 3 8B | OOM | OOM | OOM (needs ~38 GB) | Runs (72 min/epoch) |
| Mistral 7B | OOM | OOM | OOM (needs ~34 GB) | Runs (65 min/epoch) |
Full fine-tuning of 7-8B models requires 34-38 GB even with optimisations. Among single cards, only the RTX 6000 Pro (48 GB) can handle it; everything else needs a multi-GPU setup. For most teams, LoRA is the practical choice unless you have enterprise-grade hardware. Our self-host LLM guide covers both inference and fine-tuning setup end to end.
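The "bf16+DeepSpeed" column in the VRAM table corresponds to sharding optimiser states, gradients, and parameters with ZeRO. A minimal ZeRO-3 configuration, expressed as the Python dict DeepSpeed accepts (values are illustrative, not the exact benchmark config):

```python
# Illustrative DeepSpeed ZeRO-3 config: shards parameters, gradients,
# and optimiser states across GPUs, and trains in bf16.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                   # ZeRO-3: shard params + grads + optim states
        "overlap_comm": True,         # overlap communication with compute
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": 4,
    "train_micro_batch_size_per_gpu": 4,
}
```

Pass it via `TrainingArguments(deepspeed=ds_config)` (a path to an equivalent JSON file also works), and combine it with `model.gradient_checkpointing_enable()` to trade compute for activation memory.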
Cost per Training Run
What does a complete LoRA fine-tuning run cost on each GPU? The figures below use the Llama 3 8B / Alpaca / 3-epoch setup from above.
| GPU | Training Time | Server $/hr | Cost per Run | Cost per Epoch |
|---|---|---|---|---|
| RTX 3090 | 78 min | $0.45 | $0.59 | $0.20 |
| RTX 5080 | 55 min | $0.85 | $0.78 | $0.26 |
| RTX 4090 | 52 min | $1.10 | $0.95 | $0.32 |
| RTX A6000 | 68 min | $0.90 | $1.02 | $0.34 |
| RTX 6000 Pro | 48 min | $1.30 | $1.04 | $0.35 |
| RTX 5090 | 38 min | $1.80 | $1.14 | $0.38 |
The RTX 3090 comes in at $0.59 per complete training run, roughly 25% cheaper than the next cheapest option. When you are iterating on hyperparameters and running dozens of experiments, that difference adds up fast. Cross-reference with our cheapest GPU for AI inference analysis to find a GPU that handles both training and serving.
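The per-run figures are simply hourly rate multiplied by wall-clock hours; a small helper makes it easy to re-run the comparison with your own rental prices (table figures may differ by a cent due to rounding):

```python
def cost_per_run(minutes: float, dollars_per_hour: float) -> float:
    """Total cost of one training run billed by the hour."""
    return minutes / 60 * dollars_per_hour

# Benchmark figures from the table above: (training minutes, $/hr).
runs = {
    "RTX 3090": (78, 0.45),
    "RTX 5080": (55, 0.85),
    "RTX 5090": (38, 1.80),
}
for gpu, (minutes, rate) in runs.items():
    print(f"{gpu}: ${cost_per_run(minutes, rate):.2f} per run")
```

Multiply by your expected number of experiments to compare total iteration cost rather than single-run cost.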
GPU Recommendations
LoRA fine-tuning (7-8B models):
- Best value: RTX 3090 — $0.59/run, handles rank 16-64, 24 GB for safety margin
- Fastest: RTX 5090 — 38 min/run, but 1.9x the cost per run of the RTX 3090
- Avoid: RTX 4060 (8 GB OOM), RTX 5080 (16 GB too tight for rank 64+)
LoRA fine-tuning (13-14B models):
- Minimum: RTX 3090 or RTX 4090 (24 GB, rank 16 only)
- Comfortable: RTX 5090 (32 GB) or RTX 6000 Pro (48 GB)
- For higher ranks: Multi-GPU clusters
Full fine-tuning:
- 3.8B models: RTX 5090 (32 GB, single card)
- 7-8B models: RTX 6000 Pro (48 GB) or dual-GPU setup
- 14B+ models: Multi-GPU is mandatory
After fine-tuning, you will want to serve your model efficiently. See our vLLM vs Ollama comparison and the RTX 3090 vs RTX 5090 inference benchmarks for deployment guidance. Browse all GPU options in the GPU comparisons category.
Fine-Tune Your Own LLM Today
Get a dedicated GPU server with PyTorch, CUDA, and Hugging Face pre-installed. Start fine-tuning in minutes, not hours.
Browse GPU Servers
FAQ
Can I fine-tune Llama 3 8B on a single RTX 3090?
Yes, with LoRA. At rank 16 it uses ~14 GB, and at rank 64 it uses ~18 GB, both well within the 3090’s 24 GB. Full parameter fine-tuning requires 38+ GB and is not possible on a single 24 GB card.
Is QLoRA worth using to save VRAM?
QLoRA (quantised base weights + LoRA adapters) reduces VRAM by ~30-40% compared to standard LoRA with FP16 base weights. It is a good option for fitting 14B models on 24 GB cards, though training is ~15% slower due to the quantisation overhead.
How long does a typical fine-tuning run take?
On an RTX 3090 with LoRA rank 16 and the 52K-sample Alpaca dataset, expect about 78 minutes for 3 epochs. Larger datasets scale linearly. A 500K-sample dataset would take roughly 12-13 hours.
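That estimate is just the measured throughput applied to the larger dataset:

```python
def estimated_hours(samples: int, epochs: int = 3,
                    samples_per_sec: float = 33.3) -> float:
    """Projected wall-clock hours at the RTX 3090's measured LoRA throughput."""
    return samples * epochs / samples_per_sec / 3600

print(f"{estimated_hours(500_000):.1f} h")   # ~12.5 h for 500K samples
```

Swap in the samples/sec figure for your GPU from the benchmark table to project run time on other cards.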
Should I fine-tune or use RAG?
Fine-tuning is best for teaching a model new behaviour, style, or domain-specific reasoning. RAG (retrieval-augmented generation) is better for giving a model access to specific facts or documents. Many production systems use both. Our self-host LLM guide covers both approaches, and the DeepSeek deployment guide shows a practical example.