Why Fine-Tune LLaMA 3 8B
LLaMA 3 8B is one of the best models for fine-tuning: large enough to perform well on domain-specific tasks, small enough to train on a single dedicated GPU server. LoRA (Low-Rank Adaptation) makes this practical by training only a small fraction of the parameters, cutting VRAM requirements by 60-80% compared to full fine-tuning. This guide covers everything from hardware selection to cost estimation.
For baseline inference requirements see our LLaMA 3 VRAM requirements guide. For a comparison of fine-tuning methods, read LoRA vs QLoRA vs full fine-tuning.
VRAM Requirements by Method
Fine-tuning VRAM includes the model weights, optimiser states, gradients, and activations. LoRA and QLoRA dramatically reduce this by only training small adapter matrices.
| Method | Base Precision | Trainable Params | VRAM (batch=1) | VRAM (batch=4) | Minimum GPU |
|---|---|---|---|---|---|
| Full fine-tuning | FP16 | 8B (100%) | ~65 GB | ~80 GB | 2x RTX 5090 or RTX 6000 Pro 96 GB |
| LoRA (r=16) | FP16 | ~20M (0.25%) | ~22 GB | ~28 GB | RTX 3090 (24 GB) |
| LoRA (r=64) | FP16 | ~80M (1%) | ~24 GB | ~32 GB | RTX 3090 or RTX 5090 |
| QLoRA (r=16) | INT4 | ~20M (0.25%) | ~10 GB | ~14 GB | RTX 4060 Ti (16 GB) |
| QLoRA (r=64) | INT4 | ~80M (1%) | ~12 GB | ~18 GB | RTX 4060 Ti (16 GB) |
QLoRA loads the base model in INT4 (via bitsandbytes) while training the LoRA adapters in FP16. This makes fine-tuning possible on GPUs with as little as 10 GB VRAM. For broader VRAM planning, see our fine-tuning VRAM calculator.
Setup and Configuration
The following configuration uses the Hugging Face PEFT library with the transformers trainer. Install both libraries (`pip install peft transformers bitsandbytes`) on your PyTorch-enabled GPU server.
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    load_in_4bit=True,  # Remove for FP16 LoRA
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 19,988,480 (0.24% of 8B)
```
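Note that recent transformers releases deprecate the bare `load_in_4bit=True` argument in favour of an explicit `BitsAndBytesConfig`. A sketch of the equivalent NF4 setup, which is the quantisation recipe QLoRA typically uses (the `bnb_config` name is ours):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantisation with double quantisation
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on pre-Ampere GPUs
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The rest of the LoRA configuration above is unchanged; only the model loading step differs.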
Key hyperparameters to tune:
- Rank (r): 16 is a good default. Increase to 32-64 for complex domain adaptation; decrease to 8 for simple style transfer.
- Target modules: attention projections (q, k, v, o) are standard. Adding MLP layers (gate_proj, up_proj, down_proj) improves results but increases VRAM.
- Learning rate: 1e-4 to 3e-4 for LoRA; 2e-4 is a reliable starting point.
- Batch size: use gradient accumulation to simulate larger batches without increasing VRAM. Effective batch size of 32-64 typically works well.
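As a sanity check on the last point, the accumulation step count is simply the ratio of the target effective batch size to what fits in VRAM (the helper name is ours, for illustration):

```python
def grad_accum_steps(effective_batch: int, per_device_batch: int) -> int:
    """Optimiser-step accumulation needed to reach an effective batch size."""
    if effective_batch % per_device_batch != 0:
        raise ValueError("effective batch must be a multiple of per-device batch")
    return effective_batch // per_device_batch

# batch=4 fits on a 24 GB card; accumulate 8 steps for an effective batch of 32
print(grad_accum_steps(32, 4))  # → 8
```

The result goes into `gradient_accumulation_steps` in `TrainingArguments`; VRAM usage stays at the batch=4 level while the optimiser sees updates equivalent to batch 32.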
Training Time by GPU
Estimated training time for 1,000 examples at sequence length 512, 3 epochs, using QLoRA (r=16, batch=4 with gradient accumulation).
| GPU | VRAM | QLoRA Time | LoRA (FP16) Time |
|---|---|---|---|
| RTX 4060 Ti | 16 GB | ~45 min | N/A (insufficient VRAM) |
| RTX 3090 | 24 GB | ~25 min | ~35 min |
| RTX 5080 | 16 GB | ~20 min | N/A (insufficient VRAM) |
| RTX 5090 | 32 GB | ~12 min | ~18 min |
| RTX 6000 Pro 96 GB | 96 GB | ~8 min | ~10 min |
Scaling to larger datasets: 10K examples takes roughly 10x the above times, and 100K examples takes ~100x. For detailed timing across more GPUs, see our fine-tuning time by GPU benchmarks. For GPU selection advice, check the best GPU for fine-tuning LLMs guide.
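The linear scaling rule above can be captured in a throwaway helper (names are ours):

```python
def scaled_minutes(base_minutes_per_1k: float, num_examples: int) -> float:
    """Training time scales roughly linearly with dataset size."""
    return base_minutes_per_1k * num_examples / 1_000

# RTX 3090 QLoRA: ~25 min per 1K examples (from the table above)
print(scaled_minutes(25, 10_000))   # → 250.0 minutes (~4.2 hours)
print(scaled_minutes(25, 100_000))  # → 2500.0 minutes (~42 hours)
```

This is an approximation: tokenisation overhead and checkpoint saves add a few minutes regardless of dataset size, so very small runs skew slightly higher.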
Cost Estimates
Based on GigaGPU hourly rates for dedicated GPU servers and the QLoRA training times above.
| GPU | Approx. Hourly Rate | Cost for 1K Examples | Cost for 10K Examples |
|---|---|---|---|
| RTX 4060 Ti | ~£0.10/hr | ~£0.08 | ~£0.75 |
| RTX 3090 | ~£0.15/hr | ~£0.06 | ~£0.63 |
| RTX 5090 | ~£0.35/hr | ~£0.07 | ~£0.70 |
| RTX 6000 Pro 96 GB | ~£1.20/hr | ~£0.16 | ~£1.60 |
The RTX 3090 offers the best cost-efficiency for QLoRA fine-tuning — fast enough to avoid wasting time, affordable enough to keep costs under a pound for most datasets.
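The table values are simple arithmetic: training hours multiplied by the hourly rate. A small helper (ours, for illustration) for estimating your own runs:

```python
def training_cost_gbp(minutes: float, hourly_rate_gbp: float) -> float:
    """Cost of a training run billed by the hour, rounded to the nearest penny."""
    return round(minutes / 60 * hourly_rate_gbp, 2)

# RTX 4060 Ti at ~£0.10/hr: ~45 min for 1K examples, ~450 min for 10K
print(training_cost_gbp(45, 0.10))   # → 0.08
print(training_cost_gbp(450, 0.10))  # → 0.75
```

Note this covers compute time only; billing granularity (most providers bill by the hour, not the minute) can raise the real cost of short runs.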
Conclusion
Fine-tuning LLaMA 3 8B with QLoRA is accessible on GPUs as modest as the RTX 4060 Ti (16 GB), and costs under £1 for datasets up to 10K examples. LoRA at FP16 requires 24+ GB but produces slightly better quality adapters. For production fine-tuning workflows, a dedicated LLaMA hosting server with an RTX 3090 or RTX 5090 is the sweet spot between cost and performance.
Fine-Tune LLaMA 3 on Dedicated GPUs
GPU servers with PyTorch, CUDA, and PEFT pre-installed. Ready for LoRA and QLoRA fine-tuning out of the box.
Browse GPU Servers