
Fine-Tune LLaMA 3 8B with LoRA: GPU & VRAM Guide

Step-by-step guide to fine-tuning LLaMA 3 8B with LoRA and QLoRA, including VRAM requirements, GPU recommendations, training time estimates, and cost analysis.

Why Fine-Tune LLaMA 3 8B

LLaMA 3 8B is one of the best models for fine-tuning: large enough to perform well on domain-specific tasks, small enough to train on a single dedicated GPU server. LoRA (Low-Rank Adaptation) makes this practical by training only a small fraction of the parameters, cutting VRAM requirements by 60-80% compared to full fine-tuning. This guide covers everything from hardware selection to cost estimation.

For baseline inference requirements, see our LLaMA 3 VRAM requirements guide. For a comparison of fine-tuning methods, read LoRA vs QLoRA vs full fine-tuning.

VRAM Requirements by Method

Fine-tuning VRAM includes the model weights, optimiser states, gradients, and activations. LoRA and QLoRA dramatically reduce this by only training small adapter matrices.

| Method | Base Precision | Trainable Params | VRAM (batch=1) | VRAM (batch=4) | Minimum GPU |
|---|---|---|---|---|---|
| Full fine-tuning | FP16 | 8B (100%) | ~65 GB | ~80 GB | 2x RTX 5090 or RTX 6000 Pro 96 GB |
| LoRA (r=16) | FP16 | ~20M (0.25%) | ~22 GB | ~28 GB | RTX 3090 (24 GB) |
| LoRA (r=64) | FP16 | ~80M (1%) | ~24 GB | ~32 GB | RTX 3090 or RTX 5090 |
| QLoRA (r=16) | INT4 | ~20M (0.25%) | ~10 GB | ~14 GB | RTX 4060 Ti (16 GB) |
| QLoRA (r=64) | INT4 | ~80M (1%) | ~12 GB | ~18 GB | RTX 4060 Ti (16 GB) |

QLoRA loads the base model in INT4 (via bitsandbytes) while training the LoRA adapters in FP16. This makes fine-tuning possible on GPUs with as little as 10 GB VRAM. For broader VRAM planning, see our fine-tuning VRAM calculator.
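The ~10 GB figure in the table can be sanity-checked with a back-of-envelope calculation: INT4 base weights at 0.5 bytes per parameter, plus FP16 adapters and gradients, plus FP32 AdamW optimiser states for the adapters only. The activation/overhead allowance below is an assumption chosen to reflect a typical batch=1, sequence-length-512 run, not a measured constant.

```python
GB = 1e9  # using decimal gigabytes for a rough estimate

def qlora_vram_gb(n_params=8e9, n_lora=20e6, overhead_gb=5.5):
    """Back-of-envelope QLoRA memory estimate (overhead_gb is assumed)."""
    weights = n_params * 0.5 / GB    # INT4 base weights: 0.5 bytes/param
    adapters = n_lora * 2 / GB       # LoRA adapter weights in FP16
    grads = n_lora * 2 / GB          # gradients exist only for the adapters
    optimiser = n_lora * 8 / GB      # AdamW: two FP32 states per trainable param
    return weights + adapters + grads + optimiser + overhead_gb

print(round(qlora_vram_gb(), 1))  # roughly 10 GB, in line with the table
```

The key takeaway is that the frozen INT4 weights (~4 GB) dominate; the trainable adapters and their optimiser states add well under 1 GB.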

Setup and Configuration

The following configuration uses the Hugging Face PEFT library with the transformers trainer. Install on your PyTorch-enabled GPU server.

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in INT4 for QLoRA; drop quantization_config for FP16 LoRA
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,             # LoRA rank
    lora_alpha=32,    # scaling factor, commonly set to 2x the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 19,988,480 (0.24% of 8B)

Key hyperparameters to tune:

  • Rank (r): 16 is a good default. Increase to 32-64 for complex domain adaptation; decrease to 8 for simple style transfer.
  • Target modules: attention projections (q, k, v, o) are standard. Adding MLP layers (gate_proj, up_proj, down_proj) improves results but increases VRAM.
  • Learning rate: 1e-4 to 3e-4 for LoRA; 2e-4 is a reliable starting point.
  • Batch size: use gradient accumulation to simulate larger batches without increasing VRAM. Effective batch size of 32-64 typically works well.
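The learning-rate and batch-size advice above translates directly into a trainer configuration. This is a minimal sketch using Hugging Face `TrainingArguments`; the output directory is a placeholder, and the values simply illustrate an effective batch size of 32 (4 per device x 8 accumulation steps).

```python
from transformers import TrainingArguments

# Illustrative values only: effective batch = 4 x 8 = 32
training_args = TrainingArguments(
    output_dir="llama3-lora-out",       # placeholder path
    per_device_train_batch_size=4,      # limited by VRAM
    gradient_accumulation_steps=8,      # simulates a larger batch for free
    learning_rate=2e-4,                 # reliable LoRA starting point
    num_train_epochs=3,
    bf16=True,                          # mixed precision on Ampere+ GPUs
    logging_steps=10,
)
```

Gradient accumulation is the lever to reach for first when tuning batch size: it trades wall-clock time for VRAM, whereas raising `per_device_train_batch_size` raises peak memory directly.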

Training Time by GPU

Estimated training time for 1,000 examples at sequence length 512, 3 epochs, using QLoRA (r=16, batch=4 with gradient accumulation).

| GPU | VRAM | QLoRA Time | LoRA (FP16) Time |
|---|---|---|---|
| RTX 4060 Ti | 16 GB | ~45 min | N/A (insufficient VRAM) |
| RTX 3090 | 24 GB | ~25 min | ~35 min |
| RTX 5080 | 16 GB | ~20 min | N/A (insufficient VRAM) |
| RTX 5090 | 32 GB | ~12 min | ~18 min |
| RTX 6000 Pro 96 GB | 96 GB | ~8 min | ~10 min |

Scaling to larger datasets: 10K examples takes roughly 10x the above times, and 100K examples takes ~100x. For detailed timing across more GPUs, see our fine-tuning time by GPU benchmarks. For GPU selection advice, check the best GPU for fine-tuning LLMs guide.

Cost Estimates

Based on GigaGPU hourly rates for dedicated GPU servers.

| GPU | Approx. Hourly Rate | Cost for 1K Examples | Cost for 10K Examples |
|---|---|---|---|
| RTX 4060 Ti | ~£0.10/hr | ~£0.08 | ~£0.75 |
| RTX 3090 | ~£0.15/hr | ~£0.06 | ~£0.63 |
| RTX 5090 | ~£0.35/hr | ~£0.07 | ~£0.70 |
| RTX 6000 Pro 96 GB | ~£1.20/hr | ~£0.16 | ~£1.60 |

The RTX 3090 offers the best cost-efficiency for QLoRA fine-tuning — fast enough to avoid wasting time, affordable enough to keep costs under a pound for most datasets.
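Because QLoRA training time scales roughly linearly with dataset size, the table figures can be extrapolated with a one-line model. This is a sketch under that linear-scaling assumption; real runs vary with sequence length, hardware utilisation, and checkpointing overhead.

```python
def training_estimate(minutes_per_1k, n_examples, hourly_rate_gbp):
    """Linear-scaling estimate of fine-tuning time (min) and cost (GBP)."""
    minutes = minutes_per_1k * n_examples / 1000
    cost = minutes / 60 * hourly_rate_gbp
    return minutes, cost

# RTX 3090, QLoRA: ~25 min per 1K examples at ~£0.15/hr
minutes, cost = training_estimate(25, 10_000, 0.15)
# roughly 250 minutes and ~£0.62-0.63, matching the 10K-example column above
```

The same function makes it easy to see why a pricier GPU can still win on cost: halving the per-1K time offsets a doubled hourly rate exactly.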

Conclusion

Fine-tuning LLaMA 3 8B with QLoRA is accessible on GPUs as modest as the RTX 4060 Ti (16 GB), and costs under £1 for datasets up to 10K examples. LoRA at FP16 requires 24+ GB but produces slightly better quality adapters. For production fine-tuning workflows, a dedicated LLaMA hosting server with an RTX 3090 or RTX 5090 is the sweet spot between cost and performance.

Fine-Tune LLaMA 3 on Dedicated GPUs

GPU servers with PyTorch, CUDA, and PEFT pre-installed. Ready for LoRA and QLoRA fine-tuning out of the box.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
