Why Fine-Tune LLaMA 3 8B
LLaMA 3 8B is one of the best models for fine-tuning: large enough to perform well on domain-specific tasks, small enough to train on a single dedicated GPU server. LoRA (Low-Rank Adaptation) makes this practical by training only a small fraction of the parameters, cutting VRAM requirements by 60-80% compared to full fine-tuning. This guide covers everything from hardware selection to cost estimation.
For baseline inference requirements see our LLaMA 3 VRAM requirements guide. For a comparison of fine-tuning methods, read LoRA vs QLoRA vs full fine-tuning.
VRAM Requirements by Method
Fine-tuning VRAM includes the model weights, optimiser states, gradients, and activations. LoRA and QLoRA dramatically reduce this by only training small adapter matrices.
| Method | Base Precision | Trainable Params | VRAM (batch=1) | VRAM (batch=4) | Minimum GPU |
|---|---|---|---|---|---|
| Full fine-tuning | FP16 | 8B (100%) | ~65 GB | ~80 GB | 2x RTX 5090 or RTX 6000 Pro 96 GB |
| LoRA (r=16) | FP16 | ~20M (0.25%) | ~22 GB | ~28 GB | RTX 3090 (24 GB) |
| LoRA (r=64) | FP16 | ~80M (1%) | ~24 GB | ~32 GB | RTX 3090 or RTX 5090 |
| QLoRA (r=16) | INT4 | ~20M (0.25%) | ~10 GB | ~14 GB | RTX 4060 Ti (16 GB) |
| QLoRA (r=64) | INT4 | ~80M (1%) | ~12 GB | ~18 GB | RTX 4060 Ti (16 GB) |
QLoRA loads the base model in INT4 (via bitsandbytes) while training the LoRA adapters in FP16. This makes fine-tuning possible on GPUs with as little as 10 GB VRAM. For broader VRAM planning, see our fine-tuning VRAM calculator.
Setup and Configuration
The following configuration uses the Hugging Face PEFT library with the transformers trainer. Install both libraries (`pip install peft transformers bitsandbytes`) on your PyTorch-enabled GPU server.
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    load_in_4bit=True,  # Remove for FP16 LoRA
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 19,988,480 (0.24% of 8B)
```
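Note that recent transformers releases deprecate the bare `load_in_4bit=True` argument in favour of an explicit `BitsAndBytesConfig`. A sketch of the equivalent NF4 setup, which is the quantisation recipe QLoRA typically uses (the `bnb_config` name is ours):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantisation with double quantisation
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on pre-Ampere GPUs
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The rest of the LoRA configuration above is unchanged; only the model loading step differs.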
Key hyperparameters to tune:
- Rank (r): 16 is a good default. Increase to 32-64 for complex domain adaptation; decrease to 8 for simple style transfer.
- Target modules: attention projections (q, k, v, o) are standard. Adding MLP layers (gate_proj, up_proj, down_proj) improves results but increases VRAM.
- Learning rate: 1e-4 to 3e-4 for LoRA; 2e-4 is a reliable starting point.
- Batch size: use gradient accumulation to simulate larger batches without increasing VRAM. Effective batch size of 32-64 typically works well.
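As a sanity check on the last point, the accumulation step count is simply the ratio of the target effective batch size to what fits in VRAM (the helper name is ours, for illustration):

```python
def grad_accum_steps(effective_batch: int, per_device_batch: int) -> int:
    """Optimiser-step accumulation needed to reach an effective batch size."""
    if effective_batch % per_device_batch != 0:
        raise ValueError("effective batch must be a multiple of per-device batch")
    return effective_batch // per_device_batch

# batch=4 fits on a 24 GB card; accumulate 8 steps for an effective batch of 32
print(grad_accum_steps(32, 4))  # → 8
```

The result goes into `gradient_accumulation_steps` in `TrainingArguments`; VRAM usage stays at the batch=4 level while the optimiser sees updates equivalent to batch 32.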
Training Time by GPU
Estimated training time for 1,000 examples at sequence length 512, 3 epochs, using QLoRA (r=16, batch=4 with gradient accumulation).
| GPU | VRAM | QLoRA Time | LoRA (FP16) Time |
|---|---|---|---|
| RTX 4060 Ti | 16 GB | ~45 min | N/A (insufficient VRAM) |
| RTX 3090 | 24 GB | ~25 min | ~35 min |
| RTX 5080 | 16 GB | ~20 min | N/A (insufficient VRAM) |
| RTX 5090 | 32 GB | ~12 min | ~18 min |
| RTX 6000 Pro 96 GB | 96 GB | ~8 min | ~10 min |
Scaling to larger datasets: 10K examples takes roughly 10x the above times, and 100K examples takes ~100x. For detailed timing across more GPUs, see our fine-tuning time by GPU benchmarks. For GPU selection advice, check the best GPU for fine-tuning LLMs guide.
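The linear scaling rule above can be captured in a throwaway helper (names are ours):

```python
def scaled_minutes(base_minutes_per_1k: float, num_examples: int) -> float:
    """Training time scales roughly linearly with dataset size."""
    return base_minutes_per_1k * num_examples / 1_000

# RTX 3090 QLoRA: ~25 min per 1K examples (from the table above)
print(scaled_minutes(25, 10_000))   # → 250.0 minutes (~4.2 hours)
print(scaled_minutes(25, 100_000))  # → 2500.0 minutes (~42 hours)
```

This is an approximation: tokenisation overhead and checkpoint saves add a few minutes regardless of dataset size, so very small runs skew slightly higher.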
Cost Estimates
Based on GigaGPU hourly rates for dedicated GPU servers and the QLoRA training times above.
| GPU | Approx. Hourly Rate | Cost for 1K Examples | Cost for 10K Examples |
|---|---|---|---|
| RTX 4060 Ti | ~£0.10/hr | ~£0.08 | ~£0.75 |
| RTX 3090 | ~£0.15/hr | ~£0.06 | ~£0.63 |
| RTX 5090 | ~£0.35/hr | ~£0.07 | ~£0.70 |
| RTX 6000 Pro 96 GB | ~£1.20/hr | ~£0.16 | ~£1.60 |
The RTX 3090 offers the best cost-efficiency for QLoRA fine-tuning — fast enough to avoid wasting time, affordable enough to keep costs under a pound for most datasets.
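The table values are simple arithmetic: training hours multiplied by the hourly rate. A small helper (ours, for illustration) for estimating your own runs:

```python
def training_cost_gbp(minutes: float, hourly_rate_gbp: float) -> float:
    """Cost of a training run billed by the hour, rounded to the nearest penny."""
    return round(minutes / 60 * hourly_rate_gbp, 2)

# RTX 4060 Ti at ~£0.10/hr: ~45 min for 1K examples, ~450 min for 10K
print(training_cost_gbp(45, 0.10))   # → 0.08
print(training_cost_gbp(450, 0.10))  # → 0.75
```

Note this covers compute time only; billing granularity (most providers bill by the hour, not the minute) can raise the real cost of short runs.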
Conclusion
Fine-tuning LLaMA 3 8B with QLoRA is accessible on GPUs as modest as the RTX 4060 Ti (16 GB), and costs under £1 for datasets up to 10K examples. LoRA at FP16 requires 24+ GB but produces slightly better quality adapters. For production fine-tuning workflows, a dedicated LLaMA hosting server with an RTX 3090 or RTX 5090 is the sweet spot between cost and performance.
Fine-Tune LLaMA 3 on Dedicated GPUs
GPU servers with PyTorch, CUDA, and PEFT pre-installed. Ready for LoRA and QLoRA fine-tuning out of the box.
Browse GPU Servers