Fine-tuning a 7B–8B model on a 16 GB GPU used to be painful: full SFT won't fit, and plain LoRA on a 16-bit base is tight. QLoRA — a 4-bit quantised base plus bf16 LoRA adapters — makes 7B fine-tuning genuinely comfortable on a 5060 Ti. This is the playbook.
QLoRA on a 5060 Ti can fine-tune Llama 3.1 8B / Mistral 7B / Qwen 2.5 7B with rank 64 on a typical SFT dataset (10K samples, 2K context) in ~6 hours. Peak VRAM ~13 GB. Adapter is 100–400 MB; merge back to base for inference.
VRAM budget for QLoRA on 16 GB
| Component | Llama 3.1 8B QLoRA r=64 | Llama 3.1 8B QLoRA r=128 |
|---|---|---|
| Base model (NF4 quant) | 5 GB | 5 GB |
| LoRA adapters (BF16) | ~340 MB | ~670 MB |
| Optimizer states (paged 8-bit AdamW) | ~340 MB | ~670 MB |
| Gradients (BF16) | ~340 MB | ~670 MB |
| Activations (seq=2048, batch=4) | ~6 GB | ~6.5 GB |
| Peak VRAM | ~12 GB | ~13 GB |
Comfortable on a 16 GB card. For r=256 or batch=8 you start scraping the limit.
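The adapter, optimizer, and gradient rows fall straight out of the trainable-parameter count, which you can sanity-check yourself. A back-of-envelope sketch, using the dimensions from the Llama 3.1 8B config (hidden 4096, intermediate 14336, GQA KV dim 1024, 32 layers) and all seven projections targeted as below:

```python
# Back-of-envelope LoRA memory for Llama 3.1 8B, all 7 linear projections.
# Each LoRA pair adds r * (d_in + d_out) parameters per targeted module.
r, layers, hidden, inter, kv = 64, 32, 4096, 14336, 1024
per_layer = (
    2 * r * (hidden + hidden)   # q_proj, o_proj: 4096 -> 4096
    + 2 * r * (hidden + kv)     # k_proj, v_proj: 4096 -> 1024 (GQA)
    + 3 * r * (hidden + inter)  # gate_proj, up_proj, down_proj
)
params = per_layer * layers
print(f"{params/1e6:.0f} M params, "       # ≈ 168 M trainable params
      f"{params*2/1e6:.0f} MB bf16")       # ≈ 340 MB; grads and 8-bit Adam
                                           # states land in the same ballpark
```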
Setup: bitsandbytes + transformers + PEFT
```bash
pip install transformers==4.45 peft==0.13 trl==0.11 \
    bitsandbytes==0.44 accelerate==1.0 datasets==3.0
```
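Before kicking off a multi-hour run, a ten-second pre-flight check that the stack sees the GPU and that bf16 is available is cheap insurance (the 5060 Ti supports bf16; pre-Ampere cards would need fp16 instead):

```python
import torch

# Pre-flight: CUDA device visible and bf16 supported
assert torch.cuda.is_available(), "no CUDA device found"
print(torch.cuda.get_device_name(0),
      "| bf16 supported:", torch.cuda.is_bf16_supported())
```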
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
import torch
bnb_cfg = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3.1-8B-Instruct",
quantization_config=bnb_cfg,
device_map="auto",
)
# Casts norms to fp32, enables gradient checkpointing and input grads
model = prepare_model_for_kbit_training(model)
lora_cfg = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj","k_proj","v_proj","o_proj",
"gate_proj","up_proj","down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tok.pad_token = tok.eos_token  # Llama 3.1 ships without a pad token
ds = load_dataset("HuggingFaceH4/no_robots", split="train")
train_cfg = SFTConfig(
    output_dir="./out",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size 16
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=100,
    optim="paged_adamw_8bit",           # 8-bit states, paged to CPU under pressure
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)
trainer = SFTTrainer(model=model, tokenizer=tok, train_dataset=ds, args=train_cfg)
trainer.train()
trainer.save_model("./adapter")
```
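After training finishes, one line tells you how your run compares against the VRAM budget table above:

```python
import torch

# Peak memory PyTorch actually allocated on the GPU during the run
# (torch.cuda.max_memory_reserved() is closer to what nvidia-smi shows)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```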
Hyperparameters that matter
- r=64 — sweet spot for instruction fine-tuning. r=16 underfits, r=256 overfits and costs more memory.
- lora_alpha = 2 × r — standard formula. With r=64, alpha=128.
- target_modules: all linear projections — q,k,v,o,gate,up,down. Skipping gate/up/down for "memory savings" is a false economy on 16 GB (see the shorthand sketch after this list).
- learning_rate=2e-4 — the standard QLoRA setting. Schedule: linear warmup for 100 steps, then constant (lr_scheduler_type="constant_with_warmup", warmup_steps=100 in the config above).
- gradient_accumulation_steps=4 — gives effective batch size 16 with per_device_batch=4.
- paged_adamw_8bit — non-negotiable. 8-bit states are ~4× smaller than fp32 AdamW's (roughly 1 GB saved at r=64), and paging spills allocation spikes to CPU RAM instead of OOMing.
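One convenience on the target-modules point: PEFT (including the 0.13 pinned above) accepts the shorthand target_modules="all-linear", which matches every linear layer except the LM head — on Llama-style models that is exactly the seven projections listed. A sketch of the equivalent config:

```python
from peft import LoraConfig

# Equivalent to listing q/k/v/o/gate/up/down explicitly on Llama-style models;
# "all-linear" targets every Linear layer except the output head.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```

Handy when you move between architectures whose projection names differ.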
Training time on real datasets
| Dataset | Samples | Seq len | Wall time on 5060 Ti | Final loss |
|---|---|---|---|---|
| no_robots | 10K | 2048 | ~6 h | ~1.05 |
| Alpaca-cleaned | 52K | 512 | ~14 h | ~0.9 |
| UltraChat-200K | 200K (sample 30K) | 2048 | ~16 h | ~0.95 |
| Custom domain (5K samples) | 5K | 4096 | ~5 h | ~0.6 |
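These wall-times imply a throughput you can sanity-check. Treating every no_robots sample as a full 2K-token sequence (an upper bound — most are shorter) gives roughly:

```python
# Back-of-envelope throughput for the no_robots row (upper bound)
samples, epochs, seq_len, hours = 10_000, 3, 2048, 6
tokens = samples * epochs * seq_len                   # ≈ 61 M tokens over the run
print(f"≈ {tokens / (hours * 3600):,.0f} tokens/s")   # ≈ 2,800 tokens/s
```

If your dataset's average sequence length differs, scale the table's wall-times accordingly.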
Merging the adapter back to base
For inference deployment, merge the LoRA back into the base model and serve with vLLM:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Loads the bf16 base on CPU — merging needs system RAM (~16 GB), not VRAM
base = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3.1-8B-Instruct",
torch_dtype="bfloat16",
)
merged = PeftModel.from_pretrained(base, "./adapter").merge_and_unload()
merged.save_pretrained("./llama-3.1-8b-tuned")
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct") \
    .save_pretrained("./llama-3.1-8b-tuned")
```
Then deploy with vLLM as you would any 8B model. Or skip the merge and use vLLM’s --enable-lora flag to serve the adapter directly — useful if you have multiple per-customer LoRAs.
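For reference, both deployment shapes as vLLM invocations (the adapter name customer-a is just an example label):

```bash
# Option 1: serve the merged model like any 8B checkpoint
vllm serve ./llama-3.1-8b-tuned --max-model-len 4096

# Option 2: serve base + adapter directly, no merge (multi-adapter friendly)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora --lora-modules customer-a=./adapter
```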
Verdict
The RTX 5060 Ti 16 GB is the cheapest dedicated GPU we host that can run production-quality QLoRA fine-tuning of 7B–8B models. Kicking off a run in the evening and collecting the adapter in the morning is the typical pattern. For larger models (14B+) or full SFT, step up to a 5090 or a 6000 Pro.
Bottom line
QLoRA at r=64 on a 5060 Ti is the right entry-tier fine-tuning workflow for 7B–8B models: ~6 hours of training, <13 GB peak VRAM, and an adapter of a few hundred MB. For more on the broader fine-tuning landscape, see 5060 Ti fine-tune throughput.