Tutorials

QLoRA Fine-Tuning on the RTX 5060 Ti 16 GB: A Practical Guide for 7B Models

How to fine-tune Llama 3.1 8B, Mistral 7B and Qwen 2.5 7B on a single RTX 5060 Ti 16 GB using QLoRA, with the flags, hyperparameters and gotchas we have actually run into.

Fine-tuning on a 16 GB GPU used to be painful: full SFT won't fit, and even 16-bit LoRA is tight. QLoRA, a 4-bit quantised base plus bf16 LoRA adapters, makes 7B–8B fine-tuning genuinely comfortable on a 5060 Ti. This is the playbook.

TL;DR

QLoRA on a 5060 Ti can fine-tune Llama 3.1 8B / Mistral 7B / Qwen 2.5 7B with rank 64 on a typical SFT dataset (10K samples, 2K context) in ~6 hours. Peak VRAM ~13 GB. Adapter is 100–400 MB; merge back to base for inference.

VRAM budget for QLoRA on 16 GB

Component                            | Llama 3.1 8B r=64 | Llama 3.1 8B r=128
Base model (NF4 quant)               | 5 GB              | 5 GB
LoRA adapters (BF16)                 | ~140 MB           | ~280 MB
Optimizer states (paged 8-bit AdamW) | ~280 MB           | ~560 MB
Gradients                            | ~140 MB           | ~280 MB
Activations (seq=2048, batch=4)      | ~6 GB             | ~6.5 GB
Peak VRAM                            | ~12 GB            | ~13 GB

Comfortable on a 16 GB card. For r=256 or batch=8 you start scraping the limit.
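The budget above is simple addition, so it is easy to sanity-check. The figures below are taken from the table (approximations, not measurements):

```python
# Back-of-envelope QLoRA VRAM budget for Llama 3.1 8B at r=64.
# All figures in GB, copied from the table above.
base_nf4      = 5.0    # 8B weights in NF4 with double quantisation
adapters_bf16 = 0.14   # LoRA A/B matrices at rank 64, bf16
optimizer     = 0.28   # paged 8-bit AdamW states for the adapters only
gradients     = 0.14   # gradients exist only for the adapters
activations   = 6.0    # seq=2048, batch=4, gradient checkpointing on

peak = base_nf4 + adapters_bf16 + optimizer + gradients + activations
print(f"~{peak:.1f} GB peak")  # ~11.6 GB -- the table's ~12 GB with rounding headroom
```

Note how little of the budget the adapters themselves occupy: almost all the VRAM goes to the quantised base and to activations, which is why rank can be raised fairly cheaply.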

Setup: bitsandbytes + transformers + PEFT

pip install transformers==4.45 peft==0.13 trl==0.11 \
  bitsandbytes==0.44 accelerate==1.0 datasets==3.0
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_cfg,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tok.pad_token = tok.eos_token  # Llama 3.1 ships without a pad token -- a classic gotcha
ds = load_dataset("HuggingFaceH4/no_robots", split="train")

# SFTTrainer expects a TrainingArguments/SFTConfig object, not a plain dict
from trl import SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=ds,
    args=SFTConfig(
        output_dir="./out",
        max_seq_length=2048,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="constant_with_warmup",
        warmup_steps=100,
        optim="paged_adamw_8bit",
        bf16=True,
        fp16=False,
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()
trainer.save_model("./adapter")

Hyperparameters that matter

  • r=64 — sweet spot for instruction fine-tuning. r=16 underfits, r=256 overfits and costs more memory.
  • lora_alpha = 2 × r — standard formula. With r=64, alpha=128.
  • target_modules: all linear projections — q,k,v,o,gate,up,down. Skipping gate/up/down for "memory savings" is a false economy on 16 GB.
  • learning_rate=2e-4 — standard QLoRA. Schedule: linear warmup 100 steps, then constant.
  • gradient_accumulation_steps=4 — gives effective batch size 16 with per_device_batch=4.
  • paged_adamw_8bit — use it. 8-bit states are roughly a quarter the size of fp32 AdamW, and the paged variant spills transient spikes to CPU RAM instead of OOM-ing.
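The batch-size bullet is worth making concrete. With per_device_train_batch_size=4 and gradient_accumulation_steps=4, the optimiser sees an effective batch of 16, which fixes the step count for a given dataset (figures below use the no_robots run):

```python
# Effective batch size and optimiser step count for the no_robots run
samples, epochs = 10_000, 3
per_device_bs, grad_accum = 4, 4

effective_bs = per_device_bs * grad_accum   # 16 sequences per optimiser step
steps_per_epoch = samples // effective_bs   # 625
total_steps = steps_per_epoch * epochs      # 1875

print(effective_bs, total_steps)  # 16 1875
```

So the 100 warmup steps from the learning-rate bullet cover roughly the first 5% of training, which is about where you want them.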

Training time on real datasets

Dataset        | Samples            | Seq len | Wall time on 5060 Ti | Final loss
no_robots      | 10K                | 2048    | ~6 h                 | ~1.05
Alpaca-cleaned | 52K                | 512     | ~14 h                | ~0.9
UltraChat-200K | 200K (30K sampled) | 2048    | ~16 h                | ~0.95
Custom domain  | 5K                 | 4096    | ~5 h                 | ~0.6
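A rough way to extrapolate these wall times to your own dataset, assuming throughput in the neighbourhood of the no_robots row (~2.8K tokens/s is our back-derived figure, not a benchmark; shorter sequences train less efficiently per token, so treat this as optimistic for seq<1K):

```python
# Hypothetical wall-time estimator, calibrated only to the no_robots row above.
def est_hours(samples: int, seq_len: int, epochs: int, tok_per_s: float = 2_800) -> float:
    total_tokens = samples * seq_len * epochs
    return total_tokens / tok_per_s / 3600

print(f"{est_hours(10_000, 2048, 3):.1f} h")  # ~6.1 h, matching the table
```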

Merging the adapter back to base

For inference deployment, merge the LoRA back into the base model and serve with vLLM:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,  # merge into the full-precision base, not the 4-bit quant
)
merged = PeftModel.from_pretrained(base, "./adapter").merge_and_unload()
merged.save_pretrained("./llama-3.1-8b-tuned")

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tok.save_pretrained("./llama-3.1-8b-tuned")

Then deploy with vLLM as you would any 8B model. Or skip the merge and use vLLM’s --enable-lora flag to serve the adapter directly — useful if you have multiple per-customer LoRAs.
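For the adapter-serving route, the invocation looks roughly like this (the module name mylora is ours, and check the flags against your vLLM version):

```shell
# Serve the base model with the LoRA adapter attached at request time.
# --max-lora-rank must cover the adapter's rank: vLLM's default is 16, ours is 64.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules mylora=./adapter \
  --max-lora-rank 64
```

Requests that set "model": "mylora" get the adapter; requests naming the base model skip it, so one server can A/B the tuned and untuned behaviour.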

Verdict

The RTX 5060 Ti 16 GB is the cheapest dedicated GPU we host that runs production-quality QLoRA fine-tuning of 7B–8B models. Training overnight is the typical pattern. For larger models (14B+) or full SFT, step up to a 5090 or 6000 Pro.

Bottom line

QLoRA at r=64 on a 5060 Ti is the right entry-tier fine-tuning workflow for 7B–8B models. ~6 hours of training, <13 GB peak VRAM, 100 MB adapter. For more on the broader fine-tuning landscape see 5060 Ti fine-tune throughput.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
