QLoRA (4-bit base plus BF16 LoRA adapters) is the popular choice on 16 GB cards. Plain LoRA, which keeps the base un-quantised, is sometimes preferable when you have the VRAM and want maximum quality. On the 5060 Ti the VRAM math is right at the edge.
Plain LoRA on the 5060 Ti works for 3B-class models (Phi-3 Mini) at FP16, or 7B-class models at FP8. Beyond that, QLoRA is the right path. See the QLoRA guide for the more common workflow.
LoRA vs QLoRA on 16 GB
QLoRA quantises the base model to 4-bit, drastically reducing memory but introducing a small quality drop. Plain LoRA keeps the base in BF16/FP16, preserving full quality at higher VRAM cost.
| Approach | Base precision | 7B model peak VRAM | Quality vs full fine-tune |
|---|---|---|---|
| Full SFT | BF16 | ~80 GB | Reference |
| LoRA | BF16 | ~24 GB | ~95-99% |
| LoRA (FP8 base) | FP8 | ~14 GB | ~95-99% |
| QLoRA | 4-bit NF4 | ~12 GB | ~92-97% |
Plain LoRA with an FP8 base fits 7B models on a 16 GB card. Quality is marginally better than QLoRA, at the cost of slightly more VRAM and slightly slower training.
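To make the difference concrete, here is a minimal sketch of the two load paths using Hugging Face transformers; the NF4 path relies on bitsandbytes, and the model name simply mirrors the Setup section below. Load only one of the two in practice.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Plain LoRA: base weights stay in BF16 (or FP8 on checkpoints/hardware that support it).
lora_base = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# QLoRA: base weights quantised to 4-bit NF4; the LoRA adapters still train in BF16.
qlora_base = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)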
VRAM math
Plain LoRA r=64 on Llama 3.1 8B at FP8:
- Base model FP8: ~8 GB
- LoRA adapters BF16: ~140 MB
- Optimizer states (AdamW 8-bit): ~280 MB
- Activations (seq=2048, batch=2): ~5 GB
- Peak: ~14 GB
Tight, but it fits. Reduce the batch size or sequence length if you hit OOM. A back-of-envelope check of these numbers is sketched below.
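The adapter and optimizer figures can be sanity-checked from the layer shapes. The snippet below is a rough estimate assuming Llama 3.1 8B's published dimensions (hidden size 4096, GQA KV dim 1024, 32 layers); exact numbers vary with framework overhead.

# Back-of-envelope VRAM estimate for LoRA r=64 on the q/k/v/o projections.
# Dimensions are Llama 3.1 8B's published shapes; totals are rough estimates.
hidden, kv_dim, layers, r = 4096, 1024, 32, 64

# Each targeted projection adds A (r x in) and B (out x r): r * (in + out) params.
per_layer = r * (hidden + hidden)        # q_proj: 4096 -> 4096
per_layer += 2 * r * (hidden + kv_dim)   # k_proj, v_proj: 4096 -> 1024 (GQA)
per_layer += r * (hidden + hidden)       # o_proj: 4096 -> 4096
lora_params = per_layer * layers         # ~55M trainable parameters

adapters_gb = lora_params * 2 / 1e9            # BF16 weights: 2 bytes/param
grads_optim_gb = lora_params * (2 + 2) / 1e9   # BF16 grads + two 8-bit AdamW states
base_gb = 8.0                                  # ~8B params at ~1 byte each (FP8)
activations_gb = 5.0                           # ballpark for seq=2048, batch=2

print(f"{base_gb + adapters_gb + grads_optim_gb + activations_gb:.1f} GB peak (approx.)")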
Setup
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer

# Load base in BF16 (or an FP8-quantised checkpoint if available, per the VRAM math above)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

lora_cfg = LoraConfig(
    r=64, lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

training_args = SFTConfig(
    output_dir="llama31-lora",  # any local path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=1e-4,
    optim="adamw_8bit",
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    # ... dataset, tokenizer
)
trainer.train()
Training time
Llama 3.1 8B, 10K samples, 2K seq len, 5060 Ti:
- Plain LoRA: ~7 hours
- QLoRA equivalent: ~6 hours
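A quick arithmetic check on what those wall-clock figures imply per step, using the batch settings from the Setup section; this is derived from the numbers above, not an independent measurement.

# Optimizer steps for 10K samples over 3 epochs at effective batch 16 (2 x grad-accum 8).
samples, epochs = 10_000, 3
effective_batch = 2 * 8
steps = samples * epochs // effective_batch   # 1875 optimizer steps
plain_lora_hours = 7.0                        # figure quoted above for the 5060 Ti
print(f"{plain_lora_hours * 3600 / steps:.1f} s per optimizer step")  # ~13.4 s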
Verdict
Plain LoRA on a 5060 Ti is the right choice when the marginal quality gain matters and you can afford the tighter memory budget. For most workloads, QLoRA is cheaper and nearly as good.
Bottom line
Plain LoRA on a 5060 Ti is workable for fine-tuning 7B models with an FP8 base, but it is tight. For the standard recipe, see our QLoRA guide.