LoRA is the default parameter-efficient fine-tuning method in 2026 – it trains small low-rank adapter matrices rather than updating the full base model, slashing VRAM and training time. On the RTX 5060 Ti 16GB via our dedicated GPU hosting you can LoRA-fine-tune Llama 3 8B or Mistral 7B overnight on a few thousand examples. This guide walks through the full pipeline end to end.
Contents
- Why LoRA on 16 GB
- Data preparation
- Training config
- Training code
- Expected wall-clock
- Merge and deploy
Why LoRA on 16 GB
A full fine-tune of an 8B model needs roughly 70 GB of VRAM for BF16 weights, gradients, optimiser state and activations. LoRA freezes the base model and trains two low-rank matrices per targeted projection – VRAM drops to ~13 GB for Llama 3 8B in BF16, comfortably fitting 16 GB.
| Component | Full fine-tune | LoRA (r=16) |
|---|---|---|
| Base weights | 16 GB BF16 | 16 GB BF16 (frozen) |
| Gradients | 16 GB | ~84 MB |
| Optimiser state (AdamW) | 32 GB | ~168 MB |
| Trainable params | 8 B | ~42 M |
| Activations (batch 2, 4k) | ~6 GB | ~2 GB with checkpointing |
| Total VRAM | ~70 GB | ~13 GB |
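The adapter size can be sanity-checked with quick arithmetic. A back-of-envelope sketch using the published Llama 3 8B dimensions (hidden size 4096, 8 KV heads of dim 128, MLP width 14336, 32 layers); each adapted projection W (d_out × d_in) gains two matrices A (r × d_in) and B (d_out × r), i.e. r × (d_in + d_out) extra parameters:

```python
HIDDEN, KV, MLP, LAYERS, R = 4096, 1024, 14336, 32, 16  # Llama 3 8B dims, rank 16

projections = {            # (d_in, d_out) per decoder layer
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV),     # grouped-query attention: 8 KV heads * 128
    "v_proj":    (HIDDEN, KV),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj":   (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(R * (d_in + d_out) for d_in, d_out in projections.values())
total = per_layer * LAYERS
print(f"{total:,} trainable params")            # ~42 M
print(f"BF16 gradients: {total * 2 / 1e6:.0f} MB")
```

Gradients for ~42M BF16 parameters are ~84 MB; AdamW's two moment buffers in BF16 add roughly twice that again – rounding noise next to the 16 GB base.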
Data Preparation
Collect 500-5,000 instruction-response pairs. The OpenAI-style messages (ChatML) format is the most portable:
```json
[
  {"messages": [
    {"role": "system", "content": "You are a friendly support agent."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings..."}
  ]},
  ...
]
```
Save as train.jsonl (one example per line) plus a held-out eval.jsonl of 50-200 examples. Dedupe aggressively – duplicated prompts cause overfitting far faster than many realise.
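A minimal dedupe-and-split sketch using only the standard library (the exact-match key on the user turns is an illustrative heuristic, not the only reasonable one):

```python
import json
import random

def dedupe_and_split(examples, eval_n=100, seed=42):
    """Drop examples whose user turns repeat, then split off a held-out eval set."""
    seen, unique = set(), []
    for ex in examples:
        key = " ".join(m["content"] for m in ex["messages"] if m["role"] == "user")
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    random.Random(seed).shuffle(unique)
    return unique[eval_n:], unique[:eval_n]  # (train, eval)

def write_jsonl(path, rows):
    """One JSON object per line, as expected by load_dataset('json', ...)."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# train, held_out = dedupe_and_split(all_examples)
# write_jsonl("train.jsonl", train); write_jsonl("eval.jsonl", held_out)
```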
Training Config
| Hyperparameter | Value | Why |
|---|---|---|
| Base model | Llama 3 8B Instruct | Fits 16 GB in BF16 with LoRA |
| LoRA rank (r) | 16 | Sweet spot for <5k examples |
| LoRA alpha | 32 | Standard 2x rank ratio |
| LoRA dropout | 0.05 | Mild regularisation |
| Target modules | q,k,v,o,gate,up,down | All linear layers |
| Max seq length | 4096 | Fits most chat |
| Batch size | 2 | With grad accum = 4 -> effective 8 |
| Learning rate | 2e-4 | Typical LoRA range |
| Epochs | 3 | Watch eval loss |
| Precision | BF16 | Blackwell-native |
| Gradient checkpointing | Unsloth | Saves ~40% activations |
Training Code
Use Unsloth for Blackwell-optimised kernels (roughly 2x vanilla PEFT speed):
```python
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load the BF16 base model (set load_in_4bit=True for QLoRA if you hit OOM)
model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=False,
)

# Attach LoRA adapters to all linear projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

ds = load_dataset("json", data_files={"train": "train.jsonl", "eval": "eval.jsonl"})

trainer = SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=ds["train"], eval_dataset=ds["eval"],
    args=SFTConfig(
        output_dir="./llama3-lora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size 8
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        eval_strategy="epoch",
        save_strategy="epoch",
        warmup_ratio=0.03,
    ),
)
trainer.train()
model.save_pretrained("./llama3-lora/final")
```
Expected Wall-Clock
| Dataset size | Tokens | Time per epoch | 3 epochs |
|---|---|---|---|
| 500 examples | ~0.5 M | ~6 min | ~18 min |
| 2,000 examples | ~2 M | ~25 min | ~75 min |
| 5,000 examples | ~5 M | ~60 min | ~3 h |
| 20,000 examples | ~20 M | ~4 h | ~12 h (overnight) |
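The table implies a throughput of roughly 1,400 tokens/s. A hypothetical helper to project epoch time for your own dataset – both defaults are assumptions read off the table, not measured benchmarks for your workload:

```python
def epoch_minutes(n_examples, avg_tokens_per_example=1000, tokens_per_sec=1400):
    """Rough wall-clock estimate for one epoch at an assumed training throughput."""
    total_tokens = n_examples * avg_tokens_per_example
    return total_tokens / tokens_per_sec / 60

print(f"{epoch_minutes(2000):.0f} min per epoch")  # ~24 min, close to the table's ~25
```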
Merge and Deploy
Two options for serving. Merge for single-adapter deployments; LoRAX for multi-adapter SaaS:
```bash
# Merge adapter into base for a clean vLLM deployment
python -c "from unsloth import FastLanguageModel; \
m, t = FastLanguageModel.from_pretrained('./llama3-lora/final'); \
m.save_pretrained_merged('./llama3-merged', t, save_method='merged_16bit')"

# Serve the merged model with an OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model ./llama3-merged --served-model-name my-llama
```
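Once running, vLLM exposes the standard OpenAI chat endpoint at `/v1/chat/completions`. A sketch that builds the request payload (the localhost URL and port 8000 assume vLLM's defaults):

```python
def build_chat_request(user_msg, model="my-llama",
                       system="You are a friendly support agent."):
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,  # matches --served-model-name above
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    }

# Send with any HTTP client, e.g.:
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions",
#                   json=build_chat_request("How do I reset my password?"))
# print(r.json()["choices"][0]["message"]["content"])
```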
LoRA Fine-Tune on Blackwell 16 GB
Train Llama 3 8B overnight on dedicated hardware. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: LoRA training speed, QLoRA guide, Unsloth speed, fine-tune throughput, vLLM setup.