LoRA keeps the base weights in FP16 or BF16 and trains small low-rank adapter matrices on top. Quality is usually slightly better than QLoRA, at the cost of more VRAM. All figures below were measured on the RTX 5060 Ti 16GB via our hosting:
Stack
- Transformers 4.46, PEFT 0.13, Accelerate 1.0
- BF16 compute, FP16 base weights
- AdamW 8-bit optimiser
- FlashAttention 2.6
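The sketch below shows one way to wire this stack up with PEFT. The model id, adapter rank and target modules are illustrative assumptions, not the exact benchmark configuration.

```python
# Minimal sketch of the LoRA stack above (Transformers + PEFT, BF16, FlashAttention 2).
# Model id, rank and target modules are assumptions, not the exact benchmark config.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed 8B base model

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # frozen base weights; BF16 here, FP16 also works
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                       # adapter rank -- assumption, tune per task
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of the 8B parameters
```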
Timings (sec/step)
| Model | Seq 1024 | Seq 2048 | Notes |
|---|---|---|---|
| Llama 3 8B (bs=2) | 0.42 | 0.88 | VRAM 14.8 GB – tight |
| Llama 3 8B (bs=1, grad acc 4) | 0.21 | 0.44 | Eff batch = 4, 11.5 GB |
| Mistral 7B (bs=2) | 0.38 | 0.80 | 13.2 GB |
| Phi-3-mini (bs=8) | 0.58 | 1.10 | 9.4 GB |
7-8B LoRA at seq 2048 is tight – batch 1 with gradient accumulation is the practical configuration. Phi-3-mini and smaller models have plenty of headroom.
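A minimal Trainer configuration for that bs=1, grad acc 4 setup might look like the following; the output path, dataset and epoch count are placeholders, and `model` is the PEFT model from the stack sketch above.

```python
# Sketch of the "bs=1, grad acc 4" row from the timings table.
# train_dataset is a placeholder for your tokenised dataset.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="llama3-8b-lora",       # placeholder path
    per_device_train_batch_size=1,     # fits seq 2048 in ~11.5 GB per the table
    gradient_accumulation_steps=4,     # effective batch size = 4
    bf16=True,                         # BF16 compute
    optim="adamw_bnb_8bit",            # 8-bit AdamW via bitsandbytes
    num_train_epochs=1,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```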
LoRA vs QLoRA (Llama 3 8B)
| Metric | LoRA FP16 | QLoRA 4-bit |
|---|---|---|
| Peak VRAM | 11.5 GB | 11.8 GB (more batch possible) |
| Tokens/s @ seq 2048 | ~4,600 | ~4,900 |
| Max seq at bs=2 | 2048 | 4096 |
| Eval loss delta vs full FT | +1.2% | +2.4% |
| Setup simplicity | Simpler | Needs bitsandbytes |
QLoRA is marginally faster on this card because the 4-bit base model lowers memory pressure, leaving headroom to run a larger batch. LoRA retains slightly better quality and simpler tooling.
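For comparison, the main change QLoRA makes on the loading side is quantising the frozen base model with bitsandbytes; the NF4 settings below are the common defaults, not necessarily the exact flags used for the benchmark.

```python
# Sketch of the QLoRA difference: 4-bit base weights via bitsandbytes.
# The PEFT/Trainer setup afterwards is the same as the LoRA sketch above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 is the usual QLoRA choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in BF16
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # assumed model id, as above
    quantization_config=bnb_config,
    device_map="auto",
)
# peft.prepare_model_for_kbit_training(model) is commonly applied before get_peft_model.
```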
When to Use Each
- LoRA: small dataset (<5k samples), quality matters more than speed, smaller models (≤8B)
- QLoRA: larger models (14B+), bigger datasets where iteration speed matters, constrained VRAM
- Unsloth QLoRA: beats both on throughput and is usually the right default
LoRA Training on Blackwell 16GB
Llama 3 8B LoRA in ~0.4 s/step at seq 2048. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: LoRA guide, QLoRA speed, Unsloth, fine-tune throughput, QLoRA guide.