For training jobs the right benchmark is fine-tuning tokens-per-second — the rate at which the model processes training data. Higher is better; lower means longer training. The 5060 Ti’s 16 GB constrains which methods are viable, but within that envelope it has solid throughput.
On the 5060 Ti, QLoRA on Llama 3.1 8B hits ~3,200 fine-tuning tok/s at batch size 4, and LoRA with an FP8 base hits ~2,800. Full SFT of any 7B+ model does not fit in 16 GB. Phi-3 Mini reaches ~7,500 tok/s with QLoRA. Wall time for a typical 10K-sample SFT dataset is ~6 hours.
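For reference, here is a minimal sketch of the kind of QLoRA setup these numbers describe, using transformers, peft, and bitsandbytes. The model ID, rank 64, and NF4 quantization come from the benchmark description; the target modules, dropout, and alpha are illustrative assumptions, not the exact benchmark config.

```python
# Minimal QLoRA setup sketch. Assumptions: target_modules, lora_alpha, and
# dropout are illustrative; only the model, rank, and NF4 quantization come
# from the benchmark description.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.1-8B"  # model from the benchmark table

# NF4 4-bit quantization of the frozen base model (the "QLoRA (NF4 base)" row)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters at rank 64, matching the "QLoRA r=64" rows below
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```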
Methods compared
| Method | Peak VRAM (7B model) | Fine-tune tok/s |
|---|---|---|
| Full SFT (BF16) | ~80 GB | Does not fit |
| LoRA (BF16 base) | ~24 GB | Does not fit on 16 GB |
| LoRA (FP8 base) | ~14 GB | ~2,800 |
| QLoRA (NF4 base) | ~12 GB | ~3,200 |
| DoRA (NF4 + magnitude) | ~12.5 GB | ~2,400 (slightly slower) |
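The DoRA row differs from QLoRA only in that each adapter update is decomposed into a magnitude and a direction component. In peft this is a one-flag change on the same NF4 setup, shown in the sketch below (assuming a recent peft release that exposes the `use_dora` flag).

```python
# DoRA on the same NF4-quantized base: identical to the QLoRA config above,
# except each update is decomposed into magnitude and direction.
# Assumes a peft version that supports the use_dora flag.
from peft import LoraConfig

dora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    use_dora=True,  # per the table: ~0.5 GB extra VRAM, ~2,400 vs ~3,200 tok/s
    task_type="CAUSAL_LM",
)
```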
Throughput by model size
| Model | Method | Tokens per second | Wall time for 10K samples (2K seq) |
|---|---|---|---|
| Phi-3 Mini | QLoRA r=64 | ~7,500 | ~3 hours |
| Mistral 7B | QLoRA r=64 | ~3,400 | ~6 hours |
| Llama 3.1 8B | QLoRA r=64 | ~3,200 | ~6 hours |
| Qwen 2.5 7B | QLoRA r=64 | ~3,500 | ~6 hours |
| Gemma 2 9B | QLoRA r=64 | ~2,900 | ~7 hours |
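The wall-time column follows directly from throughput: total tokens processed is samples × sequence length × epochs, divided by tok/s. The figures above are consistent with roughly three epochs over the dataset plus data-loading and evaluation overhead; treat the epoch count as an assumption, not a stated benchmark parameter.

```python
# Back-of-the-envelope check on the wall-time column. Assumption: the figures
# reflect roughly 3 epochs over the 10K-sample dataset at a 2K sequence
# length, plus some data-loading/evaluation overhead.
def wall_time_hours(tok_per_s, samples=10_000, seq_len=2_048, epochs=3):
    total_tokens = samples * seq_len * epochs
    return total_tokens / tok_per_s / 3600

for name, tps in [("Phi-3 Mini", 7_500), ("Llama 3.1 8B", 3_200), ("Gemma 2 9B", 2_900)]:
    print(f"{name}: ~{wall_time_hours(tps):.1f} h of pure compute")
# Phi-3 Mini: ~2.3 h, Llama 3.1 8B: ~5.3 h, Gemma 2 9B: ~5.9 h — in line with
# the ~3 / ~6 / ~7 hour figures above once overhead is included.
```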
Optimizer impact
- paged_adamw_8bit — default, saves ~3 GB vs full AdamW. Use this; a config sketch follows this list.
- adafactor — slightly less VRAM than 8-bit AdamW, slower convergence.
- full AdamW — does not fit 7B models on 16 GB.
- Lion — newer, slightly faster than AdamW. Less battle-tested.
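Switching optimizers is a one-line change in transformers' TrainingArguments. The batch size and optimizer string below mirror the setup described above; the output directory, accumulation steps, and learning rate are placeholders, not benchmark settings.

```python
# Optimizer selection sketch. per_device_train_batch_size and optim mirror the
# text above; output_dir, gradient accumulation, and learning rate are
# placeholder assumptions.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qlora-out",            # placeholder
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # assumption
    learning_rate=2e-4,                # assumption
    bf16=True,
    optim="paged_adamw_8bit",          # the default recommendation above
    # optim="adafactor",               # slightly less VRAM, slower convergence
)
```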
Verdict
The 5060 Ti is a credible fine-tuning host for 7B–8B models with QLoRA: expect ~6 hours for a typical SFT job at ~12 GB peak VRAM. For 13B+ models or full SFT, step up to a 5090 or 6000 Pro.
Bottom line
For overnight QLoRA fine-tuning of 7B models, the 5060 Ti is the cheapest credible card. For deeper hyperparameter guidance, see the QLoRA fine-tune guide.