QLoRA freezes the quantised base weights and trains low-rank adapters on top. It is the realistic way to fine-tune 7B-14B models on 16 GB of VRAM. The numbers below were measured on the RTX 5060 Ti 16GB in our hosting environment:
## Stack
- Transformers 4.46, PEFT 0.13, bitsandbytes 0.44, Accelerate 1.0
- 4-bit NF4 quantisation, double quant enabled, bf16 compute dtype
- Paged 8-bit AdamW optimiser (`paged_adamw_8bit` from bitsandbytes)
- FlashAttention 2.6
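For reference, a minimal setup matching this stack might look like the following. The model name, LoRA rank, alpha, and target modules are illustrative assumptions (common QLoRA defaults), not values taken from the benchmark:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 with double quantisation and bf16 compute, as in the stack above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",             # illustrative; any 7B-14B causal LM
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # FlashAttention 2, as listed above
    torch_dtype=torch.bfloat16,
)

# LoRA hyperparameters below are assumptions, not benchmarked values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```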
## Per-Step Timings (seconds per iteration)
| Model | Seq 1024 | Seq 2048 | Seq 4096 |
|---|---|---|---|
| Mistral 7B (bs=4) | 0.78 | 1.55 | 3.20 |
| Llama 3 8B (bs=4) | 0.85 | 1.68 | 3.45 |
| Gemma 2 9B (bs=2) | 0.62 | 1.30 | 2.80 |
| Qwen 2.5 14B (bs=2) | 1.05 | 2.10 | OOM |
## Tokens/sec
| Model | Config | tokens/s |
|---|---|---|
| Llama 3 8B | bs=4, seq=2048 | 4,900 |
| Mistral 7B | bs=4, seq=2048 | 5,300 |
| Qwen 2.5 14B | bs=2, seq=2048 | 1,950 |
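The two tables are consistent: tokens/s is simply batch size × sequence length divided by the step time from the first table.

```python
# Throughput check: tokens/s = batch_size * seq_len / sec_per_step
print(4 * 2048 / 1.68)  # Llama 3 8B   -> ~4876, matches ~4,900
print(4 * 2048 / 1.55)  # Mistral 7B   -> ~5285, matches ~5,300
print(2 * 2048 / 2.10)  # Qwen 2.5 14B -> ~1950, matches 1,950
```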
## Memory Usage
| Model | Config | Peak VRAM |
|---|---|---|
| Llama 3 8B | bs=4, seq=2048 | 11.8 GB |
| Llama 3 8B | bs=2, seq=4096 | 13.2 GB |
| Qwen 2.5 14B | bs=2, seq=2048 | 14.5 GB |
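One way to reproduce peak-VRAM readings like these is PyTorch's allocator stats; note this reports allocated memory, which can sit slightly below what nvidia-smi shows as reserved. A minimal sketch:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a handful of training steps here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM (allocated): {peak_gb:.1f} GB")
```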
## Recommended Recipe
- Llama 3 8B at seq 2048, bs 4 – fits with 4 GB headroom, ~5k tokens/s
- Enable gradient checkpointing if pushing to seq 4096
- Set the effective batch size via `gradient_accumulation_steps=4` to reach eff_bs=16 (see the config sketch after this list)
- Use paged AdamW to avoid optimiser-state OOM on longer sequences
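Put together, the recipe maps onto Hugging Face `TrainingArguments` roughly as below. The learning rate and output directory are placeholder assumptions; the rest follows the bullets above:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama3-8b-qlora",    # placeholder
    per_device_train_batch_size=4,   # bs=4 at seq 2048, ~11.8 GB peak
    gradient_accumulation_steps=4,   # effective batch size 16
    gradient_checkpointing=True,     # recommended when pushing to seq 4096
    optim="paged_adamw_8bit",        # paged AdamW optimiser states
    bf16=True,                       # bf16 compute dtype
    learning_rate=2e-4,              # assumption: a common QLoRA default
    logging_steps=10,
)
```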
For higher throughput, switch to Unsloth – same results in roughly half the time.
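A sketch of the equivalent Unsloth setup, assuming its `FastLanguageModel` API; the checkpoint name and LoRA settings are illustrative, not benchmarked values:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative pre-quantised checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```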
See also: QLoRA guide, LoRA speed, Unsloth, fine-tune throughput, LoRA guide.