QLoRA quantises the frozen base model to 4-bit NF4 during training while keeping the trainable LoRA adapters in BF16. That buys roughly 4x the memory headroom over plain LoRA and makes 14B-class models trainable on a 16 GB card. On the RTX 5060 Ti 16GB via our dedicated GPU hosting, QLoRA on Qwen 2.5 14B comfortably fits and finishes overnight.
What QLoRA Changes
Three pieces compared to plain LoRA:
- 4-bit NF4 base weights via bitsandbytes – a lossy but surprisingly effective quantisation format
- Double quantisation – quantisation constants themselves quantised, shaving another ~0.4 GB on a 14B
- Paged optimiser – AdamW state swaps to host RAM on OOM pressure, rare on a 16 GB card with LoRA-only optimiser state
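For non-Unsloth stacks, the same three pieces map onto a plain transformers + bitsandbytes setup. A minimal sketch (the model ID and option values are illustrative, taken from the config table below):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# The three QLoRA pieces: NF4 base, double quantisation, BF16 compute.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NF4 base weights
    bnb_4bit_use_double_quant=True,        # quantise the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dequantise to BF16 for matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
# The third piece, the paged optimiser, is selected in the trainer config:
# optim="paged_adamw_8bit"
```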
VRAM Math – Qwen 2.5 14B
| Component | Plain LoRA | QLoRA |
|---|---|---|
| Base weights | ~28 GB (BF16) | ~7.5 GB (4-bit NF4) |
| LoRA adapter | ~40 MB | ~40 MB |
| Optimiser state | ~160 MB | ~160 MB |
| Gradients | ~80 MB | ~80 MB |
| Activations (bs 1, 4k, checkpointed) | ~3 GB | ~3 GB |
| Buffer and kernels | ~1 GB | ~1 GB |
| Peak VRAM | ~32.3 GB (won’t fit) | ~11.8 GB (fits 16 GB) |
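The peak figures are just the column sums; a quick sanity check in Python (component estimates copied from the table):

```python
# Sum the per-component estimates from the table above (values in GB).
def peak_vram_gb(base_weights_gb):
    adapter, optimiser, grads = 0.04, 0.16, 0.08  # LoRA-only trainable state
    activations, buffers = 3.0, 1.0               # bs 1, 4k seq, checkpointed
    return base_weights_gb + adapter + optimiser + grads + activations + buffers

print(round(peak_vram_gb(28.0), 1))  # BF16 base: 32.3 GB -- won't fit 16 GB
print(round(peak_vram_gb(7.5), 1))   # NF4 base: 11.8 GB -- fits
```

Only the base-weight term changes between the two columns; everything trainable is identical, which is why QLoRA costs almost nothing in optimiser or gradient memory.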
Config
| Parameter | Recommended |
|---|---|
| Base model | Qwen 2.5 14B Instruct, Llama 3.1 8B, Mistral Nemo 12B |
| Quantisation | 4-bit NF4, double-quant enabled |
| Compute dtype | bfloat16 |
| LoRA r / alpha | 16 / 32 |
| Target modules | q, k, v, o, gate, up, down |
| Max seq length | 4096 (2048 for 14B if tight) |
| Batch size | 1-2, grad accum 16 or 8 (effective batch 16) |
| Learning rate | 1e-4 for 14B, 2e-4 for 7B |
| Gradient checkpointing | Unsloth variant |
| Optimiser | paged_adamw_8bit |
Training Code
```python
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-14B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,   # NF4 base weights
    dtype=torch.bfloat16,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=42,
)
SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=train_ds, eval_dataset=eval_ds,
    args=SFTConfig(
        output_dir="./qwen14b-qlora",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-4,
        bf16=True,
        optim="paged_adamw_8bit",
        logging_steps=10, eval_strategy="epoch",
    ),
).train()
```
Expected Time
| Model | Dataset | Tokens/sec | Time per epoch (2 M tokens) |
|---|---|---|---|
| Llama 3.1 8B | 2,000 ex | ~4,100 | ~8 min |
| Mistral Nemo 12B | 2,000 ex | ~3,100 | ~11 min |
| Qwen 2.5 14B | 2,000 ex | ~2,600 | ~13 min |
| Qwen 2.5 14B | 20,000 ex (20 M tokens) | ~2,600 | ~2 h per epoch, ~6 h for 3 epochs |
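The epoch times follow directly from token count divided by measured throughput; a one-line check against the table:

```python
# Epoch wall-clock time from dataset size and measured throughput.
def epoch_minutes(total_tokens, tokens_per_sec):
    return total_tokens / tokens_per_sec / 60

print(round(epoch_minutes(2_000_000, 2600)))           # Qwen 14B, 2M tokens: ~13 min
print(round(epoch_minutes(20_000_000, 2600) / 60, 1))  # 20M tokens: ~2.1 h per epoch
```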
QLoRA vs LoRA – When to Use Which
- QLoRA – when you need to fine-tune a larger base than VRAM allows, or when you want headroom for longer sequences / bigger batches
- LoRA – when the base fits in BF16 and you want maximum training quality and speed (NF4 adds slight noise)
- Quality delta – typically <1% eval-loss difference; QLoRA is production-acceptable for almost all tasks
- Merge caveat – QLoRA adapters merge back into a BF16 base; the NF4 quantisation is training-time only
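The merge caveat in practice: with Unsloth the adapter is merged into a BF16 copy of the base at save time. A sketch assuming the trainer above has finished (output directory names are illustrative; `save_method` values follow Unsloth's API):

```python
# Merge the LoRA adapter into a BF16 base and save -- NF4 was training-time only.
model.save_pretrained_merged(
    "qwen14b-qlora-merged", tok, save_method="merged_16bit",
)
# Or keep the adapter alone (~40 MB) and apply it to any copy of the base later:
model.save_pretrained("qwen14b-qlora-adapter")
tok.save_pretrained("qwen14b-qlora-adapter")
```

The merged BF16 checkpoint is what you hand to an inference server; quantise again there (e.g. to AWQ or GPTQ) if serving memory is tight.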
QLoRA on Blackwell 16 GB
Fine-tune up to Qwen 14B overnight on one card. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: QLoRA training speed, LoRA guide, Unsloth speed, fine-tune throughput, vLLM setup.