QLoRA quantises the frozen base model to 4-bit NF4 during training while keeping the trainable LoRA adapters in BF16. That buys roughly 4x the memory headroom over plain LoRA and makes 14B-class models trainable on a 16 GB card. On the RTX 5060 Ti 16GB via our dedicated GPU hosting, QLoRA on Qwen 2.5 14B comfortably fits and finishes overnight.
What QLoRA Changes
Three pieces compared to plain LoRA:
- 4-bit NF4 base weights via bitsandbytes – a lossy but surprisingly effective quantisation format
- Double quantisation – quantisation constants themselves quantised, shaving another ~0.4 GB on a 14B
- Paged optimiser – AdamW state swaps to host RAM on OOM pressure, rare on a 16 GB card with LoRA-only optimiser state
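For non-Unsloth stacks, the same three pieces map onto a plain transformers + bitsandbytes setup. A minimal sketch (the model ID and option values are illustrative, taken from the config table below):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# The three QLoRA pieces: NF4 base, double quantisation, BF16 compute.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NF4 base weights
    bnb_4bit_use_double_quant=True,        # quantise the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dequantise to BF16 for matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
# The third piece, the paged optimiser, is selected in the trainer config:
# optim="paged_adamw_8bit"
```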
VRAM Math – Qwen 2.5 14B
| Component | Plain LoRA | QLoRA |
|---|---|---|
| Base weights | ~28 GB (BF16) | ~7.5 GB (4-bit NF4) |
| LoRA adapter | ~40 MB | ~40 MB |
| Optimiser state | ~160 MB | ~160 MB |
| Gradients | ~80 MB | ~80 MB |
| Activations (bs 1, 4k, checkpointed) | ~3 GB | ~3 GB |
| Buffer and kernels | ~1 GB | ~1 GB |
| Peak VRAM | ~32.3 GB (won’t fit) | ~11.8 GB (fits 16 GB) |
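The peak figures are just the column sums; a quick sanity check in Python (component estimates copied from the table):

```python
# Sum the per-component estimates from the table above (values in GB).
def peak_vram_gb(base_weights_gb):
    adapter, optimiser, grads = 0.04, 0.16, 0.08  # LoRA-only trainable state
    activations, buffers = 3.0, 1.0               # bs 1, 4k seq, checkpointed
    return base_weights_gb + adapter + optimiser + grads + activations + buffers

print(round(peak_vram_gb(28.0), 1))  # BF16 base: 32.3 GB -- won't fit 16 GB
print(round(peak_vram_gb(7.5), 1))   # NF4 base: 11.8 GB -- fits
```

Only the base-weight term changes between the two columns; everything trainable is identical, which is why QLoRA costs almost nothing in optimiser or gradient memory.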
Config
| Parameter | Recommended |
|---|---|
| Base model | Qwen 2.5 14B Instruct, Llama 3.1 8B, Mistral Nemo 12B |
| Quantisation | 4-bit NF4, double-quant enabled |
| Compute dtype | bfloat16 |
| LoRA r / alpha | 16 / 32 |
| Target modules | q, k, v, o, gate, up, down |
| Max seq length | 4096 (2048 for 14B if tight) |
| Batch size | 1-2, grad accum 16 or 8 (effective batch 16) |
| Learning rate | 1e-4 for 14B, 2e-4 for 7B |
| Gradient checkpointing | Unsloth variant |
| Optimiser | paged_adamw_8bit |
Training Code
```python
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-14B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,   # NF4 base weights
    dtype=torch.bfloat16,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=42,
)
SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=train_ds, eval_dataset=eval_ds,
    args=SFTConfig(
        output_dir="./qwen14b-qlora",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-4,
        bf16=True,
        optim="paged_adamw_8bit",
        logging_steps=10, eval_strategy="epoch",
    ),
).train()
```
Expected Time
| Model | Dataset | Tokens/sec | Time per epoch (2 M tokens) |
|---|---|---|---|
| Llama 3.1 8B | 2,000 ex | ~4,100 | ~8 min |
| Mistral Nemo 12B | 2,000 ex | ~3,100 | ~11 min |
| Qwen 2.5 14B | 2,000 ex | ~2,600 | ~13 min |
| Qwen 2.5 14B | 20,000 ex (20 M tokens) | ~2,600 | ~2 h per epoch, ~6 h for 3 epochs |
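The epoch times follow directly from token count divided by measured throughput; a one-line check against the table:

```python
# Epoch wall-clock time from dataset size and measured throughput.
def epoch_minutes(total_tokens, tokens_per_sec):
    return total_tokens / tokens_per_sec / 60

print(round(epoch_minutes(2_000_000, 2600)))           # Qwen 14B, 2M tokens: ~13 min
print(round(epoch_minutes(20_000_000, 2600) / 60, 1))  # 20M tokens: ~2.1 h per epoch
```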
QLoRA vs LoRA – When to Use Which
- QLoRA – when you need to fine-tune a larger base than VRAM allows, or when you want headroom for longer sequences / bigger batches
- LoRA – when the base fits in BF16 and you want maximum training quality and speed (NF4 adds slight noise)
- Quality delta – typically <1% eval-loss difference; QLoRA is production-acceptable for almost all tasks
- Merge caveat – QLoRA adapters merge back into a BF16 base; the NF4 quantisation is training-time only
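The merge caveat in practice: with Unsloth the adapter is merged into a BF16 copy of the base at save time. A sketch assuming the trainer above has finished (output directory names are illustrative; `save_method` values follow Unsloth's API):

```python
# Merge the LoRA adapter into a BF16 base and save -- NF4 was training-time only.
model.save_pretrained_merged(
    "qwen14b-qlora-merged", tok, save_method="merged_16bit",
)
# Or keep the adapter alone (~40 MB) and apply it to any copy of the base later:
model.save_pretrained("qwen14b-qlora-adapter")
tok.save_pretrained("qwen14b-qlora-adapter")
```

The merged BF16 checkpoint is what you hand to an inference server; quantise again there (e.g. to AWQ or GPTQ) if serving memory is tight.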
QLoRA on Blackwell 16 GB
Fine-tune up to Qwen 14B overnight on one card. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: QLoRA training speed, LoRA guide, Unsloth speed, fine-tune throughput, vLLM setup.