Where QLoRA trades speed for VRAM efficiency, plain LoRA on an FP16/BF16 base is faster and simpler. For a 7B model on a 24 GB RTX 3090 or 32 GB RTX 5090 from our dedicated hosting, LoRA is the right default.
Why LoRA Over QLoRA
QLoRA is only necessary when the base weights would not otherwise fit in VRAM. Mistral 7B at FP16 is ~14 GB, so the model, activations, gradients, and optimiser state all fit on a 24 GB card. Skipping the 4-bit quantisation step means:
- Faster training (no dequantise-compute-quantise loop)
- Slightly better fine-tune quality (BF16 compute instead of quantised base)
- Simpler stack
Memory Budget
| Component | ~VRAM |
|---|---|
| Base weights (FP16) | 14 GB |
| LoRA trainable | ~0.3 GB |
| Optimiser state (AdamW 8-bit) | ~2 GB |
| Activations + gradients | 4-6 GB |
| Total | ~20-22 GB |
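The LoRA row in the table can be sanity-checked from Mistral 7B's published dimensions (hidden size 4096, 32 layers, grouped-query attention with 1024-dim k/v projections, 14336-dim MLP). A rough count, as a sketch:

```python
# Rough LoRA parameter count for r=32 across the seven target modules
# listed in the training config below (dims from Mistral 7B's config.json).
r = 32
hidden, kv, mlp, layers = 4096, 1024, 14336, 32

# Each adapted Linear(in, out) adds r*(in + out) trainable parameters:
# an (in x r) A matrix plus an (r x out) B matrix.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv),
    "v_proj": (hidden, kv),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}
per_layer = sum(r * (i + o) for i, o in shapes.values())
total = per_layer * layers
print(f"{total / 1e6:.0f}M trainable params, "
      f"{total * 2 / 2**30:.2f} GB in BF16")  # ~84M, ~0.16 GB
```

Doubling that for the adapter's gradients lands on the ~0.3 GB row above.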
Training
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype="bfloat16",
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

# Adapt every attention and MLP projection, not just q/v.
lora = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
model = get_peft_model(model, lora)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./out",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=1e-4,
        bf16=True,
        num_train_epochs=3,
    ),
    train_dataset=your_dataset,  # your prepared SFT dataset
    processing_class=tokenizer,  # older TRL versions take tokenizer= instead
)
trainer.train()
```
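For sizing a run, the config above gives an effective batch of per_device_train_batch_size × gradient_accumulation_steps = 16 sequences per optimiser step. A quick sketch of the resulting step count, using a hypothetical dataset size for illustration:

```python
import math

# Values from the SFTConfig above.
per_device_bs = 4
grad_accum = 4
epochs = 3

effective_bs = per_device_bs * grad_accum  # sequences per optimiser step

# Hypothetical dataset size, for illustration only.
n_examples = 10_000
steps_per_epoch = math.ceil(n_examples / effective_bs)
total_steps = steps_per_epoch * epochs
print(effective_bs, steps_per_epoch, total_steps)  # 16 625 1875
```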
Serving the Adapter
There are two options. The first is to merge the adapter into a standalone checkpoint and serve it with vLLM:
```python
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("./mistral-7b-myfinetune")
```
Or keep it as a LoRA adapter and serve it via LoRAX's multi-LoRA serving, which is the better fit when you host many small fine-tunes against one base model.
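LoRAX has its own deployment flow (see its docs); as an illustration of both serving styles, here is a vLLM command sketch using the paths from the code above. Flag names may vary between vLLM versions, and the adapter name `myfinetune` is illustrative:

```shell
# Option 1: serve the merged checkpoint produced by save_pretrained above.
vllm serve ./mistral-7b-myfinetune

# Option 2: serve the unmodified base model with the adapter attached,
# using vLLM's own multi-LoRA support.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
    --enable-lora \
    --lora-modules myfinetune=./out
```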
Fine-Tuning and Serving on One Server
UK dedicated GPUs sized for LoRA training and production inference.
Browse GPU Servers

See also: QLoRA on Llama 3.3 70B and Unsloth.