
LoRA Fine-Tuning Mistral 7B on a Dedicated GPU

LoRA at FP16 works comfortably on a 24GB GPU for Mistral 7B - the fastest practical path to a fine-tuned model you can serve.

Where QLoRA is VRAM-efficient and slow, plain LoRA at FP16 is fast and readable. For a 7B model on a 3090 or 5090 from our dedicated hosting, LoRA is the right default.


Why LoRA Over QLoRA

QLoRA is necessary only when the base weights would not fit in VRAM. Mistral 7B at FP16 is about 14 GB, so the model, activations, gradients, and optimiser state all fit on a 24 GB card. Skipping the 4-bit quantisation step means:

  • Faster training (no dequantise-compute-quantise loop)
  • Slightly better fine-tune quality (BF16 compute instead of quantised base)
  • Simpler stack
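As a sanity check on the 14 GB figure: FP16 stores two bytes per parameter, so Mistral 7B's roughly 7.2 billion parameters (an approximate count) come out just under the 14 GB quoted above:

```python
# Back-of-envelope FP16 footprint for Mistral 7B (~7.2B parameters assumed).
params = 7.2e9
bytes_fp16 = params * 2       # two bytes per parameter at FP16/BF16
gb = bytes_fp16 / 1e9         # decimal gigabytes
gib = bytes_fp16 / 2**30      # binary gibibytes, which is what nvidia-smi reports
print(f"{gb:.1f} GB ({gib:.1f} GiB)")  # → 14.4 GB (13.4 GiB)
```

The GB/GiB distinction matters when you are squeezing into 24 GB: the card's usable budget is 24 GiB, and monitoring tools report GiB.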

Memory Budget

| Component                     | ~VRAM     |
|-------------------------------|-----------|
| Base weights (FP16)           | 14 GB     |
| LoRA trainables               | ~0.3 GB   |
| Optimiser state (AdamW 8-bit) | ~2 GB     |
| Activations + gradients       | 4-6 GB    |
| Total                         | ~20-22 GB |
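The ~0.3 GB trainables line can be derived by hand. A rough sketch, assuming Mistral 7B's published architecture (hidden size 4096, 32 layers, intermediate size 14336, grouped-query attention with a 1024-wide kv projection) and the r=32 LoRA config used below:

```python
# Each LoRA pair adds r * (d_in + d_out) parameters per adapted module.
r, layers = 32, 32
hidden, inter, kv = 4096, 14336, 1024

modules = {
    "q_proj":    (hidden, hidden),
    "k_proj":    (hidden, kv),
    "v_proj":    (hidden, kv),
    "o_proj":    (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj":   (hidden, inter),
    "down_proj": (inter, hidden),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable params, "
      f"~{total * 2 / 2**30:.2f} GiB at BF16")
# → 83.9M trainable params, ~0.16 GiB at BF16
```

Doubling for the gradients held alongside the weights lands at roughly 0.3 GB, matching the table.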

Training

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",   # BF16 base weights: ~14 GB
    device_map="cuda",
)

# Adapt every attention and MLP projection; r=32 is a solid default for a 7B model.
lora = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=SFTConfig(
        output_dir="./out",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # effective batch size of 16
        learning_rate=1e-4,
        bf16=True,
        num_train_epochs=3,
    ),
    train_dataset=your_dataset,   # your prepared dataset
)
trainer.train()
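The batch settings above give an effective batch size of 16 samples per optimiser step (4 per device × 4 accumulation steps). As a quick sketch of what that means for training length, using a hypothetical dataset of 10,000 examples:

```python
import math

per_device, accum, epochs = 4, 4, 3
effective_batch = per_device * accum                      # samples per optimiser step
dataset_size = 10_000                                     # hypothetical dataset size
steps_per_epoch = math.ceil(dataset_size / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)      # → 16 625 1875
```

Raising gradient_accumulation_steps is the free lever here: it grows the effective batch without touching peak VRAM, at the cost of fewer optimiser steps per epoch.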

Serving the Adapter

There are two options. Merge the LoRA weights into a standalone checkpoint and serve it with vLLM:

merged = model.merge_and_unload()                  # folds the LoRA deltas into the base weights
merged.save_pretrained("./mistral-7b-myfinetune")  # a standard HF checkpoint vLLM can load

Or keep it as a LoRA adapter and serve it via LoRAX's multi-LoRA serving, a better fit when you run many small fine-tunes against one base model.

Fine-Tuning and Serving on One Server

UK dedicated GPUs sized for LoRA training and production inference.

Browse GPU Servers

See QLoRA on Llama 3.3 70B and Unsloth.



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
