
LoRA vs QLoRA vs Full Fine-Tuning: GPU Requirements

A practical comparison of LoRA, QLoRA, and full fine-tuning GPU requirements for 7B to 70B models, covering VRAM, speed, quality, and cost trade-offs.

Three Approaches to Fine-Tuning

When fine-tuning an LLM on a dedicated GPU server, you have three main approaches. Full fine-tuning updates every parameter but requires enormous VRAM. LoRA (Low-Rank Adaptation) trains small adapter matrices alongside frozen base weights, cutting VRAM by 60-80%. QLoRA adds 4-bit quantisation of the base model, reducing VRAM by 85-90%. This guide compares all three across multiple model sizes.

For model-specific guides, see our tutorials for LLaMA 3 8B, Mistral 7B, and DeepSeek. For the overall GPU selection, check the best GPU for fine-tuning LLMs guide.
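
The VRAM savings in the next section follow directly from parameter counts. A quick pure-Python sketch makes the arithmetic concrete: fully fine-tuning one d x d weight matrix trains d² parameters, while a rank-r LoRA adapter trains only 2·d·r. The figures d=4096 and r=16 below are illustrative assumptions (a typical attention width and the rank used throughout this guide), not measurements from this article.

```python
# Why LoRA slashes the trainable-parameter count: compare one full
# d x d matrix against its rank-r adapter pair (A: r x d, B: d x r).
# d=4096 and r=16 are illustrative values, not figures from this article.

def full_ft_params(d: int) -> int:
    """Trainable params when fine-tuning one d x d weight matrix."""
    return d * d

def lora_params(d: int, r: int) -> int:
    """Trainable params for a rank-r LoRA adapter on that matrix."""
    return 2 * d * r

d, r = 4096, 16
full = full_ft_params(d)
lora = lora_params(d, r)
print(f"LoRA trains {lora / full:.2%} of the weights per matrix")
```

Under 1% of the weights receive gradients, which is why optimiser state for the frozen base model disappears entirely.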

VRAM Comparison by Model Size

The table below shows VRAM requirements for each method at sequence length 512, batch size 4 (or gradient-accumulated equivalent). LoRA and QLoRA use rank 16 targeting attention layers.

| Model Size | Full FT (FP16) | LoRA (FP16) | QLoRA (INT4) |
|---|---|---|---|
| 7B (Mistral 7B) | ~72 GB | ~26 GB | ~13 GB |
| 8B (LLaMA 3 8B) | ~80 GB | ~28 GB | ~14 GB |
| 13-14B | ~140 GB | ~45 GB | ~22 GB |
| 70B (LLaMA 3 70B) | ~700 GB | ~180 GB | ~52 GB |
| 72B (Qwen 2.5 72B) | ~720 GB | ~185 GB | ~54 GB |

Key takeaways:

  • Full fine-tuning: requires roughly 8-10x the model's weight size because Adam keeps two extra optimiser states (first- and second-moment estimates) for every parameter, on top of gradients and activations. A 7B model needs ~72 GB, beyond any single consumer GPU.
  • LoRA: eliminates optimiser states for most parameters. A 7B model fits on a single RTX 3090 (24 GB) with headroom.
  • QLoRA: quantises the frozen base model to INT4, cutting weight memory by 75%. A 7B model needs only ~13 GB, fitting on budget 16 GB GPUs. Even 70B models become trainable on 2-3 consumer GPUs.

For precise estimates at your exact configuration, use our fine-tuning VRAM calculator.
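
As a rougher alternative, the table's rows can be collapsed into per-billion-parameter multipliers. The sketch below fits those multipliers loosely to the 7B row (~10.3, ~3.7, and ~1.9 GB per billion parameters); these are back-of-envelope assumptions, not measured constants, and they drift at scale (QLoRA's per-parameter cost falls noticeably for 70B-class models), so prefer the calculator for real planning.

```python
# Back-of-envelope VRAM estimate per method. The GB-per-billion-parameter
# multipliers are rough assumptions fitted to the 7B row of the table
# above; they overestimate QLoRA at 70B scale, where its ratio shrinks.

GB_PER_BILLION = {
    "full": 10.3,   # weights + gradients + Adam states + activations
    "lora": 3.7,    # frozen FP16 weights + adapter states + activations
    "qlora": 1.9,   # INT4 weights + adapter states + activations
}

def estimate_vram_gb(params_billions: float, method: str) -> float:
    return params_billions * GB_PER_BILLION[method]

for method in GB_PER_BILLION:
    print(f"7B via {method}: ~{estimate_vram_gb(7, method):.0f} GB")
```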

Training Speed Comparison

Training speed on an RTX 3090 (24 GB) for a 7B model, 1K examples, sequence length 512, 3 epochs.

| Method | Tokens/sec (training) | Time for 1K Examples | Relative Speed |
|---|---|---|---|
| Full fine-tuning | N/A on single GPU | N/A | N/A |
| LoRA (FP16, r=16) | ~850 | ~32 min | 1.0x (baseline) |
| QLoRA (INT4, r=16) | ~720 | ~24 min | 0.85x throughput, faster wall-clock |

QLoRA posts lower tokens-per-second at a matched batch size, since INT4 dequantisation adds overhead, yet it finishes sooner in wall-clock terms: the VRAM it frees allows larger batch sizes, which raises effective throughput. For timing across more GPUs, see our fine-tuning time by GPU benchmarks.
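
The "gradient-accumulated equivalent" mentioned earlier, and this batch-size-for-VRAM trade, both rest on the same mechanic: summing micro-batch gradients and stepping once. A minimal pure-Python sketch with a scalar "model" (all names illustrative, not from any training framework) shows the two are mathematically equivalent:

```python
# Gradient accumulation sketch: four micro-batches of 1 produce the
# same optimiser step as one batch of 4. A scalar parameter with loss
# 0.5*(w - y)^2 keeps the arithmetic transparent; names are illustrative.

def grad(w: float, y: float) -> float:
    return w - y          # d/dw of 0.5*(w - y)^2

targets = [1.0, 2.0, 3.0, 4.0]
w, lr = 0.0, 0.1

# One full batch of 4: average the gradients, take a single step.
batch_step = w - lr * sum(grad(w, y) for y in targets) / len(targets)

# Four micro-batches with accumulation: sum, normalise, then step.
acc = 0.0
for y in targets:
    acc += grad(w, y)
accum_step = w - lr * acc / len(targets)

print(batch_step, accum_step)   # identical updates
```

In practice this is what a trainer's gradient-accumulation setting does: it trades wall-clock time for VRAM, while QLoRA's freed VRAM lets you do the opposite.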

Quality Differences

We compared the three methods on a standardised instruction-following benchmark after fine-tuning LLaMA 3 8B on 10K examples.

| Method | Task Accuracy | Reasoning Quality | Instruction Following |
|---|---|---|---|
| Full fine-tuning | 92% | 91% | 94% |
| LoRA (r=16) | 90% | 89% | 92% |
| LoRA (r=64) | 91% | 90% | 93% |
| QLoRA (r=16) | 89% | 88% | 91% |
| QLoRA (r=64) | 90% | 89% | 92% |

Full fine-tuning sets the ceiling, but LoRA and QLoRA come within 1-3% on most metrics. Increasing LoRA rank from 16 to 64 recovers roughly half the gap. For most production use cases, the quality difference is not significant enough to justify the dramatically higher hardware costs of full fine-tuning.
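
Raising the rank from 16 to 64 quadruples the adapter's trainable parameters while leaving everything else frozen, so the VRAM cost of that quality recovery is small. The sketch below totals adapter parameters across a LLaMA-3-8B-like shape (32 layers with 4 attention projections of width 4096, assumed here for illustration rather than measured):

```python
# Adapter size vs. LoRA rank. Each targeted d x d matrix gains two
# trainable matrices, A (r x d) and B (d x r). The 32-layer / 4-projection
# / width-4096 shape is an illustrative LLaMA-3-8B-like assumption.

def adapter_params(d: int, r: int, n_matrices: int) -> int:
    return n_matrices * 2 * d * r

n = 32 * 4  # layers x attention projections targeted
for r in (16, 64):
    millions = adapter_params(4096, r, n) / 1e6
    print(f"r={r}: {millions:.1f}M trainable params")
```

Even at r=64 the adapter stays in the tens of millions of parameters, a rounding error next to the 8B frozen base.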

Decision Guide

  • Use QLoRA when: budget is limited, VRAM is under 24 GB, you need to fine-tune 70B+ models, or you are iterating quickly on multiple experiments. Best for most users.
  • Use LoRA (FP16) when: you have 24+ GB VRAM, quality on nuanced tasks is critical, or you plan to merge and redistribute the adapter. Marginal quality improvement over QLoRA.
  • Use full fine-tuning when: you have RTX 6000 Pro-class cards or multi-GPU clusters, need maximum quality, or are making fundamental behaviour changes to the model. Expect to shard training across multiple GPUs with FSDP or DeepSpeed.
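
The guide above collapses into a rough rule of thumb. The thresholds in this sketch are assumptions drawn from the bullets (24 GB for FP16 LoRA, roughly 10 GB per billion parameters for full fine-tuning), not hard limits:

```python
# Rough method chooser mirroring the decision guide above.
# Thresholds are illustrative assumptions, not hard limits.

def pick_method(vram_gb: float, model_billions: float,
                max_quality: bool = False) -> str:
    if max_quality and vram_gb >= 10 * model_billions:
        return "full"     # datacenter-scale memory is available
    if vram_gb >= 24 and model_billions <= 14:
        return "lora"     # FP16 LoRA fits with headroom
    return "qlora"        # default for tight VRAM or 70B+ models

print(pick_method(16, 7))    # budget 16 GB card
print(pick_method(24, 7))    # RTX 3090-class card
print(pick_method(48, 70))   # big model, modest VRAM
```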

After training, deploy your fine-tuned model with GPTQ or AWQ quantisation for fast inference on your LLM hosting server. Browse all tutorials in the Tutorials category.

Conclusion

QLoRA is the right default for most fine-tuning tasks — it delivers 97-99% of full fine-tuning quality at 10-20% of the VRAM cost. LoRA at FP16 offers a modest quality bump for users with 24+ GB GPUs. Full fine-tuning is reserved for large-scale, high-budget projects. Match your method to your GPU budget, and you will get excellent results without overspending on GPU hosting.

Start Fine-Tuning on Dedicated GPUs

From 16 GB budget GPUs for QLoRA to multi-GPU clusters for full fine-tuning. PyTorch and PEFT pre-installed.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
