Tutorials

QLoRA on RTX 4090 24GB: Fine-Tune Llama 3 70B in 24 GB

Production QLoRA recipes for the RTX 4090 24GB, including the full Llama 3.1 70B run with NF4, paged AdamW 8-bit, gradient checkpointing and verified throughput.

QLoRA is the trick that lets a single RTX 4090 24GB dedicated server fine-tune models that would normally demand an H100 80 GB or a multi-card A100 box. By quantising the frozen base weights to NF4 and training only a small BF16 LoRA adapter, the 24 GB of GDDR6X on the Ada AD102 stops being a hard ceiling. The economics are simple: you pay £550/month for a dedicated card and can iterate a Llama 3.1 70B adapter project to completion within a fortnight, where renting H100 hours for the same work costs $1,500-3,000. This guide gives the exact recipes we run on GigaGPU dedicated boxes for everything from a 7B fine-tune to the headline 70B fit.


Why QLoRA fits the 4090

QLoRA stores the frozen base weights in NF4 (about 4.5 bits effective once double-quantisation overhead is accounted for), keeps an 8-bit paged AdamW optimiser, and trains LoRA adapters in BF16. On a 4090 the 1008 GB/s GDDR6X bandwidth keeps NF4 dequantisation fed during the forward pass, and the 165 TFLOPS of BF16 tensor throughput handles the adapter math comfortably. The paged optimiser is the load-bearing piece: bitsandbytes pages optimiser state to CPU memory between micro-batches so that what would be a 14 GB AdamW state for a 70B model collapses to roughly 3.6 GB resident.
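Under the hood this is just an optimiser that only ever sees the adapter parameters. A minimal sketch, assuming a model already wrapped with a PEFT adapter as in the training script further down (the recipe below selects the same optimiser through the optim string, so you never construct it by hand):

import bitsandbytes as bnb

# Only the LoRA adapter weights require gradients; the NF4 base is frozen,
# so the AdamW moment buffers scale with the adapter, not the 70B base model.
trainable_params = [p for p in model.parameters() if p.requires_grad]

# Paged 8-bit AdamW: quantised optimiser state that the CUDA driver can
# evict to host memory when the 24 GB card is under pressure.
optimizer = bnb.optim.PagedAdamW8bit(trainable_params, lr=1e-4)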

| Model | NF4 base | Adapter (r=64) | Optimiser (paged) | Activations (seq 1024-4096) | Total VRAM |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 4.5 GB | 0.13 GB | 0.6 GB | 3.5 GB @ 4096 | ~9-12 GB |
| Mistral Small 22B | 11.8 GB | 0.31 GB | 1.4 GB | 2.0 GB @ 2048 | ~17 GB |
| Qwen 2.5 32B | 17.4 GB | 0.42 GB | 1.9 GB | 1.5 GB @ 2048 | ~21.8 GB |
| Llama 3.1 70B | 18.9 GB | 0.84 GB | 3.6 GB | ~0.9 GB @ 1024 | ~23.4 GB |

Memory budget on 24 GB

The 70B fit is real but tight. Activations swallow whatever VRAM you do not pre-allocate, so gradient checkpointing is mandatory, sequence length is capped at 1024 for the 70B, and per-device batch is fixed at 1 with gradient accumulation handling the effective batch. The 8B and 14B sit comfortably at sequence length 4096 with batch 4-8, leaving 5-8 GB free.

If you are seeing OOM on the 70B run, the order of escalating fixes is: drop sequence length to 768, then drop LoRA rank to 8, then drop target modules to q_proj, v_proj only. Each step costs adapter quality but recovers 1-3 GB. For broader scaling references compare the Llama 70B INT4 deployment footprint and the 70B INT4 benchmark.
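As a sketch of that last-resort configuration (illustrative values only, mirroring the escalation above; swap it into the 70B script below in place of its LoraConfig and max_seq_length):

from peft import LoraConfig

# Smallest credible adapter: rank 8, attention q/v projections only.
# Expect a measurable quality drop versus r=16 across attention + MLP.
fallback_lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none", task_type="CAUSAL_LM",
)

# Pair it with the shorter context from step one of the escalation.
MAX_SEQ_LENGTH = 768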

Environment and dependencies

Standardise on CUDA 12.4, PyTorch 2.4, bitsandbytes 0.44, transformers 4.45, peft 0.13 and trl 0.11. The bitsandbytes 0.44 release shipped the paged AdamW kernel that makes the 70B fit reliably; older versions intermittently OOM on the optimiser allocation.

export TORCH_CUDA_ARCH_LIST=8.9   # Ada is SM 8.9; scopes the flash-attn build if no prebuilt wheel matches
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.45.2 peft==0.13.0 trl==0.11.4 \
            bitsandbytes==0.44.1 accelerate==1.0.1 datasets==3.0.1 \
            flash-attn==2.6.3 --no-build-isolation
export HF_TOKEN=hf_yourtoken
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

expandable_segments:True reduces fragmentation on long runs, which matters when paged AdamW is moving optimiser state in and out of GPU memory thousands of times.
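If you launch from a notebook or a wrapper script rather than a shell, the same setting can be applied in Python, provided it happens before the CUDA allocator is first touched; a small sketch:

import os

# Must be set before torch allocates anything on the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the allocator config on purpose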

Hyperparameters that work

| Knob | 8B | 14B | 32B | 70B |
| --- | --- | --- | --- | --- |
| LoRA r | 64 | 64 | 32 | 16 |
| LoRA alpha | 128 | 128 | 64 | 32 |
| LoRA dropout | 0.05 | 0.05 | 0.05 | 0.05 |
| Target modules | all-linear | all-linear | all-linear | q,k,v,o + MLP |
| Sequence length | 4096 | 4096 | 2048 | 1024 (push to 2048 if fits) |
| Per-device batch | 8 | 4 | 2 | 1 |
| Grad accum | 2 | 4 | 8 | 8 |
| Effective batch | 16 | 16 | 16 | 8 |
| Learning rate | 2e-4 | 2e-4 | 1.5e-4 | 1e-4 |
| Warmup ratio | 0.03 | 0.03 | 0.05 | 0.05 |
| Max grad norm | 1.0 | 1.0 | 0.5 | 0.3 |
| Optimiser | paged_adamw_8bit | paged_adamw_8bit | paged_adamw_8bit | paged_adamw_8bit |

The 70B is genuinely fiddly: keep max_grad_norm=0.3 to suppress the occasional spike that comes from training a tiny adapter on a giant model, and resist the urge to push LR above 1e-4 even though the loss falls slowly at first.

Llama 70B QLoRA training script

This is the headline recipe. NF4 with double quantisation, FlashAttention 2, gradient checkpointing, paged AdamW 8-bit. Save as train_70b_qlora.py.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# NF4 with double quantisation; adapter maths runs in BF16
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True, bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(
  "meta-llama/Llama-3.1-70B-Instruct",
  quantization_config=bnb_config, device_map="auto",
  attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # KV cache is incompatible with gradient checkpointing

# r=16 over attention and MLP projections -- the 70B column of the table above
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
  target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
  bias="none", task_type="CAUSAL_LM")

# train.jsonl is expected to carry one "text" field per sample
ds = load_dataset("json", data_files="train.jsonl", split="train")

# Paged AdamW 8-bit is essential for the VRAM budget
training_args = SFTConfig(
  output_dir="./70b-lora", num_train_epochs=2,
  per_device_train_batch_size=1, gradient_accumulation_steps=8,
  learning_rate=1e-4, bf16=True, optim="paged_adamw_8bit",
  gradient_checkpointing=True,
  gradient_checkpointing_kwargs={"use_reentrant": False},
  max_grad_norm=0.3, max_seq_length=1024,
  dataset_text_field="text",
  logging_steps=5, save_steps=100, save_total_limit=3,
  warmup_ratio=0.05, lr_scheduler_type="cosine",
  report_to="tensorboard",
)

# SFTTrainer applies the LoRA adapter itself when given peft_config,
# so the model is not wrapped with get_peft_model() beforehand
trainer = SFTTrainer(model=model, tokenizer=tok, train_dataset=ds,
                     peft_config=peft_config, args=training_args)
trainer.train()
trainer.model.save_pretrained("./70b-lora/adapter")

Throughput on this configuration is approximately 1,800 training tokens per second. A 50,000-sample dataset truncated to the 1,024-token cap completes a 2-epoch run in roughly 16 hours of wall clock; the same job on an H100 80 GB at ~6,500 tok/s with full BF16 finishes in around 4 hours but costs 5-6x as much in cloud-rental hours.
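For planning your own datasets, wall clock is just total training tokens over sustained throughput. A rough sketch using the figures above (real runs add checkpointing and logging overhead on top):

def estimate_wall_clock_hours(num_samples, avg_tokens_per_sample, epochs, tokens_per_second):
    # Total training tokens divided by sustained throughput, in hours
    total_tokens = num_samples * avg_tokens_per_sample * epochs
    return total_tokens / tokens_per_second / 3600

# 50k samples truncated to the 1,024-token cap, 2 epochs, ~1,800 tok/s
print(f"{estimate_wall_clock_hours(50_000, 1_024, 2, 1_800):.1f} h")  # ~15.8 h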

Run, monitor and verify

Launch with logging redirected and a separate monitoring shell.

nohup python train_70b_qlora.py > train.log 2>&1 &
nvtop
watch -n 5 'tail -n 20 train.log'
tensorboard --logdir 70b-lora --port 6006 --bind_all

| Signal | 8B QLoRA | 32B QLoRA | 70B QLoRA |
| --- | --- | --- | --- |
| VRAM | ~12 GB | ~22 GB | ~23.4 GB |
| GPU utilisation | 96-99% | 96-99% | 92-97% |
| Power | ~410 W | ~420 W | ~430 W |
| Throughput | ~3,800 tok/s | ~820 tok/s | ~1,800 tok/s |
| Loss after 100 steps | 1.4-1.8 | 1.5-1.9 | 1.6-2.0 |

If GPU utilisation sits below 90% on the 70B, the bottleneck is almost always paged optimiser eviction; check that PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is exported and that the host has at least 96 GB system RAM available for the optimiser swap.
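A quick sanity check of both conditions from the training host, as a sketch (Linux-only; it reads /proc/meminfo so nothing extra needs installing):

import os

# 1. Is the allocator config actually exported into this process?
print("PYTORCH_CUDA_ALLOC_CONF =", os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "<unset>"))

# 2. Is there enough free host RAM for the paged optimiser (target: >= 96 GB)?
with open("/proc/meminfo") as f:
    meminfo = dict(line.split(":", 1) for line in f)
available_gb = int(meminfo["MemAvailable"].split()[0]) / 1024 / 1024
print(f"MemAvailable: {available_gb:.0f} GB", "(OK)" if available_gb >= 96 else "(too low)")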

Wall-clock and throughput

Times below are measured on the GigaGPU 4090 tier with NVMe-backed dataset, 50,000 samples averaging 1.2k tokens.

| Model | Tokens/sec | Time per epoch | 2-epoch wall clock | 3-epoch wall clock |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | 3,800 | 4.4 h | ~9 h | ~13 h |
| Qwen 2.5 14B | 2,100 | 7.9 h | ~16 h | ~24 h |
| Qwen 2.5 32B | 820 | 20 h | ~40 h | ~60 h |
| Llama 3.1 70B | 1,800 | ~9 h* | ~16 h* | ~25 h* |

* The 70B QLoRA runs faster than the 32B in this table because we drop sequence length to 1024 and rank to 16. If you push the 70B to seq 2048, throughput halves to ~900 tok/s and the 2-epoch run stretches into multiple days. Either way, dedicated hosting beats per-hour cloud rentals decisively for runs of this length; see the 12-month ROI analysis.

Production gotchas

  1. NF4 quantisation drift: NF4 is lossy; expect a 0.5-1.5 point regression on MMLU versus a full BF16 fine-tune of the same data. For most chat and tool-use tasks this is invisible, but for strict factual eval workloads benchmark before deploying.
  2. Adapter merge: merge_and_unload() on a QLoRA model dequantises to BF16 first, briefly spiking VRAM by 30+ GB and OOMing on the 4090. For the 70B, merge on the CPU instead by reloading the base model in BF16 on the host and attaching the adapter there; see the sketch after this list.
  3. Resume from checkpoint: bitsandbytes optimiser state does not always restore cleanly across version bumps. Pin your environment for the entire run.
  4. Checkpoint disk usage: a 70B QLoRA checkpoint with full optimiser state is ~7 GB. save_total_limit=3 keeps a 2 TB NVMe drive from filling overnight.
  5. Flash-attn build with mismatched CUDA: install flash-attn with the same CUDA toolkit your PyTorch wheel was built against. Mixing 11.8 and 12.4 produces silent NaN losses on the first backward pass.
  6. System RAM under-provisioned: paged AdamW pages to host RAM. The 70B run needs at least 96 GB free or the swap path falls back to disk and throughput drops 5x. If your box ships with 64 GB only, drop to 32B or order the larger RAM tier; see monthly hosting cost.
  7. Thermal throttling: a 70B QLoRA run pulls 430 W for 12+ hours straight. Verify thermal performance ahead of time; a hot-spot above 92 C will throttle clocks and add hours to the wall clock.
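For gotcha 2, a sketch of the CPU-side merge, assuming the adapter landed in ./70b-lora/adapter as in the training script above. The base model is reloaded in BF16 entirely on the host, so plan for roughly 140 GB of free system RAM (or RAM plus swap) for the 70B:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-70B-Instruct"

# Reload the base in BF16 on the CPU -- no 4-bit weights involved,
# so nothing has to be dequantised on the 24 GB card
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map={"": "cpu"},
    low_cpu_mem_usage=True,
)

# Attach the trained adapter and fold it into the base weights
model = PeftModel.from_pretrained(base, "./70b-lora/adapter")
merged = model.merge_and_unload()

# The merged BF16 checkpoint is ~140 GB on disk for the 70B
merged.save_pretrained("./70b-merged", safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained("./70b-merged")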

Verdict

QLoRA on a 4090 is the cheapest credible way to put a 70B fine-tune through training right now. NF4 plus paged AdamW 8-bit lets a single Ada card train models that previously needed an A100 80 GB or H100, at a wall clock of 16-25 hours for a 2-3 epoch run on 50k samples. For 7B-32B QLoRA it is even more comfortable, finishing in single-digit hours. If you need full BF16 fine-tuning of 70B-class models, you are looking at multi-card or H100 rentals; for everything else, the 4090 is the right unit. Compare to the standard LoRA fine-tune guide when your model fits in BF16, and look at the FP8 deployment guide for serving the merged result.

Train a 70B adapter on a single GPU

QLoRA on the RTX 4090 24GB at a flat monthly rate. UK dedicated hosting.

Order the RTX 4090 24GB

See also: LoRA fine-tune guide, fine-tune throughput, AWQ quantisation guide, Llama 70B INT4 deployment, vLLM setup, spec breakdown, fine-tune box use case, multi-card pairing.
