Tutorials

QLoRA on RTX 4090 24GB: Fine-Tune Llama 3 70B in 24 GB

Production QLoRA recipes for the RTX 4090 24GB, including the full Llama 3.1 70B run with NF4, paged AdamW 8-bit, gradient checkpointing and verified throughput.

QLoRA is the trick that lets a single RTX 4090 24GB dedicated server fine-tune models that would normally demand an H100 80 GB or a multi-card A100 box. By quantising the frozen base weights to NF4 and training only a small BF16 LoRA adapter, the 24 GB of GDDR6X on the Ada AD102 stops being a hard ceiling. The economics are simple: you pay £550/month for a dedicated card and can iterate a Llama 3.1 70B adapter project to completion within a fortnight, where renting H100 hours for the same work costs $1,500-3,000. This guide gives the exact recipes we run on GigaGPU dedicated boxes for everything from a 7B fine-tune to the headline 70B fit.


Why QLoRA fits the 4090

QLoRA stores the frozen base weights in NF4 (about 4.5 bits effective once double-quantisation overhead is accounted for), keeps an 8-bit paged AdamW optimiser, and trains LoRA adapters in BF16. On a 4090 the 1008 GB/s GDDR6X bandwidth keeps NF4 dequantisation fed during the forward pass, and the 165 TFLOPS of BF16 tensor throughput handles the adapter math comfortably. The paged optimiser is the load-bearing piece: bitsandbytes pages optimiser state to CPU memory between micro-batches so that what would be a 14 GB AdamW state for a 70B model collapses to roughly 3.6 GB resident.
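Under the hood this is just an optimiser that only ever sees the adapter parameters. A minimal sketch, assuming a model already wrapped with a PEFT adapter as in the training script further down (the recipe below selects the same optimiser through the optim string, so you never construct it by hand):

import bitsandbytes as bnb

# Only the LoRA adapter weights require gradients; the NF4 base is frozen,
# so the AdamW moment buffers scale with the adapter, not the 70B base model.
trainable_params = [p for p in model.parameters() if p.requires_grad]

# Paged 8-bit AdamW: quantised optimiser state that the CUDA driver can
# evict to host memory when the 24 GB card is under pressure.
optimizer = bnb.optim.PagedAdamW8bit(trainable_params, lr=1e-4)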

| Model | NF4 base | Adapter (r=64) | Optimiser (paged) | Activations (seq 1024-4096) | Total VRAM |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 4.5 GB | 0.13 GB | 0.6 GB | 3.5 GB @ 4096 | ~9-12 GB |
| Mistral Small 22B | 11.8 GB | 0.31 GB | 1.4 GB | 2.0 GB @ 2048 | ~17 GB |
| Qwen 2.5 32B | 17.4 GB | 0.42 GB | 1.9 GB | 1.5 GB @ 2048 | ~21.8 GB |
| Llama 3.1 70B | 18.9 GB | 0.84 GB | 3.6 GB | ~0.9 GB @ 1024 | ~23.4 GB |

Memory budget on 24 GB

The 70B fit is real but tight. Activations swallow whatever VRAM you do not pre-allocate, so gradient checkpointing is mandatory, sequence length is capped at 1024 for the 70B, and per-device batch is fixed at 1 with gradient accumulation handling the effective batch. The 8B and 14B sit comfortably at sequence length 4096 with batch 4-8, leaving 5-8 GB free.

If you are seeing OOM on the 70B run, the order of escalating fixes is: drop sequence length to 768, then drop LoRA rank to 8, then drop target modules to q_proj, v_proj only. Each step costs adapter quality but recovers 1-3 GB. For broader scaling references compare the Llama 70B INT4 deployment footprint and the 70B INT4 benchmark.
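As a sketch of that last-resort configuration (illustrative values only, mirroring the escalation above; swap it into the 70B script below in place of its LoraConfig and max_seq_length):

from peft import LoraConfig

# Smallest credible adapter: rank 8, attention q/v projections only.
# Expect a measurable quality drop versus r=16 across attention + MLP.
fallback_lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none", task_type="CAUSAL_LM",
)

# Pair it with the shorter context from step one of the escalation.
MAX_SEQ_LENGTH = 768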

Environment and dependencies

Standardise on CUDA 12.4, PyTorch 2.4, bitsandbytes 0.44, transformers 4.45, peft 0.13 and trl 0.11. The bitsandbytes 0.44 release shipped the paged AdamW kernel that makes the 70B fit reliably; older versions intermittently OOM on the optimiser allocation.

export TORCH_CUDA_ARCH_LIST=8.9   # Ada is SM 8.9; scopes the flash-attn build if no prebuilt wheel matches
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.45.2 peft==0.13.0 trl==0.11.4 \
            bitsandbytes==0.44.1 accelerate==1.0.1 datasets==3.0.1 \
            flash-attn==2.6.3 --no-build-isolation
export HF_TOKEN=hf_yourtoken
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

expandable_segments:True reduces fragmentation on long runs, which matters when paged AdamW is moving optimiser state in and out of GPU memory thousands of times.
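If you launch from a notebook or a wrapper script rather than a shell, the same setting can be applied in Python, provided it happens before the CUDA allocator is first touched; a small sketch:

import os

# Must be set before torch allocates anything on the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the allocator config on purpose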

Hyperparameters that work

| Knob | 8B | 14B | 32B | 70B |
| --- | --- | --- | --- | --- |
| LoRA r | 64 | 64 | 32 | 16 |
| LoRA alpha | 128 | 128 | 64 | 32 |
| LoRA dropout | 0.05 | 0.05 | 0.05 | 0.05 |
| Target modules | all-linear | all-linear | all-linear | q,k,v,o + MLP |
| Sequence length | 4096 | 4096 | 2048 | 1024 (push to 2048 if fits) |
| Per-device batch | 8 | 4 | 2 | 1 |
| Grad accum | 2 | 4 | 8 | 8 |
| Effective batch | 16 | 16 | 16 | 8 |
| Learning rate | 2e-4 | 2e-4 | 1.5e-4 | 1e-4 |
| Warmup ratio | 0.03 | 0.03 | 0.05 | 0.05 |
| Max grad norm | 1.0 | 1.0 | 0.5 | 0.3 |
| Optimiser | paged_adamw_8bit | paged_adamw_8bit | paged_adamw_8bit | paged_adamw_8bit |

The 70B is genuinely fiddly: keep max_grad_norm=0.3 to suppress the occasional spike that comes from training a tiny adapter on a giant model, and resist the urge to push LR above 1e-4 even though the loss falls slowly at first.

Llama 70B QLoRA training script

This is the headline recipe. NF4 with double quantisation, FlashAttention 2, gradient checkpointing, paged AdamW 8-bit. Save as train_70b_qlora.py.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# NF4 with double quantisation; adapter maths runs in BF16
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True, bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(
  "meta-llama/Llama-3.1-70B-Instruct",
  quantization_config=bnb_config, device_map="auto",
  attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # KV cache is incompatible with gradient checkpointing

# r=16 over attention and MLP projections -- the 70B column of the table above
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
  target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
  bias="none", task_type="CAUSAL_LM")

# train.jsonl is expected to carry one "text" field per sample
ds = load_dataset("json", data_files="train.jsonl", split="train")

# Paged AdamW 8-bit is essential for the VRAM budget
training_args = SFTConfig(
  output_dir="./70b-lora", num_train_epochs=2,
  per_device_train_batch_size=1, gradient_accumulation_steps=8,
  learning_rate=1e-4, bf16=True, optim="paged_adamw_8bit",
  gradient_checkpointing=True,
  gradient_checkpointing_kwargs={"use_reentrant": False},
  max_grad_norm=0.3, max_seq_length=1024,
  dataset_text_field="text",
  logging_steps=5, save_steps=100, save_total_limit=3,
  warmup_ratio=0.05, lr_scheduler_type="cosine",
  report_to="tensorboard",
)

# SFTTrainer applies the LoRA adapter itself when given peft_config,
# so the model is not wrapped with get_peft_model() beforehand
trainer = SFTTrainer(model=model, tokenizer=tok, train_dataset=ds,
                     peft_config=peft_config, args=training_args)
trainer.train()
trainer.model.save_pretrained("./70b-lora/adapter")

Throughput on this configuration is approximately 1,800 training tokens per second. A 50,000-sample dataset truncated to the 1,024-token cap completes a 2-epoch run in roughly 16 hours of wall clock; the same job on an H100 80 GB at ~6,500 tok/s with full BF16 finishes in around 4 hours but costs 5-6x as much in cloud-rental hours.
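For planning your own datasets, wall clock is just total training tokens over sustained throughput. A rough sketch using the figures above (real runs add checkpointing and logging overhead on top):

def estimate_wall_clock_hours(num_samples, avg_tokens_per_sample, epochs, tokens_per_second):
    # Total training tokens divided by sustained throughput, in hours
    total_tokens = num_samples * avg_tokens_per_sample * epochs
    return total_tokens / tokens_per_second / 3600

# 50k samples truncated to the 1,024-token cap, 2 epochs, ~1,800 tok/s
print(f"{estimate_wall_clock_hours(50_000, 1_024, 2, 1_800):.1f} h")  # ~15.8 h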

Run, monitor and verify

Launch with logging redirected and a separate monitoring shell.

nohup python train_70b_qlora.py > train.log 2>&1 &
nvtop
watch -n 5 'tail -n 20 train.log'
tensorboard --logdir 70b-lora --port 6006 --bind_all

| Signal | 8B QLoRA | 32B QLoRA | 70B QLoRA |
| --- | --- | --- | --- |
| VRAM | ~12 GB | ~22 GB | ~23.4 GB |
| GPU utilisation | 96-99% | 96-99% | 92-97% |
| Power | ~410 W | ~420 W | ~430 W |
| Throughput | ~3,800 tok/s | ~820 tok/s | ~1,800 tok/s |
| Loss after 100 steps | 1.4-1.8 | 1.5-1.9 | 1.6-2.0 |

If GPU utilisation sits below 90% on the 70B, the bottleneck is almost always paged optimiser eviction; check that PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is exported and that the host has at least 96 GB system RAM available for the optimiser swap.
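A quick sanity check of both conditions from the training host, as a sketch (Linux-only; it reads /proc/meminfo so nothing extra needs installing):

import os

# 1. Is the allocator config actually exported into this process?
print("PYTORCH_CUDA_ALLOC_CONF =", os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "<unset>"))

# 2. Is there enough free host RAM for the paged optimiser (target: >= 96 GB)?
with open("/proc/meminfo") as f:
    meminfo = dict(line.split(":", 1) for line in f)
available_gb = int(meminfo["MemAvailable"].split()[0]) / 1024 / 1024
print(f"MemAvailable: {available_gb:.0f} GB", "(OK)" if available_gb >= 96 else "(too low)")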

Wall-clock and throughput

Times below are measured on the GigaGPU 4090 tier with NVMe-backed dataset, 50,000 samples averaging 1.2k tokens.

| Model | Tokens/sec | Time per epoch | 2-epoch wall clock | 3-epoch wall clock |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | 3,800 | 4.4 h | ~9 h | ~13 h |
| Qwen 2.5 14B | 2,100 | 7.9 h | ~16 h | ~24 h |
| Qwen 2.5 32B | 820 | 20 h | ~40 h | ~60 h |
| Llama 3.1 70B | 1,800 | ~9 h* | ~16 h* | ~25 h* |

* The 70B QLoRA runs faster than the 32B in this table because we drop sequence length to 1024 and rank to 16. If you push the 70B to seq 2048, throughput halves to ~900 tok/s and the 2-epoch run stretches into multiple days. Either way, dedicated hosting beats per-hour cloud rentals decisively for runs of this length; see the 12-month ROI analysis.

Production gotchas

  1. NF4 quantisation drift: NF4 is lossy; expect a 0.5-1.5 point regression on MMLU versus a full BF16 fine-tune of the same data. For most chat and tool-use tasks this is invisible, but for strict factual eval workloads benchmark before deploying.
  2. Adapter merge: merge_and_unload() on a QLoRA model dequantises to BF16 first, briefly spiking VRAM by 30+ GB and OOMing on the 4090. For the 70B, merge on the CPU instead by reloading the base model in BF16 on the host and attaching the adapter there; see the sketch after this list.
  3. Resume from checkpoint: bitsandbytes optimiser state does not always restore cleanly across version bumps. Pin your environment for the entire run.
  4. Checkpoint disk usage: a 70B QLoRA checkpoint with full optimiser state is ~7 GB. save_total_limit=3 keeps a 2 TB NVMe drive from filling overnight.
  5. Flash-attn build with mismatched CUDA: install flash-attn with the same CUDA toolkit your PyTorch wheel was built against. Mixing 11.8 and 12.4 produces silent NaN losses on the first backward pass.
  6. System RAM under-provisioned: paged AdamW pages to host RAM. The 70B run needs at least 96 GB free or the swap path falls back to disk and throughput drops 5x. If your box ships with 64 GB only, drop to 32B or order the larger RAM tier; see monthly hosting cost.
  7. Thermal throttling: a 70B QLoRA run pulls 430 W for 12+ hours straight. Verify thermal performance ahead of time; a hot-spot above 92 C will throttle clocks and add hours to the wall clock.
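For gotcha 2, a sketch of the CPU-side merge, assuming the adapter landed in ./70b-lora/adapter as in the training script above. The base model is reloaded in BF16 entirely on the host, so plan for roughly 140 GB of free system RAM (or RAM plus swap) for the 70B:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-70B-Instruct"

# Reload the base in BF16 on the CPU -- no 4-bit weights involved,
# so nothing has to be dequantised on the 24 GB card
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map={"": "cpu"},
    low_cpu_mem_usage=True,
)

# Attach the trained adapter and fold it into the base weights
model = PeftModel.from_pretrained(base, "./70b-lora/adapter")
merged = model.merge_and_unload()

# The merged BF16 checkpoint is ~140 GB on disk for the 70B
merged.save_pretrained("./70b-merged", safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained("./70b-merged")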

Verdict

QLoRA on a 4090 is the cheapest credible way to put a 70B fine-tune through training right now. NF4 plus paged AdamW 8-bit lets a single Ada card train models that previously needed an A100 80 GB or H100, at a wall clock of 16-25 hours for a 2-3 epoch run on 50k samples. For 7B-32B QLoRA it is even more comfortable, finishing in single-digit hours. If you need full BF16 fine-tuning of 70B-class models, you are looking at multi-card or H100 rentals; for everything else, the 4090 is the right unit. Compare to the standard LoRA fine-tune guide when your model fits in BF16, and look at the FP8 deployment guide for serving the merged result.

Train a 70B adapter on a single GPU

QLoRA on the RTX 4090 24GB at a flat monthly rate. UK dedicated hosting.

Order the RTX 4090 24GB

See also: LoRA fine-tune guide, fine-tune throughput, AWQ quantisation guide, Llama 70B INT4 deployment, vLLM setup, spec breakdown, fine-tune box use case, multi-card pairing.
