
RTX 4090 24GB as a Dedicated Fine-Tuning Box: LoRA, QLoRA, Unsloth, Memory Maths

Production playbook for the RTX 4090 24GB as a dedicated fine-tune box: LoRA on 14B FP16, QLoRA on 70B with paged AdamW, Unsloth's 1.7-1.9x uplift, full memory budgets, throughput tables, recipe choices and gotchas.

The RTX 4090 24GB dedicated server is currently the best-value single-GPU fine-tuning box on the market. Ada AD102’s 16,384 CUDA cores, 24 GB GDDR6X, 1008 GB/s memory bandwidth and native FP8 tensor cores comfortably handle LoRA on models up to 14B at FP16, QLoRA on models up to 70B with paged optimisers and Unsloth’s hand-tuned Triton kernels for a 1.7-1.9x baseline uplift. This article is the production playbook: scope, throughput tables, recipe choices, full memory budgets with a worked QLoRA 70B example, scaling triggers, hidden costs, gotchas. Wider hardware menu on dedicated GPU hosting.

The named workloads: SFT, instruction tuning, domain adaptation

Three concrete reference workloads frame this article:

  • Instruction tuning: Llama 3 8B on 50,000 in-house style examples (sequence 2048, effective batch 8) for brand voice transfer; target wall-clock under 4 hours so iteration is daily.
  • Domain adaptation: Qwen 2.5 14B on 30,000 legal-corpus examples at sequence 2048; target overnight wall-clock.
  • QLoRA at scale: Llama 3 70B for a high-quality vertical assistant on 50,000 examples at sequence 2048; target weekend wall-clock so a monthly re-train is feasible.

The 4090 covers all three on a single card with the right recipe.

Why the 4090 fits this brief

FP16 LoRA on 8-14B models fits in 22-23 GB with gradient checkpointing. NF4 QLoRA on 70B fits because base weights drop to ~38 GB, and the active working set (FlashAttention activations plus paged AdamW state) stays inside 22 GB at any moment, with the remainder spilling to CPU pinned memory over PCIe. Native FP8 helps inference benchmarks, but for fine-tune compute bf16 is the right default on Ada. Cross-checked against the spec breakdown and the fine-tune throughput page.

Fine-tuning scope on 24 GB

The 4090’s 24 GB is enough to LoRA-fine-tune any model up to ~14B at FP16, and QLoRA-fine-tune up to 70B with NF4 weights and paged AdamW. Full-parameter SFT of larger models is out of scope on a single card; for that, use a multi-GPU setup or move to 5090 32 GB. Continued pre-training in short bursts on small (1-3B) models is feasible with bf16 and 8-bit Adam.

| Method | Model size ceiling | Practical notes |
|---|---|---|
| LoRA FP16 / bf16 | ~14B | Comfortable; rank 16-64; gradient checkpointing recommended above seq 2048 |
| QLoRA NF4 | ~70B | Paged AdamW essential; FlashAttention 2 essential; double-quant on |
| Full SFT FP16 | ~3B | Tight; offload optimiser to CPU; 8-bit Adam |
| Continued pre-train | ~3B | Use 8-bit Adam plus grad checkpointing; small batch |
| DPO / RLHF | ~13B (LoRA) | Reference policy LoRA is critical to fit two models; KTO is lighter |
| DoRA | ~14B | Slightly more VRAM than LoRA; small quality lift |
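As a sanity check on why LoRA adapters are so cheap in VRAM, the trainable-parameter count follows directly from the model's layer shapes. The sketch below uses Llama 3 8B's published dimensions (hidden 4096, GQA KV dim 1024, MLP 14336, 32 layers); the helper function is ours, not from any library.

```python
def lora_param_count(rank, layers, shapes):
    """LoRA adds A (r x d_in) and B (d_out x r) per target matrix,
    i.e. r * (d_in + d_out) trainable parameters each."""
    return layers * sum(rank * (d_in + d_out) for d_in, d_out in shapes)

# Llama 3 8B target-module shapes (d_in, d_out): q/o are 4096x4096,
# k/v shrink to 1024 under GQA, MLP projections use 14336.
shapes = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096),
          (4096, 14336), (4096, 14336), (14336, 4096)]
n = lora_param_count(16, 32, shapes)
print(f"{n / 1e6:.1f}M trainable params")  # ~42M, ~0.5% of the 8B base
```

At bf16 weights plus fp32 AdamW moments, ~42M trainable parameters land in the few-hundred-megabyte range, which is why the adapter line in the memory budget later in this article is so small.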

Throughput numbers and time-to-train

| Recipe | Model | Seq len | Batch | Tokens/s | Hours per epoch (50k samples) |
|---|---|---|---|---|---|
| LoRA bf16 | Llama 3 8B | 2048 | 8 | ~18,000 | ~1.6 |
| LoRA bf16 | Mistral 7B | 2048 | 8 | ~21,000 | ~1.4 |
| LoRA bf16 | Qwen 2.5 14B | 2048 | 2 | ~6,400 | ~4.4 |
| QLoRA NF4 | Llama 3 8B | 2048 | 8 | ~14,500 | ~2.0 |
| QLoRA NF4 | Llama 3 70B | 2048 | 1 | ~1,800 | ~16 |
| QLoRA NF4 | Llama 3 70B | 4096 | 1 | ~1,200 | ~24 |
| Unsloth LoRA | Llama 3 8B | 2048 | 8 | ~32,000 (1.78x) | ~0.9 |
| Unsloth QLoRA | Llama 3 70B | 2048 | 1 | ~3,200 (1.78x) | ~9 |

Unsloth’s hand-tuned Triton kernels deliver 1.7-1.9x baseline throughput on the 4090 for supported models with no quality regression. A 50k-sample LoRA on Llama 3 8B at Unsloth speed completes in under 2 hours; a 50k-sample QLoRA on Llama 3 70B in about 16 hours baseline or 9 hours under Unsloth. Full sweep including LongLoRA, DoRA and rank/sequence-length curves on the fine-tune throughput page and the QLoRA tutorial.
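The hours-per-epoch column is just tokens divided by throughput. A minimal sketch, using the dataset size and throughput figures from the table above:

```python
def hours_per_epoch(samples, seq_len, tokens_per_s):
    # total tokens / sustained throughput, converted to hours
    return samples * seq_len / tokens_per_s / 3600

base = hours_per_epoch(50_000, 2048, 18_000)     # LoRA bf16, Llama 3 8B
unsloth = hours_per_epoch(50_000, 2048, 32_000)  # same job under Unsloth
print(f"baseline {base:.1f} h, Unsloth {unsloth:.1f} h")  # 1.6 h vs 0.9 h
```

Swap in any row's tokens/s figure to reproduce the rest of the column.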

Recipe choices: LoRA vs QLoRA vs Unsloth

LoRA bf16 is the best quality-to-time ratio for models up to 14B. Use rank 16 for instruction-following style transfer, rank 32-64 for domain knowledge. QLoRA NF4 is the only practical way to touch 70B on a single 4090; quality is within 0.5-1.5 points of FP16 LoRA on most benchmarks at one-fifth the VRAM. Unsloth is a drop-in 1.7-1.9x speedup with no accuracy loss for supported models (Llama, Mistral, Qwen, Gemma); use it whenever the model is on the supported list.

| Recipe | VRAM efficiency | Quality | Speed | When to choose |
|---|---|---|---|---|
| LoRA bf16 rank 16 | Good | Excellent for style | Baseline | Style transfer, instruction tuning |
| LoRA bf16 rank 64 | Tight on 14B | Best for domain knowledge | ~70% of rank-16 | Domain adaptation, terminology |
| QLoRA NF4 rank 16 | Best | ~0.5-1.5 pts under LoRA | ~80% of LoRA same model | Bigger model on small VRAM |
| Unsloth LoRA | Same as LoRA | Identical | 1.7-1.9x | Always if supported |
| DoRA | ~10% more than LoRA | +0.3-0.7 pts | ~10% slower | When quality matters more than time |

Memory budget worked examples

LoRA on Llama 3 8B bf16, sequence 2048, batch 8, rank 16:

| Component | VRAM |
|---|---|
| Base weights (bf16) | 15.0 GB |
| LoRA adapters + optimiser states | 0.4 GB |
| Activations (grad ckpt on) | 4.8 GB |
| Workspace + cuBLAS | 1.6 GB |
| CUDA + driver | 0.6 GB |
| Total | ~22.4 GB |
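The base-weights line is just parameter count times two bytes, measured in GiB. A quick check of the whole budget, assuming Llama 3 8B's ~8.03B parameters (the other component figures are taken from the table above):

```python
GIB = 2**30

base = 8.03e9 * 2 / GIB          # bf16 weights: 2 bytes/param, ~15.0 GiB
components = {
    "base weights (bf16)": base,
    "LoRA adapters + optimiser": 0.4,
    "activations (grad ckpt on)": 4.8,
    "workspace + cuBLAS": 1.6,
    "CUDA + driver": 0.6,
}
total = sum(components.values())
print(f"base {base:.1f} GiB, total {total:.1f} GiB of 24")
```

The ~1.6 GiB of headroom is what absorbs fragmentation and transient cuBLAS workspaces; push sequence length or rank and it disappears first.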

For QLoRA on 70B the base drops to ~38 GB at NF4, but the trick is that FlashAttention 2 plus aggressive grad checkpointing keep activations under 4 GB; with paged AdamW spilling optimiser state to CPU pinned memory, the working set fits in 22 GB at any moment. PCIe Gen4 x16 bandwidth (~32 GB/s) keeps spill latency from dominating wall-clock.
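The ~38 GB NF4 figure can be roughly reconstructed from the architecture: quantised linear layers cost about 0.52 bytes per parameter (4 bits plus block-wise absmax constants after double quant), while the embeddings and LM head stay in bf16. The layer shapes below are Llama 3 70B's published config; the per-parameter overhead is an approximation, so expect the result to land within a gigabyte or two of the quoted figure.

```python
GIB = 2**30

linear_params = 80 * (2 * 8192 * 8192      # q_proj + o_proj
                      + 2 * 8192 * 1024    # k_proj + v_proj (GQA)
                      + 3 * 8192 * 28672)  # gate/up/down MLP projections
embed_params = 2 * 128_256 * 8192          # embeddings + untied lm_head

nf4_bytes = linear_params * 0.52           # ~4.16 bits/param incl. quant constants
bf16_bytes = embed_params * 2              # kept in bf16, not quantised
print(f"~{(nf4_bytes + bf16_bytes) / GIB:.0f} GiB at rest")
```

The gap between that at-rest footprint and the 24 GB card is exactly what paged AdamW and pinned-memory spill cover.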

QLoRA on Llama 70B: end-to-end recipe

The reference QLoRA recipe for Llama 3 70B at sequence 2048 and batch 1 with gradient accumulation 8 (effective batch 8). Paged AdamW 8-bit is essential; without it the optimiser state OOMs at step 1.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# NF4 with double quantisation; compute in bf16 (native on Ada)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms, enables input grads
model.gradient_checkpointing_enable()           # non-negotiable at seq 2048 on 24 GB
peft_cfg = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg)
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch 8
    learning_rate=1e-4, bf16=True,
    optim="paged_adamw_8bit",        # pages optimiser state to CPU pinned memory
    gradient_checkpointing=True, max_grad_norm=0.3,
)

Full walkthrough with dataset prep, evaluation and adapter export on the QLoRA tutorial; LoRA equivalent on the LoRA tutorial.

Cost per epoch vs cloud and managed alternatives

At ~18,000 tokens/sec on Llama 3 8B LoRA (32,000 with Unsloth), an epoch over 50,000 samples (~100 M tokens at sequence 2048) takes about 95 minutes baseline or 53 minutes under Unsloth. A complete 3-epoch SFT job finishes inside 5 hours baseline. The 4090 dedicated runs ~£550/month flat; that is roughly £0.76/hour at 100% utilisation, but in practice fine-tune jobs are bursty so the effective per-job cost is the wall-clock fraction of monthly rental.
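The cost-share figures derive from the flat monthly rate: a job costs its wall-clock share of the month. A minimal sketch, using the £550/month rate from this article and a ~730-hour average month (which rounds the hourly rate a penny under the figure above):

```python
MONTHLY_GBP = 550
HOURS_PER_MONTH = 730  # average month

rate = MONTHLY_GBP / HOURS_PER_MONTH  # hourly rate at 100% utilisation

def job_cost(wall_hours):
    # bursty jobs only "consume" their wall-clock fraction of the rental
    return wall_hours * rate

print(f"£{rate:.2f}/h; 5 h SFT job ≈ £{job_cost(5):.2f}")
```

The same function reproduces the ~£10 and ~£12 rows for the 13-hour and 16-hour jobs in the table.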

| Job | 4090 dedicated wall | 4090 cost share | AWS g6.4xlarge ($1.32/h) | Together fine-tune ($/M tok) | OpenAI fine-tune ($25/M) |
|---|---|---|---|---|---|
| 50k SFT Llama 3 8B (3 epoch) | ~5 h baseline / 2.7 h Unsloth | ~£3.80 | ~$6.60 | ~$8 (Together $0.40/M for 8B) | ~$2,500 |
| 50k SFT Qwen 14B | ~13 h | ~£10 | ~$17 | ~$15 | n/a |
| 50k QLoRA Llama 70B | ~16 h baseline / 9 h Unsloth | ~£12 | ~$21 (and far slower on the L4) | ~$60 | n/a |

Hidden costs

  • Engineer time: ~8 hours setup, ~1 hour/week ongoing maintenance. At £80/hour blended that is ~£640 setup plus £4,160/year maintenance.
  • Data prep: not a GPU cost but typically 30-50% of total project time. Budget for it explicitly.
  • Eval harness: ~1-2 days to build a reproducible eval; pays back tenfold in iteration speed.
  • Storage for checkpoints: included in the dedicated server (NVMe is bundled), unlike cloud where checkpoint S3 costs add up.

Full 12-month TCO comparison on the ROI analysis; cost-per-job comparisons on the monthly hosting cost page.

Production gotchas and verdict

Production gotchas

  1. QLoRA without paged AdamW OOMs at step 1: the optimiser state for 70B does not fit even at NF4 unless paged. Always set optim="paged_adamw_8bit".
  2. Gradient checkpointing turned off “for speed” eats 4-6 GB activations: above seq 1024 it is non-negotiable; the speed cost is ~15% but the VRAM cost is decisive.
  3. FlashAttention 2 version drift: use flash-attn 2.6.3 or later on Ada and pin the version; older releases silently fall back to slower kernels.
  4. fp16 with Llama 3 needs constant loss-scale fiddling: use bf16 not fp16 on Ada; native bf16 support eliminates the loss-scale dance.
  5. Unsloth model coverage gaps: the supported list is narrow (Llama, Mistral, Qwen, Gemma). Custom architectures fall back to baseline; verify before promising the 1.8x.
  6. LoRA target modules wrong for the model: copying a Llama target list to Mistral or Qwen subtly mistrains; always use the model’s published target list.
  7. Power-cap mismatch slowing training: leave the power limit at the stock 450 W (nvidia-smi -pl 450) for fine-tune jobs; a 400 W cap helps inference per-watt but costs ~10% in training throughput.

Verdict

For LoRA fine-tuning of any model up to 14B and QLoRA fine-tuning up to 70B, a single 4090 dedicated box is the cheapest credible production option in 2026. A daily LoRA iteration on 8B is a 1-2 hour job under Unsloth; a weekly QLoRA on 70B is a single overnight run. Above 14B FP16 LoRA or for full SFT past 3B, evaluate the 5090 32 GB or multi-GPU. For one-off experiments at sub-50k samples, hosted fine-tune APIs are still convenient; for any iterative workflow the dedicated box pays back inside a month.

Dedicated fine-tune box, no queues, flat monthly

LoRA up to 14B and QLoRA up to 70B on a single card. UK dedicated hosting.

Order the RTX 4090 24GB

See also: LoRA tutorial, QLoRA tutorial, throughput tables, multi-tenant SaaS, monthly cost, 70B INT4 deployment, ROI analysis.
