The RTX 4090 24GB dedicated server is currently the best-value single-GPU fine-tuning box on the market. Ada AD102’s 16,384 CUDA cores, 24 GB of GDDR6X at 1,008 GB/s and native FP8 tensor cores comfortably handle LoRA on models up to 14B at FP16, and QLoRA on models up to 70B with paged optimisers; Unsloth’s hand-tuned Triton kernels add a further 1.7-1.9x uplift over baseline. This article is the production playbook: scope, throughput tables, recipe choices, full memory budgets with a worked QLoRA 70B example, scaling triggers, hidden costs and gotchas. Wider hardware menu on dedicated GPU hosting.
Contents
- The named workloads: SFT, instruction tuning, domain adaptation
- Fine-tuning scope on 24 GB
- Throughput numbers and time-to-train
- Recipe choices: LoRA vs QLoRA vs Unsloth
- Memory budget worked examples
- QLoRA on Llama 70B: end-to-end recipe
- Cost per epoch vs cloud and managed alternatives
- Production gotchas and verdict
The named workloads: SFT, instruction tuning, domain adaptation
Three concrete reference workloads frame this article:
- Instruction-tuning Llama 3 8B on 50,000 in-house style examples (sequence 2048, effective batch 8) for brand voice transfer; target wall-clock under 4 hours so iteration is daily.
- Domain adaptation of Qwen 2.5 14B on 30,000 legal-corpus examples at sequence 2048; target an overnight wall-clock.
- QLoRA on Llama 3 70B for a high-quality vertical assistant on 50,000 examples at sequence 2048; target a weekend wall-clock so a re-train is feasible monthly.
The 4090 covers all three on a single card with the right recipe.
Why the 4090 fits this brief
FP16 LoRA on 8-14B models fits in 22-23 GB with gradient checkpointing. NF4 QLoRA on 70B fits because the base weights drop to ~38 GB and only the active working set (FlashAttention activations plus the paged AdamW state) has to live inside 22 GB at any moment; the rest spills to CPU pinned memory over PCIe. Native FP8 helps inference benchmarks, but for fine-tune compute bf16 is the right default on Ada. Figures are cross-checked against the spec breakdown and the fine-tune throughput page.
Fine-tuning scope on 24 GB
The 4090’s 24 GB is enough to LoRA-fine-tune any model up to ~14B at FP16, and QLoRA-fine-tune up to 70B with NF4 weights and paged AdamW. Full-parameter SFT of larger models is out of scope on a single card; for that, use a multi-GPU setup or move to 5090 32 GB. Continued pre-training in short bursts on small (1-3B) models is feasible with bf16 and 8-bit Adam.
| Method | Model size ceiling | Practical notes |
|---|---|---|
| LoRA FP16 / bf16 | ~14B | Comfortable; rank 16-64; gradient checkpointing recommended above seq 2048 |
| QLoRA NF4 | ~70B | Paged AdamW essential; FlashAttention 2 essential; double-quant on |
| Full SFT FP16 | ~3B | Tight; offload optimiser to CPU; 8-bit Adam |
| Continued pre-train | ~3B | Use 8-bit Adam plus grad checkpointing; small batch |
| DPO / RLHF | ~13B (LoRA) | Reference policy LoRA is critical to fit two models; KTO is lighter |
| DoRA | ~14B | Slightly more VRAM than LoRA; small quality lift |
Throughput numbers and time-to-train
| Recipe | Model | Seq len | Batch | Tokens/s | Hours per epoch (50k samples) |
|---|---|---|---|---|---|
| LoRA bf16 | Llama 3 8B | 2048 | 8 | ~18,000 | ~1.6 |
| LoRA bf16 | Mistral 7B | 2048 | 8 | ~21,000 | ~1.4 |
| LoRA bf16 | Qwen 2.5 14B | 2048 | 2 | ~6,400 | ~4.4 |
| QLoRA NF4 | Llama 3 8B | 2048 | 8 | ~14,500 | ~2.0 |
| QLoRA NF4 | Llama 3 70B | 2048 | 1 | ~1,800 | ~16 |
| QLoRA NF4 | Llama 3 70B | 4096 | 1 | ~1,200 | ~24 |
| Unsloth LoRA | Llama 3 8B | 2048 | 8 | ~32,000 (1.78x) | ~0.9 |
| Unsloth QLoRA | Llama 3 70B | 2048 | 1 | ~3,200 (1.78x) | ~9 |
Unsloth’s hand-tuned Triton kernels deliver 1.7-1.9x baseline throughput on the 4090 for supported models with no quality regression. A 50k-sample LoRA epoch on Llama 3 8B completes in under an hour at Unsloth speed; a 50k-sample QLoRA epoch on Llama 3 70B takes about 16 hours baseline or 9 hours under Unsloth. Full sweep including LongLoRA, DoRA and rank/sequence-length curves on the fine-tune throughput page and the QLoRA tutorial.
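For models on the supported list, switching to Unsloth is a few lines of setup. A minimal sketch, assuming the `unsloth` package is installed; the model name and hyperparameters are illustrative, matching the rank-16 LoRA row above rather than a tuned recipe:

```python
# Minimal Unsloth LoRA setup (sketch; requires a model on Unsloth's supported list).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=False,          # set True for the QLoRA variant
)

# Attach rank-16 adapters; Unsloth supplies the fused Triton kernels and its own
# memory-efficient gradient checkpointing.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```

The returned model then drops into the same Trainer/SFTTrainer flow as a stock PEFT model.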
Recipe choices: LoRA vs QLoRA vs Unsloth
LoRA bf16 gives the best quality-to-time ratio for models up to 14B. Use rank 16 for instruction-following style transfer, rank 32-64 for domain knowledge. QLoRA NF4 is the only practical way to touch 70B on a single 4090; quality lands within 0.5-1.5 points of FP16 LoRA on most benchmarks, at roughly one-fifth the VRAM. Unsloth is a drop-in 1.7-1.9x speedup with no accuracy loss for supported models (Llama, Mistral, Qwen, Gemma); use it whenever the model is on the supported list.
| Recipe | VRAM efficiency | Quality | Speed | When to choose |
|---|---|---|---|---|
| LoRA bf16 rank 16 | Good | Excellent for style | Baseline | Style transfer, instruction tuning |
| LoRA bf16 rank 64 | Tight on 14B | Best for domain knowledge | ~70% of rank-16 | Domain adaptation, terminology |
| QLoRA NF4 rank 16 | Best | ~0.5-1.5 pts under LoRA | ~80% of LoRA same model | Bigger model on small VRAM |
| Unsloth LoRA | Same as LoRA | Identical | 1.7-1.9x | Always if supported |
| DoRA | ~10% more than LoRA | +0.3-0.7 pts | ~10% slower | When quality matters more than time |
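For the 14B domain-adaptation workload, the higher rank is the main change from a stock LoRA run. A minimal sketch, assuming Qwen 2.5 14B and FlashAttention 2 installed; model name, rank and learning rate are illustrative, not a tuned recipe:

```python
# LoRA bf16 at rank 64 for domain adaptation (sketch, not a tuned recipe).
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # needs flash-attn installed
    device_map="auto",
)
model.gradient_checkpointing_enable()          # non-negotiable at 14B / seq 2048 on 24 GB

peft_cfg = LoraConfig(
    r=64, lora_alpha=128,                      # higher rank for domain knowledge
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg)

args = TrainingArguments(
    output_dir="qwen14b-domain-lora",
    per_device_train_batch_size=2, gradient_accumulation_steps=4,  # effective batch 8
    learning_rate=1e-4, bf16=True, optim="adamw_bnb_8bit",
    gradient_checkpointing=True,
)
```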
Memory budget worked examples
LoRA on Llama 3 8B bf16, sequence 2048, batch 8, rank 16:
| Component | VRAM |
|---|---|
| Base weights (bf16) | 15.0 GB |
| LoRA adapters + optimiser states | 0.4 GB |
| Activations (grad ckpt on) | 4.8 GB |
| Workspace + cuBLAS | 1.6 GB |
| CUDA + driver | 0.6 GB |
| Total | ~22.4 GB |
For QLoRA on 70B the base drops to ~38 GB at NF4, but the trick is that FlashAttention 2 plus aggressive grad checkpointing keep activations under 4 GB; with paged AdamW spilling optimiser state to CPU pinned memory, the working set fits in 22 GB at any moment. PCIe Gen4 x16 bandwidth (~32 GB/s) keeps spill latency from dominating wall-clock.
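The arithmetic behind both budgets is easy to sanity-check before launching a job. A back-of-envelope sketch; the per-component constants are the approximations from this article’s tables, not measurements:

```python
# Rough VRAM budget check in GiB, using the approximations quoted above.
GIB = 2**30

def lora_bf16_budget(params_billion, adapter=0.4, activations=4.8,
                     workspace=1.6, runtime=0.6):
    """Sum of the LoRA bf16 components from the table above."""
    weights = params_billion * 1e9 * 2 / GIB          # bf16 = 2 bytes per parameter
    return weights + adapter + activations + workspace + runtime

def qlora_nf4_base(params_billion):
    """NF4 base weights only: 4 bits per parameter plus ~5% for quantisation
    constants; embeddings/norms stay in 16-bit and add a couple more GiB on 70B."""
    return params_billion * 1e9 * 0.5 * 1.05 / GIB

print(f"Llama 3 8B  LoRA bf16 total : ~{lora_bf16_budget(8.03):.1f} GiB")   # ~22.4
print(f"Llama 3 70B NF4 base weights: ~{qlora_nf4_base(70.6):.1f} GiB")     # ~34.5 + 16-bit layers
```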
QLoRA on Llama 70B: end-to-end recipe
The reference QLoRA recipe below targets Llama 3 70B at sequence 2048, batch 1, gradient accumulation 8 (effective batch 8). Paged 8-bit AdamW is essential; without it the optimiser state OOMs at step 1.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# NF4 quantisation with double quantisation; compute in bf16 on Ada.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype="bfloat16", bnb_4bit_use_double_quant=True)

# device_map="auto" keeps what fits on the GPU and spills the rest to CPU memory.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct",
                                             quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()

# Rank-16 adapters on every attention and MLP projection.
peft_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                      "gate_proj", "up_proj", "down_proj"],
                      bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, peft_cfg)

# Paged 8-bit AdamW is what keeps the optimiser state from OOMing at step 1.
args = TrainingArguments(output_dir="llama70b-qlora",
                         per_device_train_batch_size=1, gradient_accumulation_steps=8,
                         learning_rate=1e-4, bf16=True, optim="paged_adamw_8bit",
                         gradient_checkpointing=True, max_grad_norm=0.3)
```
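The recipe stops at the training arguments; wiring it into an actual run only needs a dataset and a collator. A minimal continuation, assuming a pre-tokenised dataset named `train_ds` (hypothetical) with `input_ids` and `attention_mask` columns and the `tokenizer` already loaded:

```python
# Sketch of the training-loop hookup; `train_ds` and `tokenizer` are assumed to exist.
from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    # Causal-LM collator: pads the batch and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama70b-qlora-adapter")   # saves only the LoRA adapter (a few hundred MB)
```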
Full walkthrough with dataset prep, evaluation and adapter export on the QLoRA tutorial; LoRA equivalent on the LoRA tutorial.
Cost per epoch vs cloud and managed alternatives
At ~18,000 tokens/sec on Llama 3 8B LoRA (32,000 with Unsloth), an epoch over 50,000 samples (~100 M tokens at sequence 2048) takes about 95 minutes baseline or 53 minutes under Unsloth. A complete 3-epoch SFT job finishes inside 5 hours baseline. The 4090 dedicated server runs at ~£550/month flat; that is roughly £0.76/hour at 100% utilisation, but fine-tune jobs are bursty in practice, so the effective per-job cost is the job’s wall-clock share of the monthly rental.
| Job | 4090 dedicated wall-clock | 4090 cost share | AWS g6.4xlarge ($1.32/h) | Together fine-tune ($/M tok) | OpenAI fine-tune ($25/M tok) |
|---|---|---|---|---|---|
| 50k SFT Llama 3 8B (3 epoch) | ~5 h baseline / 2.7 h Unsloth | ~£3.80 | ~$6.60 | ~$8 (Together $0.40/M for 8B) | ~$2,500 |
| 50k SFT Qwen 14B | ~13 h | ~£10 | ~$17 | ~$15 | n/a |
| 50k QLoRA Llama 70B | ~16 h baseline / 9 h Unsloth | ~£12 | ~$21 (also won’t fit on L4!) | ~$60 | n/a |
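The 4090 cost-share column is simply each job’s wall-clock share of the flat monthly rental. A quick check, using the ~£550/month and ~£0.76/hour figures above:

```python
# Effective per-job cost on a flat-rate box = wall-clock hours x hourly share of the rent.
MONTHLY_RENT_GBP = 550.0
HOURS_PER_MONTH = 720            # 30-day month, matching the ~£0.76/hour figure above

def job_cost_gbp(wall_clock_hours):
    return wall_clock_hours * MONTHLY_RENT_GBP / HOURS_PER_MONTH

for name, hours in [("8B SFT, 3 epochs", 5), ("14B SFT", 13), ("70B QLoRA epoch", 16)]:
    print(f"{name:<20} ~£{job_cost_gbp(hours):.2f}")
```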
Hidden costs
- Engineer time: ~8 hours setup, ~1 hour/week ongoing maintenance. At £80/hour blended that is ~£640 setup plus £4,160/year maintenance.
- Data prep: not a GPU cost but typically 30-50% of total project time. Budget for it explicitly.
- Eval harness: ~1-2 days to build a reproducible eval; pays back tenfold in iteration speed.
- Storage for checkpoints: included in the dedicated server (NVMe is bundled), unlike cloud where checkpoint S3 costs add up.
Full 12-month TCO comparison on the ROI analysis; cost-per-job comparisons on the monthly hosting cost page.
Production gotchas and verdict
Production gotchas
- QLoRA without paged AdamW OOMs at step 1: the optimiser state for 70B does not fit even at NF4 unless paged. Always set `optim="paged_adamw_8bit"`.
- Gradient checkpointing turned off “for speed” eats 4-6 GB of activations: above seq 1024 it is non-negotiable; the speed cost is ~15% but the VRAM saving is decisive.
- FlashAttention 2 version drift: pin `flash-attn==2.6.3` or later for Ada; older versions silently fall back to slower kernels.
- fp16 with Llama 3 needs constant loss-scale fiddling: use bf16, not fp16, on Ada; native bf16 support eliminates the loss-scale dance.
- Unsloth model coverage gaps: the supported list is narrow (Llama, Mistral, Qwen, Gemma). Custom architectures fall back to baseline speed; verify before promising the 1.8x.
- LoRA target modules wrong for the model: copying a Llama target list to Mistral or Qwen subtly mistrains; always use the model’s published target list, or derive it from the loaded model as in the sketch after this list.
- Power-cap mismatch slowing training: leave the default `nvidia-smi -pl 450` for fine-tune jobs; a 400 W cap helps inference per-watt but costs ~10% on training throughput.
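One way to sidestep the target-modules trap is to derive the list from the model actually loaded in memory rather than copy it from another architecture. A small sketch, assuming `model` is already loaded; the helper name is made up:

```python
# List candidate LoRA target modules from the loaded model.
import torch.nn as nn

def candidate_lora_targets(model):
    names = set()
    for full_name, module in model.named_modules():
        # nn.Linear covers bf16 models; the name check also picks up bitsandbytes
        # replacements such as Linear4bit on NF4-loaded models.
        if isinstance(module, nn.Linear) or "Linear" in type(module).__name__:
            names.add(full_name.split(".")[-1])   # keep the leaf name, e.g. "q_proj"
    return sorted(names - {"lm_head"})            # lm_head is normally excluded from LoRA

print(candidate_lora_targets(model))   # e.g. ['down_proj', 'gate_proj', 'k_proj', ...]
```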
Verdict
For LoRA fine-tuning of any model up to 14B and QLoRA fine-tuning up to 70B, a single 4090 dedicated box is the cheapest credible production option in 2026. A daily LoRA iteration on 8B is a 1-2 hour job under Unsloth; a weekly QLoRA on 70B is a single overnight run. Above 14B FP16 LoRA or for full SFT past 3B, evaluate the 5090 32 GB or multi-GPU. For one-off experiments at sub-50k samples, hosted fine-tune APIs are still convenient; for any iterative workflow the dedicated box pays back inside a month.
Dedicated fine-tune box, no queues, flat monthly
LoRA up to 14B and QLoRA up to 70B on a single card. UK dedicated hosting.
Order the RTX 4090 24GB
See also: LoRA tutorial, QLoRA tutorial, throughput tables, multi-tenant SaaS, monthly cost, 70B INT4 deployment, ROI analysis.