The RTX 4090 24GB dedicated server is currently the best-value single-GPU fine-tuning box on the market. Ada AD102’s 16,384 CUDA cores, 24 GB of GDDR6X at 1,008 GB/s and native FP8 tensor cores comfortably handle LoRA on models up to 14B at FP16, and QLoRA on models up to 70B with paged optimisers; Unsloth’s hand-tuned Triton kernels add a further 1.7-1.9x uplift over baseline. This article is the production playbook: scope, throughput tables, recipe choices, full memory budgets with a worked QLoRA 70B example, scaling triggers, hidden costs and gotchas. Wider hardware menu on dedicated GPU hosting.
Contents
- The named workloads: SFT, instruction tuning, domain adaptation
- Fine-tuning scope on 24 GB
- Throughput numbers and time-to-train
- Recipe choices: LoRA vs QLoRA vs Unsloth
- Memory budget worked examples
- QLoRA on Llama 70B: end-to-end recipe
- Cost per epoch vs cloud and managed alternatives
- Production gotchas and verdict
The named workloads: SFT, instruction tuning, domain adaptation
Three concrete reference workloads frame this article:
- Instruction-tuning Llama 3 8B on 50,000 in-house style examples (sequence 2048, effective batch 8) for brand voice transfer; target wall-clock under 4 hours so iteration is daily.
- Domain adaptation of Qwen 2.5 14B on 30,000 legal-corpus examples at sequence 2048; target an overnight wall-clock.
- QLoRA on Llama 3 70B for a high-quality vertical assistant on 50,000 examples at sequence 2048; target a weekend wall-clock so a re-train is feasible monthly.
The 4090 covers all three on a single card with the right recipe.
Why the 4090 fits this brief
FP16 LoRA on 8-14B models fits in 22-23 GB with gradient checkpointing. NF4 QLoRA on 70B fits because the base weights drop to ~38 GB and only the active working set (FlashAttention activations plus the paged AdamW state) has to live inside 22 GB at any moment; the rest spills to CPU pinned memory over PCIe. Native FP8 helps inference benchmarks, but for fine-tune compute bf16 is the right default on Ada. Figures are cross-checked against the spec breakdown and the fine-tune throughput page.
Fine-tuning scope on 24 GB
The 4090’s 24 GB is enough to LoRA-fine-tune any model up to ~14B at FP16, and QLoRA-fine-tune up to 70B with NF4 weights and paged AdamW. Full-parameter SFT of larger models is out of scope on a single card; for that, use a multi-GPU setup or move to 5090 32 GB. Continued pre-training in short bursts on small (1-3B) models is feasible with bf16 and 8-bit Adam.
| Method | Model size ceiling | Practical notes |
|---|---|---|
| LoRA FP16 / bf16 | ~14B | Comfortable; rank 16-64; gradient checkpointing recommended above seq 2048 |
| QLoRA NF4 | ~70B | Paged AdamW essential; FlashAttention 2 essential; double-quant on |
| Full SFT FP16 | ~3B | Tight; offload optimiser to CPU; 8-bit Adam |
| Continued pre-train | ~3B | Use 8-bit Adam plus grad checkpointing; small batch |
| DPO / RLHF | ~13B (LoRA) | Reference policy LoRA is critical to fit two models; KTO is lighter |
| DoRA | ~14B | Slightly more VRAM than LoRA; small quality lift |
Throughput numbers and time-to-train
| Recipe | Model | Seq len | Batch | Tokens/s | Hours per epoch (50k samples) |
|---|---|---|---|---|---|
| LoRA bf16 | Llama 3 8B | 2048 | 8 | ~18,000 | ~1.6 |
| LoRA bf16 | Mistral 7B | 2048 | 8 | ~21,000 | ~1.4 |
| LoRA bf16 | Qwen 2.5 14B | 2048 | 2 | ~6,400 | ~4.4 |
| QLoRA NF4 | Llama 3 8B | 2048 | 8 | ~14,500 | ~2.0 |
| QLoRA NF4 | Llama 3 70B | 2048 | 1 | ~1,800 | ~16 |
| QLoRA NF4 | Llama 3 70B | 4096 | 1 | ~1,200 | ~24 |
| Unsloth LoRA | Llama 3 8B | 2048 | 8 | ~32,000 (1.78x) | ~0.9 |
| Unsloth QLoRA | Llama 3 70B | 2048 | 1 | ~3,200 (1.78x) | ~9 |
Unsloth’s hand-tuned Triton kernels deliver 1.7-1.9x baseline throughput on the 4090 for supported models with no quality regression. A 50k-sample LoRA epoch on Llama 3 8B completes in under an hour at Unsloth speed; a 50k-sample QLoRA epoch on Llama 3 70B takes about 16 hours baseline or 9 hours under Unsloth. Full sweep including LongLoRA, DoRA and rank/sequence-length curves on the fine-tune throughput page and the QLoRA tutorial.
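For models on the supported list, switching to Unsloth is a few lines of setup. A minimal sketch, assuming the `unsloth` package is installed; the model name and hyperparameters are illustrative, matching the rank-16 LoRA row above rather than a tuned recipe:

```python
# Minimal Unsloth LoRA setup (sketch; requires a model on Unsloth's supported list).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=False,          # set True for the QLoRA variant
)

# Attach rank-16 adapters; Unsloth supplies the fused Triton kernels and its own
# memory-efficient gradient checkpointing.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```

The returned model then drops into the same Trainer/SFTTrainer flow as a stock PEFT model.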
Recipe choices: LoRA vs QLoRA vs Unsloth
LoRA bf16 gives the best quality-to-time ratio for models up to 14B. Use rank 16 for instruction-following style transfer, rank 32-64 for domain knowledge. QLoRA NF4 is the only practical way to touch 70B on a single 4090; quality lands within 0.5-1.5 points of FP16 LoRA on most benchmarks, at roughly one-fifth the VRAM. Unsloth is a drop-in 1.7-1.9x speedup with no accuracy loss for supported models (Llama, Mistral, Qwen, Gemma); use it whenever the model is on the supported list.
| Recipe | VRAM efficiency | Quality | Speed | When to choose |
|---|---|---|---|---|
| LoRA bf16 rank 16 | Good | Excellent for style | Baseline | Style transfer, instruction tuning |
| LoRA bf16 rank 64 | Tight on 14B | Best for domain knowledge | ~70% of rank-16 | Domain adaptation, terminology |
| QLoRA NF4 rank 16 | Best | ~0.5-1.5 pts under LoRA | ~80% of LoRA same model | Bigger model on small VRAM |
| Unsloth LoRA | Same as LoRA | Identical | 1.7-1.9x | Always if supported |
| DoRA | ~10% more than LoRA | +0.3-0.7 pts | ~10% slower | When quality matters more than time |
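For the 14B domain-adaptation workload, the higher rank is the main change from a stock LoRA run. A minimal sketch, assuming Qwen 2.5 14B and FlashAttention 2 installed; model name, rank and learning rate are illustrative, not a tuned recipe:

```python
# LoRA bf16 at rank 64 for domain adaptation (sketch, not a tuned recipe).
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # needs flash-attn installed
    device_map="auto",
)
model.gradient_checkpointing_enable()          # non-negotiable at 14B / seq 2048 on 24 GB

peft_cfg = LoraConfig(
    r=64, lora_alpha=128,                      # higher rank for domain knowledge
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg)

args = TrainingArguments(
    output_dir="qwen14b-domain-lora",
    per_device_train_batch_size=2, gradient_accumulation_steps=4,  # effective batch 8
    learning_rate=1e-4, bf16=True, optim="adamw_bnb_8bit",
    gradient_checkpointing=True,
)
```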
Memory budget worked examples
LoRA on Llama 3 8B bf16, sequence 2048, batch 8, rank 16:
| Component | VRAM |
|---|---|
| Base weights (bf16) | 15.0 GB |
| LoRA adapters + optimiser states | 0.4 GB |
| Activations (grad ckpt on) | 4.8 GB |
| Workspace + cuBLAS | 1.6 GB |
| CUDA + driver | 0.6 GB |
| Total | ~22.4 GB |
For QLoRA on 70B the base drops to ~38 GB at NF4, but the trick is that FlashAttention 2 plus aggressive grad checkpointing keep activations under 4 GB; with paged AdamW spilling optimiser state to CPU pinned memory, the working set fits in 22 GB at any moment. PCIe Gen4 x16 bandwidth (~32 GB/s) keeps spill latency from dominating wall-clock.
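The arithmetic behind both budgets is easy to sanity-check before launching a job. A back-of-envelope sketch; the per-component constants are the approximations from this article’s tables, not measurements:

```python
# Rough VRAM budget check in GiB, using the approximations quoted above.
GIB = 2**30

def lora_bf16_budget(params_billion, adapter=0.4, activations=4.8,
                     workspace=1.6, runtime=0.6):
    """Sum of the LoRA bf16 components from the table above."""
    weights = params_billion * 1e9 * 2 / GIB          # bf16 = 2 bytes per parameter
    return weights + adapter + activations + workspace + runtime

def qlora_nf4_base(params_billion):
    """NF4 base weights only: 4 bits per parameter plus ~5% for quantisation
    constants; embeddings/norms stay in 16-bit and add a couple more GiB on 70B."""
    return params_billion * 1e9 * 0.5 * 1.05 / GIB

print(f"Llama 3 8B  LoRA bf16 total : ~{lora_bf16_budget(8.03):.1f} GiB")   # ~22.4
print(f"Llama 3 70B NF4 base weights: ~{qlora_nf4_base(70.6):.1f} GiB")     # ~34.5 + 16-bit layers
```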
QLoRA on Llama 70B: end-to-end recipe
The reference QLoRA recipe below targets Llama 3 70B at sequence 2048, batch 1, gradient accumulation 8 (effective batch 8). Paged 8-bit AdamW is essential; without it the optimiser state OOMs at step 1.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# NF4 quantisation with double quantisation; compute in bf16 on Ada.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype="bfloat16", bnb_4bit_use_double_quant=True)

# device_map="auto" keeps what fits on the GPU and spills the rest to CPU memory.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct",
                                             quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()

# Rank-16 adapters on every attention and MLP projection.
peft_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                      "gate_proj", "up_proj", "down_proj"],
                      bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, peft_cfg)

# Paged 8-bit AdamW is what keeps the optimiser state from OOMing at step 1.
args = TrainingArguments(output_dir="llama70b-qlora",
                         per_device_train_batch_size=1, gradient_accumulation_steps=8,
                         learning_rate=1e-4, bf16=True, optim="paged_adamw_8bit",
                         gradient_checkpointing=True, max_grad_norm=0.3)
```
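The recipe stops at the training arguments; wiring it into an actual run only needs a dataset and a collator. A minimal continuation, assuming a pre-tokenised dataset named `train_ds` (hypothetical) with `input_ids` and `attention_mask` columns and the `tokenizer` already loaded:

```python
# Sketch of the training-loop hookup; `train_ds` and `tokenizer` are assumed to exist.
from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    # Causal-LM collator: pads the batch and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama70b-qlora-adapter")   # saves only the LoRA adapter (a few hundred MB)
```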
Full walkthrough with dataset prep, evaluation and adapter export on the QLoRA tutorial; LoRA equivalent on the LoRA tutorial.
Cost per epoch vs cloud and managed alternatives
At ~18,000 tokens/sec on Llama 3 8B LoRA (32,000 with Unsloth), an epoch over 50,000 samples (~100 M tokens at sequence 2048) takes about 95 minutes baseline or 53 minutes under Unsloth. A complete 3-epoch SFT job finishes inside 5 hours baseline. The 4090 dedicated server runs at ~£550/month flat; that is roughly £0.76/hour at 100% utilisation, but fine-tune jobs are bursty in practice, so the effective per-job cost is the job’s wall-clock share of the monthly rental.
| Job | 4090 dedicated wall-clock | 4090 cost share | AWS g6.4xlarge ($1.32/h) | Together fine-tune ($/M tok) | OpenAI fine-tune ($25/M tok) |
|---|---|---|---|---|---|
| 50k SFT Llama 3 8B (3 epoch) | ~5 h baseline / 2.7 h Unsloth | ~£3.80 | ~$6.60 | ~$8 (Together $0.40/M for 8B) | ~$2,500 |
| 50k SFT Qwen 14B | ~13 h | ~£10 | ~$17 | ~$15 | n/a |
| 50k QLoRA Llama 70B | ~16 h baseline / 9 h Unsloth | ~£12 | ~$21 (also won’t fit on L4!) | ~$60 | n/a |
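The 4090 cost-share column is simply each job’s wall-clock share of the flat monthly rental. A quick check, using the ~£550/month and ~£0.76/hour figures above:

```python
# Effective per-job cost on a flat-rate box = wall-clock hours x hourly share of the rent.
MONTHLY_RENT_GBP = 550.0
HOURS_PER_MONTH = 720            # 30-day month, matching the ~£0.76/hour figure above

def job_cost_gbp(wall_clock_hours):
    return wall_clock_hours * MONTHLY_RENT_GBP / HOURS_PER_MONTH

for name, hours in [("8B SFT, 3 epochs", 5), ("14B SFT", 13), ("70B QLoRA epoch", 16)]:
    print(f"{name:<20} ~£{job_cost_gbp(hours):.2f}")
```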
Hidden costs
- Engineer time: ~8 hours setup, ~1 hour/week ongoing maintenance. At £80/hour blended that is ~£640 setup plus £4,160/year maintenance.
- Data prep: not a GPU cost but typically 30-50% of total project time. Budget for it explicitly.
- Eval harness: ~1-2 days to build a reproducible eval; pays back tenfold in iteration speed.
- Storage for checkpoints: included in the dedicated server (NVMe is bundled), unlike cloud where checkpoint S3 costs add up.
Full 12-month TCO comparison on the ROI analysis; cost-per-job comparisons on the monthly hosting cost page.
Production gotchas and verdict
Production gotchas
- QLoRA without paged AdamW OOMs at step 1: the optimiser state for 70B does not fit even at NF4 unless paged. Always set `optim="paged_adamw_8bit"`.
- Gradient checkpointing turned off “for speed” eats 4-6 GB of activations: above seq 1024 it is non-negotiable; the speed cost is ~15% but the VRAM saving is decisive.
- FlashAttention 2 version drift: pin `flash-attn==2.6.3` or later for Ada; older versions silently fall back to slower kernels.
- fp16 with Llama 3 needs constant loss-scale fiddling: use bf16, not fp16, on Ada; native bf16 support eliminates the loss-scale dance.
- Unsloth model coverage gaps: the supported list is narrow (Llama, Mistral, Qwen, Gemma). Custom architectures fall back to baseline speed; verify before promising the 1.8x.
- LoRA target modules wrong for the model: copying a Llama target list to Mistral or Qwen subtly mistrains; always use the model’s published target list, or derive it from the loaded model as in the sketch after this list.
- Power-cap mismatch slowing training: leave the default `nvidia-smi -pl 450` for fine-tune jobs; a 400 W cap helps inference per-watt but costs ~10% on training throughput.
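One way to sidestep the target-modules trap is to derive the list from the model actually loaded in memory rather than copy it from another architecture. A small sketch, assuming `model` is already loaded; the helper name is made up:

```python
# List candidate LoRA target modules from the loaded model.
import torch.nn as nn

def candidate_lora_targets(model):
    names = set()
    for full_name, module in model.named_modules():
        # nn.Linear covers bf16 models; the name check also picks up bitsandbytes
        # replacements such as Linear4bit on NF4-loaded models.
        if isinstance(module, nn.Linear) or "Linear" in type(module).__name__:
            names.add(full_name.split(".")[-1])   # keep the leaf name, e.g. "q_proj"
    return sorted(names - {"lm_head"})            # lm_head is normally excluded from LoRA

print(candidate_lora_targets(model))   # e.g. ['down_proj', 'gate_proj', 'k_proj', ...]
```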
Verdict
For LoRA fine-tuning of any model up to 14B and QLoRA fine-tuning up to 70B, a single 4090 dedicated box is the cheapest credible production option in 2026. A daily LoRA iteration on 8B is a 1-2 hour job under Unsloth; a weekly QLoRA on 70B is a single overnight run. Above 14B FP16 LoRA or for full SFT past 3B, evaluate the 5090 32 GB or multi-GPU. For one-off experiments at sub-50k samples, hosted fine-tune APIs are still convenient; for any iterative workflow the dedicated box pays back inside a month.
Dedicated fine-tune box, no queues, flat monthly
LoRA up to 14B and QLoRA up to 70B on a single card. UK dedicated hosting.
Order the RTX 4090 24GB
See also: LoRA tutorial, QLoRA tutorial, throughput tables, multi-tenant SaaS, monthly cost, 70B INT4 deployment, ROI analysis.