An RTX 4090 24GB dedicated server is the most useful single card a small academic lab can rent today. It runs Llama 3.1 70B at AWQ INT4 for strong baseline runs, fine-tunes models up to 14B in FP16 with PEFT, supports QLoRA up to 70B for the corner cases that matter, and sits permanently available for the eval sweeps that define a research line. No HPC queue, no opaque cost accounting, no broken module system, no two-day waits to test a hyperparameter. This article surveys the workloads that fit comfortably on one card, the throughput numbers, the budget envelope, and a worked weekly thesis chapter workflow. Compare with the wider dedicated GPU hosting range.
Contents
- Why one dedicated 4090 beats a slot in a shared cluster
- Inference scope on 24 GB
- Fine-tuning scope: LoRA, QLoRA, Unsloth
- Evals, reproducibility and harness fit
- A worked weekly thesis-chapter workflow
- Capacity, scaling triggers and pitfalls
- Budget envelope and grant fit
- Verdict and decision criteria
Why one dedicated 4090 beats a slot in a shared cluster
Shared HPC slots come with queue waits of hours to days, opaque cost accounting through internal “compute units”, brittle module systems that break the week before a deadline, and the constant low-grade anxiety that someone else’s job will preempt yours. A dedicated 4090 gives you a fixed monthly cost, root access, persistent NVMe storage, the ability to leave a model loaded for weeks while you iterate on prompts and evals, and full control over the Python environment. For a postgrad doing weekly experiments, the wall-clock saving from skipping queues alone often pays for the rental. The card sits in a UK rack, so data residency is solved for studies that touch GDPR-scoped datasets.
The other quiet advantage is reproducibility. A shared cluster can change drivers, CUDA toolkits, or interconnect topologies under your feet between Tuesday and Thursday; a dedicated box stays exactly as you left it. For methods chapters where reproducibility is the whole point, that matters. See the first day checklist for the pinning playbook.
Inference scope on 24 GB
The 4090’s 24 GB and native FP8 tensor cores cover an unusually wide model range for a single consumer-class card. Llama 3.1 70B AWQ INT4 fits with FP8 KV at 16k context. Mid-range 14B-32B models run at interactive speed. Small 7-8B models hit 200+ t/s decode for batch experiments where you sweep across thousands of prompts.
| Model | Quant | VRAM | Decode t/s | Typical research use |
|---|---|---|---|---|
| Llama 3.1 70B | AWQ INT4 | 17.0 GB + KV | 22-24 | Strong baseline, judge model |
| Mixtral 8x7B | AWQ INT4 | 20.5 GB | 85 | MoE comparison |
| Qwen 2.5 32B | AWQ INT4 | 19.1 GB | 65 | Multilingual reasoning |
| Qwen 2.5 14B | AWQ INT4 | 10.2 GB | 135 | Main workhorse |
| Llama 3.1 8B | FP8 | 9.5 GB | 195 | Eval grids, ablations |
| Mistral 7B | FP8 | 8.6 GB | 215 | Sanity baselines |
| Phi-3 mini | FP8 | 4.8 GB | 480 | Throughput tests, scaling laws |
| Mistral Nemo 12B | FP8 | 13.5 GB | 145 | Mid-tier multilingual |
For deeper figures see the Llama 70B INT4 benchmark, the Llama 8B benchmark, the Qwen 32B benchmark and the prefill/decode benchmark. The 70B is best framed as a “strong baseline / judge model” rather than a high-throughput frontline; for fast eval sweeps over tens of thousands of prompts, 8B FP8 is the right pick.
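The headline fit claim — 70B AWQ INT4 weights plus an FP8 KV cache at 16k context inside 24 GB — can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming the standard Llama 3.1 70B architecture (80 layers, 8 KV heads under GQA, head dim 128) and the 17.0 GB weight figure from the table above; it ignores activation and framework overhead, which adds a further couple of GB in practice:

```python
# Rough single-card fit check: AWQ INT4 weights + FP8 KV cache.
# Architecture numbers assume Llama 3.1 70B: 80 layers, 8 KV heads (GQA),
# head_dim 128. FP8 KV = 1 byte per element.

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 1) -> float:
    """KV cache size in GiB: a K and a V tensor for every layer."""
    elems = layers * 2 * kv_heads * head_dim * context
    return elems * bytes_per_elem / 1024**3

kv = kv_cache_gib(layers=80, kv_heads=8, head_dim=128, context=16_384)
total = 17.0 + kv  # AWQ INT4 weights (from the table) + FP8 KV cache
print(f"KV cache: {kv:.1f} GiB, total: {total:.1f} GiB")  # 2.5 GiB, 19.5 GiB
```

The margin below 24 GB is what lets the same card hold a second small model or a longer context window alongside the 70B.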
Fine-tuning scope: LoRA, QLoRA, Unsloth
With PEFT (LoRA and QLoRA), a single 4090 covers full-FP16 LoRA up to 14B and QLoRA up to 70B with paged AdamW. Sample throughput at sequence length 2048:
| Recipe | Model | Batch | Throughput | Hours per epoch (10k samples) |
|---|---|---|---|---|
| LoRA bf16 | Llama 3 8B | 8 | ~18,000 tok/s | 0.32 |
| LoRA bf16, seq 4096 | Llama 3 8B | 4 | ~14,000 tok/s | 0.41 |
| LoRA bf16 | Qwen 2.5 14B | 4 | ~9,500 tok/s | 0.6 |
| QLoRA NF4 | Llama 3 8B | 8 | ~14,500 tok/s | 0.39 |
| QLoRA NF4 | Llama 3 70B | 1 | ~1,800 tok/s | 3.2 |
| QLoRA NF4, seq 4096 | Llama 3 70B | 1 | ~1,200 tok/s | 4.7 |
| Unsloth bf16 | Llama 3 8B | 8 | ~32,000 tok/s | 0.18 |
Unsloth’s hand-tuned Triton kernels deliver 1.7x to 1.9x baseline throughput on the 4090. See the fine-tune throughput page, the LoRA tutorial and the QLoRA tutorial for full walkthroughs.
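The "hours per epoch" column above follows directly from the throughput figures; a one-liner makes the arithmetic explicit and lets you project your own corpus sizes:

```python
def hours_per_epoch(samples: int, seq_len: int, tok_per_s: float) -> float:
    """Wall-clock hours to push samples * seq_len training tokens through."""
    return samples * seq_len / tok_per_s / 3600

# Reproduce two table rows: LoRA bf16 on Llama 3 8B, and the Unsloth variant.
print(round(hours_per_epoch(10_000, 2048, 18_000), 2))  # → 0.32
print(round(hours_per_epoch(10_000, 2048, 32_000), 2))  # → 0.18
```

The same formula scales the Tuesday fine-tune in the weekly workflow below: a 30k-sample corpus at ~18,000 tok/s and sequence 2048 is roughly an hour per epoch.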
Standard LoRA recipe in PEFT
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="bfloat16", device_map="auto",
)
peft_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
training_args = TrainingArguments(
    output_dir="./out", num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-4, bf16=True,
    optim="adamw_8bit", logging_steps=10, save_steps=200,
)
# train_dataset: your tokenised corpus (a datasets.Dataset of input_ids/labels)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```
Why each line. torch_dtype="bfloat16" avoids the loss-scale fiddling FP16 requires on Ada. device_map="auto" places everything on the single GPU. r=16, lora_alpha=32 is a balanced rank for instruction-following style transfer; rank 32-64 for domain knowledge injection. target_modules applies LoRA only to the attention projections, cutting the adapter parameter count to roughly a third of an all-linear target (attention plus MLP projections) at minimal quality cost. per_device_train_batch_size=8 with gradient_accumulation_steps=2 gives an effective batch of 16 — the sweet spot for stable training at sequence length 2048 on this card. learning_rate=2e-4 is the established LoRA default for Llama-class models. optim="adamw_8bit" via bitsandbytes halves optimiser-state VRAM at no measurable quality cost. The full QLoRA variant adds load_in_4bit=True with the bnb config from the QLoRA tutorial.
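The adapter-size claim is easy to verify by hand. A back-of-envelope sketch, assuming the standard Llama 3 8B shapes (hidden 4096, GQA KV dim 1024, MLP intermediate 14336, 32 decoder layers); each targeted projection of shape (in, out) gains an A matrix of in×r and a B matrix of r×out:

```python
def lora_params(r: int, shapes: list[tuple[int, int]], layers: int) -> int:
    # Each (in_features, out_features) projection adds r*(in + out) parameters.
    return layers * sum(r * (i + o) for i, o in shapes)

# Llama 3 8B: hidden 4096, GQA KV dim 1024, MLP intermediate 14336, 32 layers.
attn = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]  # q, k, v, o
mlp = [(4096, 14336), (4096, 14336), (14336, 4096)]  # gate, up, down
print(lora_params(16, attn, 32))        # ≈ 13.6M trainable params
print(lora_params(16, attn + mlp, 32))  # ≈ 41.9M with MLPs included
```

At rank 16 the attention-only adapter is ~13.6M parameters versus ~41.9M for the all-linear target — negligible either way next to the frozen 8B base, but the smaller adapter also shrinks optimiser state and checkpoint size.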
Evals, reproducibility and harness fit
Eval harnesses such as lm-eval-harness, lighteval and AlpacaEval-style judging fit comfortably alongside an inference server. A typical broad eval (MMLU, HellaSwag, ARC, GSM8K, IFEval) on an 8B FP8 model finishes in 2-4 hours per checkpoint on a single 4090. With prefix caching enabled, repeated few-shot prompts cut wall time by 30-50%. The 4090 fits the standard “judge model” pattern cleanly: serve Llama 3 8B FP8 as the candidate and Llama 3.1 70B AWQ INT4 as the judge, both on the same card if context budget permits, or split across two cards via multi-card pairing.
| Eval suite | Model | Wall time per checkpoint |
|---|---|---|
| MMLU (5-shot, 14k items) | Llama 3 8B FP8 | ~85 minutes |
| HellaSwag (10k items) | Llama 3 8B FP8 | ~55 minutes |
| GSM8K (1.3k items, CoT) | Llama 3 8B FP8 | ~25 minutes |
| IFEval (~540 items) | Llama 3 8B FP8 | ~10 minutes |
| AlpacaEval-LC (judge) | Llama 70B AWQ judge | ~6-8 hours per 805 items |
For reproducibility pin every layer: NVIDIA driver, CUDA toolkit, vLLM version, transformers version, dataset commit hash. Snapshot to local NVMe so a dependency push upstream cannot invalidate a published result.
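The pinning step is worth automating so every run carries its own environment record. A minimal sketch using only the standard library; the package list is an assumption — swap in whatever your stack actually imports, and pair the output with the driver/CUDA versions reported by nvidia-smi plus your dataset commit hash:

```python
# Snapshot the software environment alongside each run's outputs.
import json
import platform
from importlib import metadata

def snapshot(packages: list[str]) -> dict:
    """Record installed package versions (or 'not installed') and the Python build."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return {"python": platform.python_version(), "packages": versions}

snap = snapshot(["vllm", "transformers", "torch", "datasets"])
with open("env_snapshot.json", "w") as f:
    json.dump(snap, f, indent=2)
```

Commit env_snapshot.json next to each eval result; six months later it tells you exactly which stack produced the number in the thesis table.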
A worked weekly thesis-chapter workflow
A typical week: Monday, run a baseline eval on Llama 3 8B FP8 across MMLU plus your domain eval (~3 hours wall). Tuesday, prepare a domain corpus and launch a LoRA fine-tune at rank 16 (~3 hours for a 30k-sample corpus). Wednesday morning, re-run the eval against the fine-tuned checkpoint; afternoon, qualitative inspection of generations. Thursday, ablate one hyperparameter (e.g. rank 32 vs 16, or different target modules) and run again (~3 hours). Friday, generate qualitative samples for your write-up using the 70B AWQ model as a stronger reference, plus AlpacaEval judging if you are comparing alignment recipes. All on the same card, no queue, no cluster ticket. Across a 12-week chapter cycle, expect ~50 fine-tunes and ~150 eval runs comfortably within the budget.
Capacity, scaling triggers and pitfalls
Capacity. The card sustains: 50-80 fine-tunes per month at the 8B-LoRA scale, 8-12 fine-tunes at 14B, 2-4 at 70B QLoRA. Eval throughput: 4-6 broad-suite runs per day. Inference for qualitative samples: roughly 2 million tokens per day at 70B AWQ, or 50 million at 8B FP8.
Scaling triggers. Add a second card if you need parallel fine-tunes (different hyperparameter sweeps simultaneously), if 70B QLoRA wall time becomes the critical path, or if you want a dedicated judge model running alongside a candidate. Step up to 5090 32GB via the 5090 decision page if 14B fine-tunes at sequence 4096 become routine and you need the headroom.
Pitfalls. Forgetting to enable gradient checkpointing on long-sequence runs (OOM at the 4,000th token). Forgetting paged AdamW on 70B QLoRA (immediate OOM). Mixing FP16 and BF16 mid-training. Not snapshotting the dataset commit hash. Letting the 4090 run unthrottled in summer (thermal cap kicks in around 83 degrees C — see thermal performance). Using the wrong AWQ kernel (awq vs awq_marlin); the legacy kernel halves throughput.
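The first three pitfalls reduce to a handful of TrainingArguments settings. A sketch of the overrides to merge into the LoRA recipe above for long-sequence or 70B QLoRA runs (the option names are real transformers values; the grouping into one dict is just for illustration):

```python
# Overrides that prevent the OOM and precision pitfalls listed above.
long_run_overrides = {
    "gradient_checkpointing": True,  # recompute activations: slower, far less VRAM
    "optim": "paged_adamw_8bit",     # paged optimiser states for 70B QLoRA
    "bf16": True,                    # pick one precision and keep it all run
    "fp16": False,                   # never mix fp16 and bf16 mid-training
}
```

Pass them as extra keyword arguments to TrainingArguments; the dataset commit hash and kernel choice still need checking by hand.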
Budget envelope and grant fit
A monthly 4090 rental sits well inside a typical departmental project budget and avoids capital procurement delays — no purchase order, no finance forms, no cap-ex amortisation calculations. Power, cooling, networking and a public IP are bundled. See our monthly cost breakdown for the headline figure and the cost-per-token framing in tokens per watt. The card is rentable monthly, so a 6-month grant slot fits cleanly without long-term commitment. If you need more VRAM later, the 4090 vs 5090 decision page covers the upgrade path.
Verdict and decision criteria
A single RTX 4090 24GB is the right hardware for a research lab if: you are 1-3 active researchers, your fine-tunes top out at 14B FP16 LoRA or 70B QLoRA, your eval suites fit a few hours of wall time per run, your inference baselines include 70B at AWQ INT4 but are not high-throughput frontline serving, and you value queue-free dedicated availability over absolute peak performance. It is the wrong choice if you need full-parameter SFT above 7B (consider H100 80GB), if your training runs need 32k+ context routinely (consider 5090 32GB), or if you have a multi-researcher shared workload that genuinely benefits from cluster-scale parallel jobs. For first-day setup follow the vLLM setup tutorial and the first day checklist.
One card for an entire research line
70B inference, 14B fine-tunes, full eval suites and no queues. UK dedicated hosting.
Order the RTX 4090 24GB
See also: Llama 70B INT4 use case, LoRA fine-tune guide, QLoRA tutorial, spec breakdown, monthly cost, 70B deployment, fine-tune throughput.