Training vs Inference: VRAM Demands
Training AI models is far more VRAM-intensive than inference. While inference only needs to store the model weights and a small KV cache, training must hold the model, gradients, optimiser states, and activations simultaneously. As a rough rule, training at FP16 requires 3-4x the VRAM of inference. A dedicated GPU server with 24GB gives you real training capability, but you need to know the boundaries.
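The components above can be tallied with a rough back-of-the-envelope calculator. This is a sketch, not a measurement: the 2-byte FP16 weight, AdamW's two FP32 moments, the FP32 master copy, and the fixed KV-cache and activation overheads are all illustrative assumptions.

```python
def inference_vram_gb(params_b: float, bytes_per_weight: float = 2.0,
                      kv_cache_gb: float = 1.0) -> float:
    """Inference footprint: FP16 weights plus a small KV cache (assumed 1 GB)."""
    return params_b * bytes_per_weight + kv_cache_gb

def training_vram_gb(params_b: float, weight_bytes: float = 2.0,
                     grad_bytes: float = 2.0, optimizer_bytes: float = 8.0,
                     master_bytes: float = 4.0, activations_gb: float = 4.0) -> float:
    """Training footprint: FP16 weights + FP16 gradients + FP32 AdamW moments
    + FP32 master weights, plus an assumed flat activation budget."""
    per_param = weight_bytes + grad_bytes + optimizer_bytes + master_bytes
    return params_b * per_param + activations_gb

# A 3B model: a few GB to serve, tens of GB to train naively.
print(inference_vram_gb(3))   # FP16 serving estimate
print(training_vram_gb(3))    # naive FP16 + AdamW training estimate
```

With these assumptions a 3B model needs roughly 7GB to serve but ~52GB to train naively, which is why the memory-saving techniques discussed later matter so much on a 24GB card.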
The RTX 3090 provides 24GB of GDDR6X memory with 936 GB/s bandwidth. For training, the tensor core performance (142 TFLOPS with sparsity) accelerates mixed-precision workflows, and the large VRAM pool keeps batch sizes workable.
What You Can Train on 24GB
| Training Task | Model Size | Method | VRAM Required | Fits RTX 3090? |
|---|---|---|---|---|
| LoRA fine-tune | 7B LLM | QLoRA (4-bit) | ~6 GB | Yes |
| LoRA fine-tune | 13B LLM | QLoRA (4-bit) | ~10 GB | Yes |
| LoRA fine-tune | 70B LLM | QLoRA (4-bit) | ~40 GB | No |
| Full fine-tune | 3B LLM | FP16 + AdamW | ~18 GB | Yes |
| Full fine-tune | 7B LLM | FP16 + AdamW | ~50 GB | No |
| SD LoRA | SD 1.5 | FP16 | ~8 GB | Yes |
| SD LoRA | SDXL | FP16 | ~14 GB | Yes |
| DreamBooth | SD 1.5 | FP16 | ~16 GB | Yes |
| DreamBooth | SDXL | FP16 | ~22 GB | Tight |
The practical ceiling for the RTX 3090 is QLoRA fine-tuning of up to 13B parameter models and full fine-tuning of models up to about 3B parameters. For Stable Diffusion training, the 3090 handles all LoRA and DreamBooth workflows comfortably.
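The table above can be folded into a simple fit-check helper. The per-task figures are the table's own estimates; the 1GB headroom for CUDA context and fragmentation is an assumption.

```python
VRAM_RTX3090_GB = 24.0

# VRAM estimates (GB) taken from the table above: (method, model size in B params).
ESTIMATES_GB = {
    ("qlora", 7): 6, ("qlora", 13): 10, ("qlora", 70): 40,
    ("full_ft", 3): 18, ("full_ft", 7): 50,
}

def fits_rtx3090(method: str, size_b: int, headroom_gb: float = 1.0) -> bool:
    """True if the estimated footprint plus headroom fits in 24GB."""
    return ESTIMATES_GB[(method, size_b)] + headroom_gb <= VRAM_RTX3090_GB

print(fits_rtx3090("qlora", 13))   # 13B QLoRA fits
print(fits_rtx3090("full_ft", 7))  # 7B full fine-tune does not
```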
LoRA and QLoRA Fine-Tuning on RTX 3090
LoRA (Low-Rank Adaptation) is the most practical approach for fine-tuning large models on consumer hardware. Instead of updating all model weights, LoRA trains small adapter matrices that capture task-specific knowledge. QLoRA combines this with 4-bit quantisation of the base model, slashing VRAM requirements dramatically.
On the RTX 3090, you can QLoRA fine-tune Llama 3 8B with a batch size of 4 and sequence length of 2048, using around 10-12GB of VRAM. This leaves headroom for gradient checkpointing and longer sequences. For Mistral 7B and DeepSeek models, the story is similar.
Training speed for QLoRA on a 7B model typically ranges from 1,500-2,500 tokens per second on the RTX 3090, so a fine-tuning run over a few thousand examples can finish in under an hour.
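That wall-clock claim is simple arithmetic: total tokens processed divided by throughput. The example dataset size, sequence length, and epoch count below are assumptions chosen to match the "few thousand examples" scenario.

```python
def finetune_hours(num_examples: int, avg_tokens_per_example: int,
                   epochs: int, tokens_per_sec: float) -> float:
    """Estimated wall-clock hours for a fine-tuning run at a given throughput."""
    total_tokens = num_examples * avg_tokens_per_example * epochs
    return total_tokens / tokens_per_sec / 3600

# 3,000 examples x 512 tokens x 3 epochs at 2,000 tok/s (mid-range for the 3090)
print(finetune_hours(3000, 512, 3, 2000))  # well under an hour
```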
Full Fine-Tuning Capabilities
Full fine-tuning updates every parameter in the model, requiring VRAM for weights, gradients, and optimiser states. With the standard AdamW optimiser, the naive cost is roughly 16 bytes per parameter (2 for the FP16 weight, 2 for the gradient, 4+4 for the FP32 optimiser moments, and 4 for the FP32 master weights), plus activations.
At that rate even a 1.5B model would need around 24GB on its own, so practical full fine-tuning on 24GB leans on memory savers such as an 8-bit optimiser and gradient checkpointing. With those enabled, a 1.5B parameter model trains comfortably in around 14GB, and a 3B model is the practical maximum on 24GB. Models like Phi-3 Mini (3.8B) push right up against the limit.
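The gap between the naive byte count and the 24GB budget is mostly the optimiser states, which a sketch can make concrete. The per-state byte counts below are assumptions: 8-bit optimisers (as in the bitsandbytes library) quantise the two AdamW moments to roughly one byte each, and the flat activation budget is illustrative.

```python
def full_ft_gb(params_b: float, optimizer: str = "adamw_fp32",
               activations_gb: float = 3.0) -> float:
    """Full fine-tune footprint: FP16 weight (2B) + FP16 grad (2B) per param,
    plus optimiser state bytes per param, plus an assumed activation budget."""
    state_bytes = {
        "adamw_fp32": 8 + 4,  # two FP32 moments + FP32 master weights
        "adamw_8bit": 2,      # two quantised moments, no master copy (assumption)
    }
    return params_b * (2 + 2 + state_bytes[optimizer]) + activations_gb

print(full_ft_gb(3, "adamw_fp32"))  # far over 24GB
print(full_ft_gb(3, "adamw_8bit"))  # inside the 3090's budget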
For image model training, the 3090 handles Stable Diffusion 1.5 full fine-tuning and SDXL LoRA training without issues. See the Stable Diffusion VRAM requirements guide for detailed breakdowns.
Memory Optimisation Techniques
Several techniques help maximise what you can train on 24GB. Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. This can reduce activation memory by 60-70% at the cost of about 30% slower training. Mixed-precision training (FP16 with FP32 master weights) is standard and well-supported on the 3090’s tensor cores.
DeepSpeed ZeRO Stage 2 shards optimiser states and gradients across GPUs if you scale to a multi-GPU setup. Gradient accumulation lets you simulate larger batch sizes without additional VRAM. Combined, these techniques push the effective training capacity well beyond what raw memory numbers suggest. Our VRAM cost guide explains the trade-offs in detail.
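Gradient accumulation is worth sketching, since it is the cheapest of these tricks: run several micro-batch forward/backward passes, letting gradients sum in place, and only apply the optimiser update every N passes. The framework-agnostic skeleton below (the batch sizes are example assumptions) shows the bookkeeping without any framework calls.

```python
def accumulation_steps(target_batch: int, micro_batch: int) -> int:
    """Forward/backward passes needed per optimiser step to reach target_batch."""
    assert target_batch % micro_batch == 0
    return target_batch // micro_batch

def train_steps(num_micro_batches: int, accum_steps: int) -> int:
    """Training-loop skeleton: count optimiser updates over a run."""
    optimizer_steps = 0
    for i in range(num_micro_batches):
        # forward + backward on one micro-batch; gradients accumulate in place
        if (i + 1) % accum_steps == 0:
            optimizer_steps += 1  # apply the update, then zero the gradients
    return optimizer_steps

# Effective batch of 32 from micro-batches of 4 -> 8 passes per update.
print(accumulation_steps(32, 4))
print(train_steps(32, 8))  # 32 micro-batches at accum=8 -> 4 optimiser updates
```

VRAM stays at the micro-batch level throughout; only the step cadence changes, which is why accumulation costs time rather than memory.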
When 24GB Is Not Enough
The RTX 3090 falls short when you need to full fine-tune models above 3B parameters, or QLoRA fine-tune models above 13B. If your workflow demands training 70B models or running full fine-tunes of 7B+ models, you need either multi-GPU setups or cards with more VRAM like the RTX 5090 (32GB).
For most practitioners, though, the combination of QLoRA for large models and full fine-tuning for small models covers the vast majority of real-world training needs. Use the GPU comparison tools to evaluate whether the 3090 matches your specific training requirements.
Train AI Models on RTX 3090 Servers
Fine-tune Llama, Mistral, and Stable Diffusion models on dedicated RTX 3090 servers. 24GB VRAM with full root access and pre-installed training frameworks.
Browse GPU Servers