Fine-tuning Llama 3.3 70B on a single RTX 5090 32GB sounds like it should not work. With QLoRA – 4-bit base weights plus trainable LoRA adapters – it is routine on our dedicated GPU hosting. Here is the config that works.
Why It Works
QLoRA keeps the base weights frozen in 4-bit (via bitsandbytes NF4). The 70B base needs ~140 GB in bf16 but only ~35 GB in NF4, so device_map="auto" offloads the layers that don't fit to CPU RAM on a 32 GB card. LoRA adapters on the attention projections add well under 1 GB of trainable parameters, and gradient checkpointing shrinks activation memory enough that the whole 70B fine-tune runs – at reduced training speed.
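The weight-memory claim is simple arithmetic. A back-of-envelope sketch (the bytes-per-parameter figures, in particular the ~0.45 bytes/param with double quantization, are approximations – real usage adds CUDA context, KV cache, and activation overhead):

```python
# Approximate weight memory for a 70B-parameter model at different precisions.
PARAMS = 70e9

def gb(bytes_per_param: float) -> float:
    """Weight memory in GB for a given bytes-per-parameter cost."""
    return PARAMS * bytes_per_param / 1e9

bf16 = gb(2.0)     # 2 bytes/param baseline
nf4 = gb(0.5)      # 4-bit NF4
nf4_dq = gb(0.45)  # double quantization also compresses the quant constants

print(f"bf16: {bf16:.0f} GB, NF4: {nf4:.0f} GB, NF4+DQ: {nf4_dq:.1f} GB")
# -> bf16: 140 GB, NF4: 35 GB, NF4+DQ: 31.5 GB
```

Even with double quantization the quantized weights alone sit near the 32 GB limit, which is why CPU offload via device_map="auto" matters here.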
Setup
pip install torch transformers peft bitsandbytes accelerate trl datasets
Training Config
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

cfg = SFTConfig(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    bf16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    num_train_epochs=3,
)
Expected Time
On a single 5090, QLoRA on Llama 3.3 70B runs at roughly 1,000-3,000 training tokens/second depending on sequence length. For 10k samples at 2k tokens each, one epoch takes 2-6 hours. Three epochs fit comfortably in an overnight run.
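A quick sanity check on those figures, using the throughput range quoted above:

```python
samples = 10_000
tokens_per_sample = 2_000
total_tokens = samples * tokens_per_sample  # 20M tokens per epoch

for tps in (1_000, 3_000):
    hours = total_tokens / tps / 3600
    print(f"{tps} tok/s -> {hours:.1f} h per epoch")
# -> 1000 tok/s -> 5.6 h per epoch
# -> 3000 tok/s -> 1.9 h per epoch
```

So the 2-6 hour epoch estimate holds, and three epochs lands between roughly 6 and 17 hours depending on where in the throughput range you fall.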
Single-GPU QLoRA Fine-Tuning
UK dedicated 5090 servers with CUDA, PyTorch, and bitsandbytes preconfigured.
Browse GPU Servers
For alternative training recipes, see Unsloth on 4060 Ti (faster) and Axolotl (config-driven).