
QLoRA Fine-Tuning Llama 3.3 70B on RTX 5090

QLoRA lets you fine-tune a 70B model on a single 32GB GPU. Here is the actual configuration and what to expect in training time.

Fine-tuning Llama 3.3 70B on a single RTX 5090 32GB sounds like it should not work. With QLoRA – 4-bit base weights plus trainable LoRA adapters – it is routine on our dedicated GPU hosting. Here is the config that works.


Why It Works

QLoRA keeps the base weights frozen in 4-bit NF4 (via bitsandbytes). In bf16 the 70B base alone needs ~140 GB; double-quantized NF4 brings it down to roughly 35 GB, and device_map="auto" spills any overflow layers to system RAM. LoRA adapters on the attention projections add on the order of 100 MB of trainable parameters — well under 1 GB even counting gradients and optimizer state. Combined with gradient checkpointing, which keeps activation memory to a few GB, the GPU-resident working set fits on a 32 GB card for a 70B fine-tune — at reduced training speed.
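A back-of-envelope check of the base-model footprint. The ~4.127 bits/parameter figure is an assumed effective NF4 cost (4 bits per weight plus per-block scale metadata after double quantization), not a measured value:

```python
def nf4_base_gb(n_params_billion, bits_per_param=4.127):
    """Rough GB footprint of a double-quantized NF4 base model.

    bits_per_param: ~4 bits/weight + per-block absmax/scale metadata
    (an assumed effective rate, not read from bitsandbytes).
    """
    return n_params_billion * bits_per_param / 8

print(f"bf16 base:  {70 * 2:.0f} GB")          # 2 bytes/param -> 140 GB
print(f"NF4 base:   {nf4_base_gb(70):.1f} GB")  # ~36 GB
```

The NF4 estimate lands slightly above 32 GB on its own, which is why device_map="auto" matters: the handful of layers that do not fit are served from CPU RAM.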

Setup

pip install torch transformers peft bitsandbytes accelerate trl datasets

Training Config

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in bf16
    bnb_4bit_use_double_quant=True,          # also quantize the quant constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb,
    device_map="auto",   # spills overflow layers to CPU RAM if needed
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

cfg = SFTConfig(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # effective batch size of 16
    learning_rate=2e-4,
    bf16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",         # pages optimizer state to CPU under pressure
    num_train_epochs=3,
)

# Swap in your own instruction dataset; SFTTrainer expects a "text" column
# (or a formatting function) -- "train.jsonl" here is a placeholder path.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(model=model, args=cfg, train_dataset=dataset)
trainer.train()
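For a sense of scale, here is the trainable-parameter count the LoraConfig above produces. The dimensions (hidden size 8192, 80 layers, 128-dim heads, 8 KV heads) are taken from Llama 3.3 70B's published model card, not read from the checkpoint, so treat this as a sketch:

```python
# Assumed Llama 3.3 70B shapes (from the model card): hidden 8192,
# 80 layers, head_dim 128, 8 grouped-query KV heads.
hidden, layers, head_dim, kv_heads, r = 8192, 80, 128, 8, 16
kv_dim = head_dim * kv_heads   # 1024: k/v projections are narrower under GQA

# LoRA adds r * (d_in + d_out) parameters per targeted linear layer.
per_layer = (
    r * (hidden + hidden)      # q_proj: 8192 -> 8192
    + r * (hidden + kv_dim)    # k_proj: 8192 -> 1024
    + r * (hidden + kv_dim)    # v_proj: 8192 -> 1024
    + r * (hidden + hidden)    # o_proj: 8192 -> 8192
)
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable params, "
      f"~{total * 2 / 1e6:.0f} MB in bf16")
```

About 65M trainable parameters — a rounding error next to the 4-bit base, which is why the adapter side of QLoRA is cheap even at 70B scale.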

Expected Time

On a single 5090, QLoRA on Llama 3.3 70B runs at roughly 1,000-3,000 training tokens/second depending on sequence length. For 10k samples at 2k tokens each (~20M tokens), one epoch takes roughly 2-6 hours. Three epochs fit comfortably in an overnight run.
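The epoch estimate above is simple arithmetic, sketched here with the article's own numbers (the throughput range is the quoted figure, not an independent benchmark):

```python
def epoch_hours(samples, tokens_per_sample, tokens_per_sec):
    """Wall-clock hours for one epoch at a given training throughput."""
    return samples * tokens_per_sample / tokens_per_sec / 3600

# 10k samples x 2k tokens = 20M tokens per epoch
print(f"fast (3,000 tok/s): {epoch_hours(10_000, 2_000, 3_000):.1f} h")
print(f"slow (1,000 tok/s): {epoch_hours(10_000, 2_000, 1_000):.1f} h")
```

That gives about 1.9 to 5.6 hours per epoch, matching the 2-6 hour range above.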

Single-GPU QLoRA Fine-Tuning

UK dedicated 5090 servers with CUDA, PyTorch, and bitsandbytes preconfigured.

Browse GPU Servers

For alternative training recipes see Unsloth on 4060 Ti (faster) and Axolotl (config-driven).


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
