Fine-tuning Llama 3.3 70B on a single RTX 5090 32GB sounds like it should not work. With QLoRA – 4-bit base weights plus trainable LoRA adapters – it is routine on our dedicated GPU hosting. Here is the config that works.
Why It Works
QLoRA keeps the base weights frozen in 4-bit (via bitsandbytes NF4). The 70B base needs ~140 GB in bf16 but only ~35 GB in NF4, so device_map="auto" offloads the layers that don't fit to CPU RAM on a 32 GB card. LoRA adapters on the attention projections add well under 1 GB of trainable parameters, and gradient checkpointing shrinks activation memory enough that the whole 70B fine-tune runs – at reduced training speed.
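The weight-memory claim is simple arithmetic. A back-of-envelope sketch (the bytes-per-parameter figures, in particular the ~0.45 bytes/param with double quantization, are approximations – real usage adds CUDA context, KV cache, and activation overhead):

```python
# Approximate weight memory for a 70B-parameter model at different precisions.
PARAMS = 70e9

def gb(bytes_per_param: float) -> float:
    """Weight memory in GB for a given bytes-per-parameter cost."""
    return PARAMS * bytes_per_param / 1e9

bf16 = gb(2.0)     # 2 bytes/param baseline
nf4 = gb(0.5)      # 4-bit NF4
nf4_dq = gb(0.45)  # double quantization also compresses the quant constants

print(f"bf16: {bf16:.0f} GB, NF4: {nf4:.0f} GB, NF4+DQ: {nf4_dq:.1f} GB")
# -> bf16: 140 GB, NF4: 35 GB, NF4+DQ: 31.5 GB
```

Even with double quantization the quantized weights alone sit near the 32 GB limit, which is why CPU offload via device_map="auto" matters here.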
Setup
pip install torch transformers peft bitsandbytes accelerate trl datasets
Training Config
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

cfg = SFTConfig(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    bf16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    num_train_epochs=3,
)
Expected Time
On a single 5090, QLoRA on Llama 3.3 70B runs at roughly 1,000-3,000 training tokens/second depending on sequence length. For 10k samples at 2k tokens each, one epoch takes 2-6 hours. Three epochs fit comfortably in an overnight run.
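A quick sanity check on those figures, using the throughput range quoted above:

```python
samples = 10_000
tokens_per_sample = 2_000
total_tokens = samples * tokens_per_sample  # 20M tokens per epoch

for tps in (1_000, 3_000):
    hours = total_tokens / tps / 3600
    print(f"{tps} tok/s -> {hours:.1f} h per epoch")
# -> 1000 tok/s -> 5.6 h per epoch
# -> 3000 tok/s -> 1.9 h per epoch
```

So the 2-6 hour epoch estimate holds, and three epochs lands between roughly 6 and 17 hours depending on where in the throughput range you fall.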
Single-GPU QLoRA Fine-Tuning
UK dedicated 5090 servers with CUDA, PyTorch, and bitsandbytes preconfigured.
Browse GPU Servers
For alternative training recipes, see Unsloth on 4060 Ti (faster) and Axolotl (config-driven).