LoRA covers most fine-tuning needs, but occasionally you want to update every parameter: domain-specific continued pretraining, for example, or cases where LoRA's rank cannot express the shift you need. On an RTX 6000 Pro 96 GB from our dedicated GPU hosting, a full fine-tune of a 7B model is comfortable.
Memory Budget
Full fine-tune of a 7B model at BF16:
| Component | ~VRAM |
|---|---|
| Weights (BF16) | 14 GB |
| Gradients (BF16) | 14 GB |
| AdamW optimiser (FP32 m+v) | 56 GB |
| Activations | 4-8 GB |
| Total | ~88-92 GB |
This fits a 96 GB card with a small batch size and gradient checkpointing. Switching to 8-bit AdamW cuts optimiser memory roughly 4x (56 GB down to ~14 GB) if you need more headroom for batch size.
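The budget above follows from bytes-per-parameter arithmetic. A rough estimator (activations are excluded because they depend on batch size and sequence length):

```python
def full_finetune_vram_gb(n_params_b: float, optim_bytes_per_param: int = 8) -> dict:
    """Rough VRAM estimate (GB) for full fine-tuning in BF16.

    optim_bytes_per_param: 8 for AdamW with FP32 m and v (4 + 4 bytes),
    2 for an 8-bit optimiser (1 + 1 bytes).
    """
    weights = n_params_b * 2                  # BF16 weights: 2 bytes/param
    grads = n_params_b * 2                    # BF16 gradients: 2 bytes/param
    optim = n_params_b * optim_bytes_per_param
    return {
        "weights": weights,
        "grads": grads,
        "optim": optim,
        "total_ex_activations": weights + grads + optim,
    }

print(full_finetune_vram_gb(7))     # AdamW FP32 state: 84 GB before activations
print(full_finetune_vram_gb(7, 2))  # 8-bit optimiser: 42 GB before activations
```

Add the 4-8 GB of activations from the table and the FP32-state total lands in the ~88-92 GB range.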
Config
```python
from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype="bfloat16",            # load weights in BF16 (~14 GB)
    device_map="cuda",
)

cfg = SFTConfig(
    output_dir="./out",
    per_device_train_batch_size=2,     # keep small to fit the 96 GB budget
    gradient_accumulation_steps=8,     # effective batch size of 16
    gradient_checkpointing=True,       # trades compute for activation memory
    learning_rate=5e-6,                # full fine-tunes want a much lower LR than LoRA
    bf16=True,
    optim="adamw_torch_fused",         # or "adamw_bnb_8bit" for ~4x less optimiser memory
    num_train_epochs=3,
    save_steps=500,
    logging_steps=10,
)

trainer = SFTTrainer(model=model, args=cfg, train_dataset=your_dataset)
trainer.train()
```
Data
A full fine-tune is forgiving of data quantity but unforgiving of quality: a few thousand high-quality examples in your target format typically beat tens of thousands of noisy samples. For continued pretraining, clean domain text without instruction formatting works; for instruction tuning, keep the chat template consistent with the base model.
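A cheap quality pass before training usually pays off more than adding volume. A minimal sketch, assuming each sample is a dict with a `"text"` field (the thresholds are illustrative, tune them for your domain):

```python
def clean_dataset(samples, min_chars=200, max_chars=8000):
    """Drop too-short/too-long samples and exact duplicates.

    `samples` is a list of dicts with a "text" field (assumed schema);
    the character thresholds are illustrative, not prescriptive.
    """
    seen = set()
    kept = []
    for s in samples:
        text = s["text"].strip()
        if not (min_chars <= len(text) <= max_chars):
            continue                    # length filter
        if text in seen:
            continue                    # exact-duplicate filter
        seen.add(text)
        kept.append({**s, "text": text})
    return kept

raw = [{"text": "x" * 500}, {"text": "x" * 500}, {"text": "too short"}]
print(len(clean_dataset(raw)))  # 1: one duplicate and one short sample dropped
```

Near-duplicate detection (MinHash and similar) goes further, but even this exact-match pass removes the most common noise in scraped corpora.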
Training Time
On an RTX 6000 Pro, a Mistral 7B full fine-tune runs at roughly 5,000-8,000 training tokens/second. 10 million training tokens (roughly 5k samples × 2k tokens) finishes in 20-35 minutes; three epochs take about 1-2 hours. Much faster than QLoRA on smaller hardware.
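The timings above are just throughput arithmetic, so you can plug in your own dataset size and measured tokens/second:

```python
def train_hours(total_tokens: float, tokens_per_sec: float, epochs: int = 1) -> float:
    """Wall-clock hours for a fine-tune, given measured throughput."""
    return total_tokens * epochs / tokens_per_sec / 3600

# 10M tokens (~5k samples x 2k tokens) at the 6000 Pro's measured range:
print(round(train_hours(10e6, 8000) * 60))          # ~21 minutes/epoch (fast end)
print(round(train_hours(10e6, 5000) * 60))          # ~33 minutes/epoch (slow end)
print(round(train_hours(10e6, 6500, epochs=3), 1))  # ~1.3 hours for three epochs
```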
Full Fine-Tune Hosting
RTX 6000 Pro UK dedicated servers ready for full-parameter fine-tuning.
Browse GPU Servers
See LoRA on Mistral 7B and QLoRA on Llama 3.3.