Gradient checkpointing is a training-only memory optimisation. Instead of storing every intermediate activation for the backward pass, the framework discards most of them and recomputes them as needed. On our dedicated GPU hosting it is often the difference between a 7B full fine-tune fitting on a 24 GB card and not fitting at all.
Mechanics
During a standard training step the forward pass stores every layer’s activations so the backward pass can compute gradients. For a 32-layer 7B model, activations can consume 8-16 GB at typical batch sizes. Checkpointing saves only the activations at selected layer boundaries (“checkpoints”) and recomputes the rest during backward. Fewer activations = less VRAM. Recomputation = extra compute.
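The bookkeeping above can be sketched in plain Python. This is a toy model of the idea only: "layers" are scalar functions, and we simply count how many activations each strategy must hold; a real framework would also recompute the missing segments during the backward pass.

```python
def forward_full(layers, x):
    """Standard forward: keep every intermediate activation for backward."""
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts  # len(layers) + 1 stored values


def forward_checkpointed(layers, x, every=4):
    """Keep only every `every`-th activation; the rest would be recomputed
    segment-by-segment during the backward pass."""
    saved = {0: x}
    h = x
    for i, f in enumerate(layers, start=1):
        h = f(h)
        if i % every == 0:
            saved[i] = h
    return saved  # roughly len(layers) / every stored values


# A 32-"layer" chain: full storage keeps 33 values, checkpointing keeps 9.
layers = [lambda v, k=k: v + k for k in range(32)]
full = forward_full(layers, 0.0)
ckpt = forward_checkpointed(layers, 0.0, every=4)
print(len(full), len(ckpt))  # 33 9
```

The final activation is identical either way; checkpointing only changes what is kept in memory between the forward and backward passes.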
Savings
| Workload | VRAM without | VRAM with | Speed cost |
|---|---|---|---|
| Mistral 7B SFT | ~26 GB | ~12 GB | ~25% slower |
| Llama 3 8B QLoRA | ~17 GB | ~11 GB | ~20% slower |
| Qwen 14B LoRA | ~38 GB | ~18 GB | ~25% slower |
Savings scale with sequence length – longer sequences benefit more because activation memory grows with length.
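That scaling can be made concrete with a back-of-envelope estimator. Every number here is an assumption: the `tensors_per_layer` fudge factor stands in for MLP intermediates and attention buffers, which vary by architecture, and bf16 (2 bytes per element) is assumed.

```python
def activation_gib(n_layers, batch, seq_len, hidden,
                   bytes_per_elem=2, tensors_per_layer=1):
    """Rough activation-memory estimate: a few hidden-state-sized tensors
    per layer, in GiB. Illustrative only, not an exact model."""
    return (n_layers * batch * seq_len * hidden
            * bytes_per_elem * tensors_per_layer) / 2**30


# A 32-layer, 4096-hidden model at batch 4, sequence length 2048,
# assuming ~4 hidden-state-sized tensors saved per layer:
print(activation_gib(32, 4, 2048, 4096, tensors_per_layer=4))  # 8.0
```

Because the formula is linear in `seq_len`, doubling the sequence length doubles the activation estimate, which is why long-context fine-tunes see the largest savings from checkpointing.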
Enable
In Transformers / TRL:
```python
training_args = SFTConfig(
    ...,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```
The `use_reentrant=False` option selects PyTorch's newer non-reentrant checkpointing implementation, which is more robust and slightly faster on modern PyTorch versions; the reentrant variant is deprecated.
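If you are not using a Trainer, the same behaviour can be enabled directly on a loaded model. A fragment assuming a recent `transformers` version (not executed here):

```python
# Enable non-reentrant checkpointing on the model itself.
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
# The KV cache is incompatible with gradient checkpointing during
# training, so disable it:
model.config.use_cache = False
```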
For Unsloth use the framework-specific version:
```python
model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing="unsloth",
    ...
)
```
Tradeoffs
The 25% wall-clock penalty is usually worth it because:
- You fit larger models on the same GPU
- You run larger effective batch size by using the freed VRAM
- The alternative is upgrading to a bigger GPU, which usually costs more than the ~25% speed penalty
Skip checkpointing only when VRAM is abundant and every training second matters – a rare combination on cost-sensitive dedicated hosting.
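The cost argument can be made concrete with hypothetical hourly rates. These numbers are illustrative only, not actual pricing:

```python
# Hypothetical scenario: both cards can run the job, but only the larger
# one fits it without checkpointing.
rate_24gb_per_hr = 0.60   # smaller card; needs checkpointing
rate_48gb_per_hr = 1.10   # larger card; no checkpointing needed
base_hours = 10.0         # wall-clock training time without checkpointing
recompute_penalty = 1.25  # ~25% slowdown from recomputation

cost_small = rate_24gb_per_hr * base_hours * recompute_penalty
cost_large = rate_48gb_per_hr * base_hours
print(f"checkpointed on 24 GB:   ${cost_small:.2f}")  # $7.50
print(f"uncheckpointed on 48 GB: ${cost_large:.2f}")  # $11.00
```

Under these assumed rates, eating the recompute penalty on the smaller card is still cheaper; the conclusion flips only when the price gap between the two cards is smaller than the slowdown.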
Fine-Tuning Without VRAM Stress
Our UK dedicated GPUs are sized so that enabling checkpointing is a choice, not a necessity.
Browse GPU Servers