
Gradient Checkpointing VRAM Savings

Gradient checkpointing trades ~25% training speed for ~60% VRAM savings. Often the single setting that decides whether your fine-tune runs.

Gradient checkpointing is a training-only memory optimisation. Instead of storing every intermediate activation for the backward pass, the framework discards most of them and recomputes them as needed. On our dedicated GPU hosting it is usually the difference between a 7B full fine-tune fitting on a 24 GB card or not.

Mechanics

During a standard training step the forward pass stores every layer’s activations so the backward pass can compute gradients. For a 32-layer 7B model, activations can consume 8-16 GB at typical batch sizes. Checkpointing saves only the activations at selected layer boundaries (“checkpoints”) and recomputes the rest during backward. Fewer activations = less VRAM. Recomputation = extra compute.
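The store-at-boundaries, recompute-in-backward mechanic is easy to see with PyTorch's built-in `torch.utils.checkpoint` on a toy model (a minimal sketch, not our training code):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy 4-block "model"; normally every Linear/ReLU output would be
# kept in memory until the backward pass.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)]
)

x = torch.randn(8, 64, requires_grad=True)
h = x
for block in blocks:
    # checkpoint() stores only the block's input (the "checkpoint");
    # the block's internal activations are recomputed during backward
    # instead of being held in VRAM.
    h = checkpoint(block, h, use_reentrant=False)

loss = h.sum()
loss.backward()  # re-runs each block's forward just before its backward
```

Gradients come out identical to the uncheckpointed version; the only differences are lower peak memory and the extra forward recomputation.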

Savings

Workload             Without   With     Speed Cost
Mistral 7B SFT       ~26 GB    ~12 GB   ~25% slower
Llama 3 8B QLoRA     ~17 GB    ~11 GB   ~20% slower
Qwen 14B LoRA        ~38 GB    ~18 GB   ~25% slower

Savings scale with sequence length – longer sequences benefit more because activation memory grows with length.
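A back-of-envelope estimate shows why. Activation memory per layer is roughly batch × sequence length × hidden dim × bytes per element, times a small constant for the several tensors kept per transformer layer (the formula and constants below are illustrative assumptions, not measurements):

```python
# Rough activation-memory estimate for a 7B-class model
# (hidden=4096, 32 layers). All constants are assumptions.
def activation_gb(batch, seq_len, hidden=4096, layers=32,
                  bytes_per_elem=2, tensors_per_layer=8):
    per_layer = batch * seq_len * hidden * bytes_per_elem * tensors_per_layer
    return per_layer * layers / 1e9

short = activation_gb(batch=4, seq_len=1024)   # ~8.6 GB
long = activation_gb(batch=4, seq_len=4096)    # 4x the memory

print(f"1k tokens: {short:.1f} GB, 4k tokens: {long:.1f} GB")
```

Quadrupling sequence length quadruples activation memory, so checkpointing's absolute savings grow with context length while weight and optimiser memory stay fixed.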

Enable

In Transformers / TRL:

training_args = SFTConfig(
    ...,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

The use_reentrant=False option uses PyTorch’s newer non-reentrant checkpointing – more robust and slightly faster on modern PyTorch versions.

For Unsloth use the framework-specific version:

model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing="unsloth",
    ...
)

Tradeoffs

The 25% wall-clock penalty is usually worth it because:

  • You fit larger models on the same GPU
  • You run larger effective batch size by using the freed VRAM
  • The alternative is upgrading to a bigger GPU, which usually costs more than the ~25% speed penalty

Skip checkpointing only when VRAM is abundant and every training second matters – a rare combination on cost-sensitive dedicated hosting.
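A quick sanity check on that trade, with illustrative hourly prices (all numbers below are assumptions, not our pricing):

```python
# Hypothetical comparison: keep a 24 GB card and eat the recompute
# penalty, or rent a larger card and run without checkpointing.
hours_without_ckpt = 10.0                       # assumed run time, big card
hours_with_ckpt = hours_without_ckpt * 1.25     # ~25% slower with checkpointing
price_small, price_big = 0.50, 1.20             # assumed $/hour per card

cost_small = hours_with_ckpt * price_small      # 12.5 h * $0.50 = $6.25
cost_big = hours_without_ckpt * price_big       # 10.0 h * $1.20 = $12.00

print(f"small card + checkpointing: ${cost_small:.2f}")
print(f"big card, no checkpointing: ${cost_big:.2f}")
```

Unless the larger card is less than ~1.25x the price of the smaller one, checkpointing wins on cost per run.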

Fine-Tuning Without VRAM Stress

UK dedicated GPUs sized so you can choose whether to enable checkpointing, not be forced to.

Browse GPU Servers

See Flash Attention 2 setup and BF16 vs FP16 training.
