Gradient checkpointing is a training-only memory optimisation. Instead of storing every intermediate activation for the backward pass, the framework discards most of them and recomputes them as needed. On our dedicated GPU hosting it is often the difference between a 7B full fine-tune fitting on a 24 GB card and not fitting at all.
Mechanics
During a standard training step the forward pass stores every layer’s activations so the backward pass can compute gradients. For a 32-layer 7B model, activations can consume 8-16 GB at typical batch sizes. Checkpointing saves only the activations at selected layer boundaries (“checkpoints”) and recomputes the rest during backward. Fewer activations = less VRAM. Recomputation = extra compute.
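The bookkeeping above can be sketched in plain Python. This is a toy model of the idea only: "layers" are scalar functions, and we simply count how many activations each strategy must hold; a real framework would also recompute the missing segments during the backward pass.

```python
def forward_full(layers, x):
    """Standard forward: keep every intermediate activation for backward."""
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts  # len(layers) + 1 stored values


def forward_checkpointed(layers, x, every=4):
    """Keep only every `every`-th activation; the rest would be recomputed
    segment-by-segment during the backward pass."""
    saved = {0: x}
    h = x
    for i, f in enumerate(layers, start=1):
        h = f(h)
        if i % every == 0:
            saved[i] = h
    return saved  # roughly len(layers) / every stored values


# A 32-"layer" chain: full storage keeps 33 values, checkpointing keeps 9.
layers = [lambda v, k=k: v + k for k in range(32)]
full = forward_full(layers, 0.0)
ckpt = forward_checkpointed(layers, 0.0, every=4)
print(len(full), len(ckpt))  # 33 9
```

The final activation is identical either way; checkpointing only changes what is kept in memory between the forward and backward passes.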
Savings
| Workload | VRAM without | VRAM with | Speed cost |
|---|---|---|---|
| Mistral 7B SFT | ~26 GB | ~12 GB | ~25% slower |
| Llama 3 8B QLoRA | ~17 GB | ~11 GB | ~20% slower |
| Qwen 14B LoRA | ~38 GB | ~18 GB | ~25% slower |
Savings scale with sequence length – longer sequences benefit more because activation memory grows with length.
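That scaling can be made concrete with a back-of-envelope estimator. Every number here is an assumption: the `tensors_per_layer` fudge factor stands in for MLP intermediates and attention buffers, which vary by architecture, and bf16 (2 bytes per element) is assumed.

```python
def activation_gib(n_layers, batch, seq_len, hidden,
                   bytes_per_elem=2, tensors_per_layer=1):
    """Rough activation-memory estimate: a few hidden-state-sized tensors
    per layer, in GiB. Illustrative only, not an exact model."""
    return (n_layers * batch * seq_len * hidden
            * bytes_per_elem * tensors_per_layer) / 2**30


# A 32-layer, 4096-hidden model at batch 4, sequence length 2048,
# assuming ~4 hidden-state-sized tensors saved per layer:
print(activation_gib(32, 4, 2048, 4096, tensors_per_layer=4))  # 8.0
```

Because the formula is linear in `seq_len`, doubling the sequence length doubles the activation estimate, which is why long-context fine-tunes see the largest savings from checkpointing.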
Enable
In Transformers / TRL:
```python
training_args = SFTConfig(
    ...,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```
The `use_reentrant=False` option selects PyTorch's newer non-reentrant checkpointing implementation, which is more robust and slightly faster on modern PyTorch versions; the reentrant variant is deprecated.
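If you are not using a Trainer, the same behaviour can be enabled directly on a loaded model. A fragment assuming a recent `transformers` version (not executed here):

```python
# Enable non-reentrant checkpointing on the model itself.
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
# The KV cache is incompatible with gradient checkpointing during
# training, so disable it:
model.config.use_cache = False
```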
For Unsloth use the framework-specific version:
```python
model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing="unsloth",
    ...
)
```
Tradeoffs
The 25% wall-clock penalty is usually worth it because:
- You fit larger models on the same GPU
- You run larger effective batch size by using the freed VRAM
- The alternative is upgrading to a bigger GPU, which usually costs more than the ~25% speed penalty
Skip checkpointing only when VRAM is abundant and every training second matters – a rare combination on cost-sensitive dedicated hosting.
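The cost argument can be made concrete with hypothetical hourly rates. These numbers are illustrative only, not actual pricing:

```python
# Hypothetical scenario: both cards can run the job, but only the larger
# one fits it without checkpointing.
rate_24gb_per_hr = 0.60   # smaller card; needs checkpointing
rate_48gb_per_hr = 1.10   # larger card; no checkpointing needed
base_hours = 10.0         # wall-clock training time without checkpointing
recompute_penalty = 1.25  # ~25% slowdown from recomputation

cost_small = rate_24gb_per_hr * base_hours * recompute_penalty
cost_large = rate_48gb_per_hr * base_hours
print(f"checkpointed on 24 GB:   ${cost_small:.2f}")  # $7.50
print(f"uncheckpointed on 48 GB: ${cost_large:.2f}")  # $11.00
```

Under these assumed rates, eating the recompute penalty on the smaller card is still cheaper; the conclusion flips only when the price gap between the two cards is smaller than the slowdown.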
Fine-Tuning Without VRAM Stress
Our UK dedicated GPUs are sized so that enabling checkpointing is a choice, not a necessity.
Browse GPU Servers