DeepSpeed’s ZeRO (Zero Redundancy Optimizer) shards training state – gradients, optimiser states, and optionally the weights themselves – across multiple GPUs. On a dual-GPU dedicated server it is the right tool when a full fine-tune of a 13B+ model exceeds single-card VRAM.
ZeRO Stages
- Stage 1: optimiser state sharded across GPUs. Modest savings.
- Stage 2: optimiser state + gradients sharded. Bigger savings. Weights still replicated.
- Stage 3: everything sharded including weights. Biggest savings. More communication cost.
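A back-of-envelope sketch makes the stage differences concrete. Under mixed-precision Adam, model state costs roughly 2 bytes/param for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 optimiser state (master weights, momentum, variance); each stage divides one more of those terms by the GPU count. This ignores activations and framework overhead, so treat the numbers as lower bounds:

```python
# Rough per-GPU model-state memory under mixed-precision Adam.
# Byte counts: 2/param fp16 weights, 2/param fp16 grads,
# 12/param fp32 optimiser state. Activations are NOT included.
def zero_model_state_gb(n_params, n_gpus, stage):
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params
    if stage == 1:                       # only optimiser state sharded
        per_gpu = p + g + o / n_gpus
    elif stage == 2:                     # gradients + optimiser state sharded
        per_gpu = p + (g + o) / n_gpus
    else:                                # stage 3: weights sharded too
        per_gpu = (p + g + o) / n_gpus
    return per_gpu / 1e9                 # decimal GB for readability

for stage in (1, 2, 3):
    gb = zero_model_state_gb(13e9, 2, stage)
    print(f"13B params, 2 GPUs, ZeRO-{stage}: {gb:.0f} GB/GPU")
```

For a 13B model on two GPUs this gives roughly 130, 117, and 104 GB/GPU for stages 1–3: even full sharding does not fit model state in 24 GB cards, which is why CPU offload enters the picture below.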
Config
A typical ds_config.json for ZeRO-2:
```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "bf16": {"enabled": true},
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "train_batch_size": "auto"
}
```
CPU offload moves the optimiser state – the single largest component of model-state memory under Adam – into system RAM, typically cutting GPU memory use by a further 30–50% at the cost of PCIe traffic. That trade is worth making on tight dual-24 GB setups.
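One detail worth checking before launch: DeepSpeed enforces that `train_batch_size` equals the per-GPU micro batch times `gradient_accumulation_steps` times the number of GPUs. Setting it to `"auto"` defers the calculation to the launcher (e.g. the HF Trainer); the arithmetic below uses a hypothetical micro batch of 1 alongside the values from the config above:

```python
# DeepSpeed's invariant: train_batch_size ==
#   train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
micro_batch_per_gpu = 1   # hypothetical per-GPU micro batch (not in the config above)
grad_accum_steps = 8      # matches gradient_accumulation_steps in ds_config.json
world_size = 2            # two GPUs

train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)   # effective global batch per optimiser step
```

If you set all three batch fields explicitly and they do not satisfy this identity, DeepSpeed refuses to start, so leaving one on `"auto"` is the path of least resistance.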
Launch
```bash
deepspeed --num_gpus=2 train.py --deepspeed ds_config.json
```
Or with Accelerate:
```bash
accelerate launch --config_file accel_deepspeed.yaml train.py
```
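The Accelerate YAML can simply point at the same `ds_config.json`. A minimal sketch, assuming a single machine with two local GPUs (field names follow Accelerate's DeepSpeed plugin config; adjust to your setup):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: false   # set true only when using ZeRO-3
machine_rank: 0
num_machines: 1
num_processes: 2           # one process per GPU
mixed_precision: bf16
```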
Which Stage
| Situation | Stage |
|---|---|
| 7B model on 2× 24 GB | ZeRO-2 |
| 13B model on 2× 24 GB | ZeRO-2 + CPU offload |
| 13B model on 2× 32 GB | ZeRO-2 |
| 70B full fine-tune on 2× 96 GB | ZeRO-3 + CPU/NVMe offload |
| LoRA only | ZeRO-1 (minimal benefit) |
Dual-GPU Training Ready
Two-card UK dedicated servers with DeepSpeed preinstalled.
Browse GPU Servers. See the FSDP alternative and NCCL tuning.