DeepSpeed ZeRO on Dual GPU Servers

ZeRO-2 and ZeRO-3 let you train models that would not fit on a single GPU by sharding optimiser state, gradients and, with ZeRO-3, the weights across cards.

DeepSpeed’s ZeRO (Zero Redundancy Optimiser) shards training state – gradients, optimiser states, and optionally weights – across multiple GPUs. On a dual-GPU dedicated server it is the right tool when a full fine-tune of a 13B+ model exceeds single-card VRAM.

ZeRO Stages

  • Stage 1: optimiser state sharded across GPUs. Modest savings.
  • Stage 2: optimiser state + gradients sharded. Bigger savings. Weights still replicated.
  • Stage 3: everything sharded including weights. Biggest savings. More communication cost.
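
To put rough numbers on those savings, the sketch below estimates per-GPU memory for model states using the accounting from the ZeRO paper. It assumes mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 optimiser state (master weights, momentum, variance). The function name and the 1.3B example size are illustrative, and activations, temporary buffers and fragmentation are not counted, so real usage will be higher.

```python
def zero_model_state_gb(params: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumes mixed-precision Adam; activations and buffers are NOT included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    weights = 2.0 * params   # 16-bit weights
    grads = 2.0 * params     # 16-bit gradients
    optim = 12.0 * params    # fp32 master weights + Adam momentum + variance

    if stage >= 1:           # Stage 1 shards optimiser state
        optim /= num_gpus
    if stage >= 2:           # Stage 2 also shards gradients
        grads /= num_gpus
    if stage >= 3:           # Stage 3 also shards weights
        weights /= num_gpus

    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    # Illustrative only: a 1.3B-parameter model on two GPUs.
    for stage in range(4):
        gb = zero_model_state_gb(1.3e9, num_gpus=2, stage=stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Stage 0 here means plain data parallelism with everything replicated, included as a baseline for comparison.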

Config

A typical ds_config.json for ZeRO-2, written for the Hugging Face Trainer integration, which resolves the "auto" value:

{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "bf16": {"enabled": true},
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "train_batch_size": "auto"
}

CPU offload keeps the optimiser state in pinned host RAM and runs the optimiser step on the CPU. That shifts another 30-50% of memory off the GPU at the cost of PCIe traffic – useful on tight dual-24GB setups.
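
DeepSpeed itself does not resolve "auto"; an integration such as the Hugging Face Trainer fills it in. In a hand-written DeepSpeed script the batch size has to be an integer, and it must satisfy train_batch_size = micro-batch per GPU × gradient accumulation steps × number of GPUs. The sketch below builds an equivalent config with explicit values and an optional offload block. The helper name build_zero2_config and the micro-batch size of 1 are assumptions for illustration, not values from this guide.

```python
import json


def build_zero2_config(micro_batch_per_gpu: int,
                       grad_accum_steps: int,
                       num_gpus: int,
                       offload_optimizer: bool = True) -> dict:
    """Return a ZeRO-2 config dict with an explicit, consistent batch size."""
    zero = {
        "stage": 2,
        "allgather_partitions": True,
        "allgather_bucket_size": 200_000_000,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 200_000_000,
        "contiguous_gradients": True,
    }
    if offload_optimizer:
        # Keep optimiser state in pinned host RAM instead of VRAM.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}

    return {
        "zero_optimization": zero,
        "bf16": {"enabled": True},
        "gradient_accumulation_steps": grad_accum_steps,
        "gradient_clipping": 1.0,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        # Global batch = micro-batch x accumulation steps x GPUs.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * num_gpus,
    }


if __name__ == "__main__":
    # Assumed values: micro-batch of 1, 8 accumulation steps, 2 GPUs.
    config = build_zero2_config(1, 8, 2)
    print(json.dumps(config, indent=2))
```

Redirecting the printed JSON to ds_config.json gives a file that can be passed to the launcher in the next section.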

Launch

deepspeed --num_gpus=2 train.py --deepspeed ds_config.json

Or with Accelerate:

accelerate launch --config_file accel_deepspeed.yaml train.py
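
Both launchers start one process per GPU. The deepspeed launcher passes a --local_rank argument to each process and exports LOCAL_RANK and WORLD_SIZE in the environment; a train.py that does not rely on the Hugging Face Trainer has to accept these itself. A minimal sketch of that argument handling follows, assuming a hand-written script. The function name parse_launch_args is hypothetical.

```python
import argparse
import os


def parse_launch_args(argv=None):
    """Parse the arguments a launcher hands to each training process."""
    parser = argparse.ArgumentParser(description="Training entry point")
    parser.add_argument("--deepspeed", type=str, default=None,
                        help="Path to the DeepSpeed JSON config")
    # Fall back to the environment when --local_rank is not passed.
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)),
                        help="Index of this process's GPU on the node")
    # Ignore any extra arguments this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    # Number of processes in the job; 1 when run without a launcher.
    args.world_size = int(os.environ.get("WORLD_SIZE", 1))
    return args


if __name__ == "__main__":
    args = parse_launch_args()
    print(f"local_rank={args.local_rank} world_size={args.world_size} "
          f"config={args.deepspeed}")
```

Each process would then bind to the GPU given by local_rank before building the model and handing it to DeepSpeed.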

Which Stage

Situation                         Stage
7B model on 2× 24 GB              ZeRO-2
13B model on 2× 24 GB             ZeRO-2 + CPU offload
13B model on 2× 32 GB             ZeRO-2
70B full fine-tune on 2× 96 GB    ZeRO-3
LoRA only                         ZeRO-1 (minimal benefit)

See FSDP alternative and NCCL tuning.

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.