DeepSpeed’s ZeRO (Zero Redundancy Optimizer) shards training state – gradients, optimiser states, and optionally the weights themselves – across multiple GPUs. On a dual-GPU dedicated server it is the right tool when a full fine-tune of a 13B+ model exceeds single-card VRAM.
ZeRO Stages
- Stage 1: optimiser state sharded across GPUs. Modest savings.
- Stage 2: optimiser state + gradients sharded. Bigger savings. Weights still replicated.
- Stage 3: everything sharded including weights. Biggest savings. More communication cost.
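A back-of-envelope sketch makes the stage differences concrete. Under mixed-precision Adam, model state costs roughly 2 bytes/param for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 optimiser state (master weights, momentum, variance); each stage divides one more of those terms by the GPU count. This ignores activations and framework overhead, so treat the numbers as lower bounds:

```python
# Rough per-GPU model-state memory under mixed-precision Adam.
# Byte counts: 2/param fp16 weights, 2/param fp16 grads,
# 12/param fp32 optimiser state. Activations are NOT included.
def zero_model_state_gb(n_params, n_gpus, stage):
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params
    if stage == 1:                       # only optimiser state sharded
        per_gpu = p + g + o / n_gpus
    elif stage == 2:                     # gradients + optimiser state sharded
        per_gpu = p + (g + o) / n_gpus
    else:                                # stage 3: weights sharded too
        per_gpu = (p + g + o) / n_gpus
    return per_gpu / 1e9                 # decimal GB for readability

for stage in (1, 2, 3):
    gb = zero_model_state_gb(13e9, 2, stage)
    print(f"13B params, 2 GPUs, ZeRO-{stage}: {gb:.0f} GB/GPU")
```

For a 13B model on two GPUs this gives roughly 130, 117, and 104 GB/GPU for stages 1–3: even full sharding does not fit model state in 24 GB cards, which is why CPU offload enters the picture below.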
Config
A typical ds_config.json for ZeRO-2:
```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "bf16": {"enabled": true},
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "train_batch_size": "auto"
}
```
CPU offload moves the optimiser state – the single largest component of model-state memory under Adam – into system RAM, typically cutting GPU memory use by a further 30–50% at the cost of PCIe traffic. That trade is worth making on tight dual-24 GB setups.
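One detail worth checking before launch: DeepSpeed enforces that `train_batch_size` equals the per-GPU micro batch times `gradient_accumulation_steps` times the number of GPUs. Setting it to `"auto"` defers the calculation to the launcher (e.g. the HF Trainer); the arithmetic below uses a hypothetical micro batch of 1 alongside the values from the config above:

```python
# DeepSpeed's invariant: train_batch_size ==
#   train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
micro_batch_per_gpu = 1   # hypothetical per-GPU micro batch (not in the config above)
grad_accum_steps = 8      # matches gradient_accumulation_steps in ds_config.json
world_size = 2            # two GPUs

train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)   # effective global batch per optimiser step
```

If you set all three batch fields explicitly and they do not satisfy this identity, DeepSpeed refuses to start, so leaving one on `"auto"` is the path of least resistance.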
Launch
```bash
deepspeed --num_gpus=2 train.py --deepspeed ds_config.json
```
Or with Accelerate:
```bash
accelerate launch --config_file accel_deepspeed.yaml train.py
```
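The Accelerate YAML can simply point at the same `ds_config.json`. A minimal sketch, assuming a single machine with two local GPUs (field names follow Accelerate's DeepSpeed plugin config; adjust to your setup):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: false   # set true only when using ZeRO-3
machine_rank: 0
num_machines: 1
num_processes: 2           # one process per GPU
mixed_precision: bf16
```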
Which Stage
| Situation | Stage |
|---|---|
| 7B model on 2× 24 GB | ZeRO-2 |
| 13B model on 2× 24 GB | ZeRO-2 + CPU offload |
| 13B model on 2× 32 GB | ZeRO-2 |
| 70B full fine-tune on 2× 96 GB | ZeRO-3 + CPU/NVMe offload |
| LoRA only | ZeRO-1 (minimal benefit) |
Dual-GPU Training Ready
Two-card UK dedicated servers with DeepSpeed preinstalled.
Browse GPU Servers. See the FSDP alternative and NCCL tuning.