FSDP (Fully Sharded Data Parallel) is PyTorch’s native equivalent to DeepSpeed ZeRO-3. On our dedicated GPU hosting it is often the simpler choice: no external config file, first-party PyTorch, and solid support in Accelerate and TRL.
FSDP vs DeepSpeed
| Concern | FSDP | DeepSpeed ZeRO |
|---|---|---|
| Integration | Native PyTorch | External library |
| Config complexity | Lower | Higher (JSON config) |
| CPU offload | Supported | Supported, more tunable |
| Ecosystem | Growing | Established |
| Llama training | Works well | Works well |
Configuration
With Accelerate, generate an FSDP config:
accelerate config
Select FSDP, choose a sharding strategy (FULL_SHARD is the ZeRO-3 equivalent), auto-wrap by transformer block, and BF16 mixed precision. The generated file looks like:
distributed_type: FSDP
mixed_precision: bf16
num_processes: 2
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_offload_params: false
  fsdp_state_dict_type: SHARDED_STATE_DICT
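With the config in place, launching is a one-liner. A minimal sketch, assuming the config was saved as fsdp_config.yaml and your training script is named train.py (both names are placeholders):

```shell
# Launch 2 processes (one per GPU) using the FSDP config above.
# --config_file overrides the default Accelerate config location.
accelerate launch --config_file fsdp_config.yaml train.py
```

If you accept the default config path that `accelerate config` writes to, the `--config_file` flag can be omitted.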
Auto-Wrap
FSDP wraps the units that it shards. For Transformers the right unit is one transformer block. Set fsdp_transformer_layer_cls_to_wrap to the decoder layer class name – LlamaDecoderLayer for Llama, MistralDecoderLayer for Mistral, Qwen2DecoderLayer for Qwen2. With the wrong wrap class FSDP shards too coarsely (e.g. the whole model as one unit) or too finely, and throughput suffers.
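The architecture-to-wrap-class mapping can be kept in one place so a wrong class name fails fast instead of silently degrading throughput. A hypothetical helper (the function and dict names are our own; the class names come from the transformers model implementations):

```python
# Hypothetical lookup: Hugging Face model_type -> decoder layer class name
# to use for fsdp_transformer_layer_cls_to_wrap. Extend as you add models.
WRAP_CLASSES = {
    "llama": "LlamaDecoderLayer",
    "mistral": "MistralDecoderLayer",
    "qwen2": "Qwen2DecoderLayer",
}

def wrap_class_for(model_type: str) -> str:
    """Return the FSDP wrap class name for a model type, or raise."""
    try:
        return WRAP_CLASSES[model_type.lower()]
    except KeyError:
        raise ValueError(f"no known decoder layer class for {model_type!r}")

print(wrap_class_for("llama"))  # LlamaDecoderLayer
```

In practice `model_type` is available on any loaded config as `AutoConfig.from_pretrained(name).model_type`, so the lookup can be driven from the checkpoint itself.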
Which to Choose
Start with FSDP unless you need specific DeepSpeed features (ZeRO-Infinity disk offload, its custom optimisers). FSDP is first-party, simpler to configure, and the PyTorch team is adding features steadily. For multi-node training DeepSpeed still offers more tuning knobs.
FSDP-Ready Dual-GPU Servers
UK dedicated multi-GPU hosting with PyTorch and Accelerate preconfigured.
Browse GPU Servers. See DeepSpeed ZeRO and full fine-tune on 6000 Pro.