Stable Diffusion XL at 1024×1024 is the standard image-gen workload. The RTX 5060 Ti 16GB on our hosting has plenty of headroom. Numbers below.
Setup
- Diffusers 0.30, PyTorch 2.5, xFormers 0.0.28
- Model: stabilityai/stable-diffusion-xl-base-1.0
- Resolution: 1024×1024
- Sampler: DPM++ 2M Karras
- Backends compared: Diffusers FP16, ComfyUI, Automatic1111
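The Diffusers FP16 baseline in the tables below can be reproduced with a short script. This is a minimal sketch: the prompt is illustrative, and the scheduler swap is how Diffusers expresses "DPM++ 2M Karras" (DPMSolverMultistep with Karras sigmas).

```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

# Load SDXL base in FP16 -- matches the "Diffusers FP16" rows below
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

# DPM++ 2M Karras = multistep DPM-Solver with Karras sigma schedule
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe(
    "a lighthouse at dusk, volumetric light",  # illustrative prompt
    num_inference_steps=30,
    height=1024,
    width=1024,
).images[0]
image.save("out.png")
```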
Batch 1, 1024×1024
| Backend | Steps | Time | VRAM peak |
|---|---|---|---|
| Diffusers FP16 | 30 | 3.8 s | 9.2 GB |
| Diffusers + SDPA | 30 | 3.4 s | 9.0 GB |
| Diffusers + torch.compile | 30 | 2.9 s | 9.4 GB |
| ComfyUI (default) | 30 | 3.6 s | 8.9 GB |
| Automatic1111 (xFormers) | 30 | 4.1 s | 9.5 GB |
Expect roughly 3-4 seconds per 1024×1024 image at 30 steps. At 20 steps (fine for many styles) that drops to ~2.5 s.
Batch 4, 1024×1024
- Diffusers FP16: 12.2 s for 4 images (3.05 s / image)
- VRAM peak: 13.8 GB
- Batch 6: OOM at 1024×1024; workable at 896×896
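Batching in Diffusers is a single argument, assuming a `pipe` loaded as in the Setup section. This sketch produced the batch-4 numbers above; the prompt is illustrative.

```python
# Assumes `pipe` is an FP16 StableDiffusionXLPipeline on CUDA (see Setup).
out = pipe(
    "a lighthouse at dusk",        # illustrative prompt
    num_inference_steps=30,
    height=1024,
    width=1024,
    num_images_per_prompt=4,       # batch 4: ~13.8 GB peak on the 16 GB card
)
for i, img in enumerate(out.images):
    img.save(f"batch_{i}.png")
```

Batch 6 at 1024×1024 over-commits the 16 GB card; dropping to 896×896 brings the activations back within budget.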
SDXL-Turbo (1-4 step)
| Steps | Time | Images/min |
|---|---|---|
| 1 | 0.28 s | 214 |
| 4 | 0.85 s | 70 |
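The images/min column is just 60 divided by the per-image latency, rounded down. A quick sanity check:

```python
def images_per_minute(seconds_per_image: float) -> int:
    # Throughput, rounded down to whole images per minute
    return int(60 / seconds_per_image)

print(images_per_minute(0.28))  # 1-step Turbo -> 214
print(images_per_minute(0.85))  # 4-step Turbo -> 70
```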
Turbo is the option for interactive creative tools, where sub-second generation matters more than peak quality.
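Running Turbo differs from base SDXL in two ways: step count and guidance. A minimal sketch (illustrative prompt; Turbo defaults to 512×512 output):

```python
import torch
from diffusers import AutoPipelineForText2Image

# SDXL-Turbo: distilled for 1-4 steps
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe(
    "a lighthouse at dusk",   # illustrative prompt
    num_inference_steps=1,
    guidance_scale=0.0,       # Turbo was trained without CFG; leave it off
).images[0]
```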
Optimisation Knobs
- torch.compile – ~20% speedup (2.9 s vs 3.4 s above), with a one-off ~60 s compile cost on the first run
- SDPA vs xFormers – native PyTorch SDPA is now faster than xFormers on Blackwell, so there is little reason to install xFormers for SDXL
- VAE tiling – needed at very large resolutions, where decoding the full latent in one pass runs out of VRAM
- FP8 UNet – experimental; ~25% faster at a minor quality cost
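The first two software knobs are one line each in Diffusers, assuming the `pipe` from the Setup section:

```python
import torch

# Assumes `pipe` is the SDXL pipeline from Setup.

# torch.compile: ~20% faster steady-state; first call pays ~60 s of compile
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# VAE tiling: decode the latent in tiles so huge resolutions fit in VRAM
pipe.enable_vae_tiling()
```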
SDXL on Blackwell 16GB
Expect 3-4 s/image at 1024px and batches up to 4 on the 16 GB card, on UK dedicated hosting.
Order the RTX 5060 Ti 16GB.
See also: FLUX.1 Schnell, SD 1.5 benchmark, SD setup guide, ComfyUI setup, image studio.