Symptom: Images Taking 30+ Seconds Per Generation
Your GPU server has a capable NVIDIA card, but Stable Diffusion takes 30 seconds or more to produce a single 512×512 image. The GPU shows activity in nvidia-smi, but utilisation bounces between 40% and 70% instead of staying pegged at 95%+. On a modern GPU, SD 1.5 should generate in 2-5 seconds and SDXL in 5-15 seconds at standard resolutions.
Diagnose the Bottleneck
import torch
import time
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
# Warm up
pipe("warmup", num_inference_steps=1)
torch.cuda.synchronize()
# Benchmark
start = time.time()
pipe("a photo of a cat", num_inference_steps=30)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"Generation time: {elapsed:.2f}s")
print(f"Per step: {elapsed/30*1000:.0f}ms")
If per-step time exceeds 100ms on an RTX 3090 for SD 1.5, something is misconfigured.
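To make the threshold check repeatable across runs, you can wrap it in a small helper. This is a hypothetical utility, not part of diffusers; the 100ms budget is the rough RTX 3090 / SD 1.5 figure quoted above.

```python
def diagnose_per_step(elapsed_s: float, steps: int, budget_ms: float = 100.0) -> str:
    """Classify a benchmark run against a per-step latency budget.

    budget_ms defaults to the rough RTX 3090 / SD 1.5 threshold above.
    """
    per_step_ms = elapsed_s / steps * 1000
    if per_step_ms <= budget_ms:
        return f"OK: {per_step_ms:.0f}ms/step is within budget"
    return f"SLOW: {per_step_ms:.0f}ms/step exceeds {budget_ms:.0f}ms -- apply the fixes below"

print(diagnose_per_step(9.0, 30))   # 300ms/step on a 30-step run: flagged as slow
print(diagnose_per_step(2.4, 30))   # 80ms/step: within budget
```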
Fix 1: Ensure FP16 Precision
Running in FP32 roughly halves throughput on consumer GPUs, which typically execute FP16 at twice the FP32 rate (and FP16 also halves memory traffic):
# Check current dtype
print(pipe.unet.dtype) # Should be torch.float16
# If FP32, reload correctly
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
variant="fp16"
).to("cuda")
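Checking only `pipe.unet.dtype` can miss a VAE or text encoder that silently stayed in FP32. The sketch below is a hypothetical audit helper, not a diffusers API; pass it a dict of the pipeline's modules and the dtype you want to find.

```python
def components_with_dtype(components, dtype):
    """Return names of modules whose parameters include the given dtype.

    Hypothetical helper: call with e.g.
    {"unet": pipe.unet, "vae": pipe.vae, "text_encoder": pipe.text_encoder}
    and torch.float32 to find components that stayed in FP32.
    """
    return [name for name, module in components.items()
            if any(p.dtype == dtype for p in module.parameters())]

# Usage on a loaded pipeline:
# import torch
# leftovers = components_with_dtype(
#     {"unet": pipe.unet, "vae": pipe.vae, "text_encoder": pipe.text_encoder},
#     torch.float32)
# print(leftovers)  # [] when everything is FP16
```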
Fix 2: Enable xformers or SDPA
Memory-efficient attention does not just save VRAM; it also accelerates generation by 20-40%:
# Option A: xformers (recommended for maximum speed)
# Install first (shell): pip install xformers
pipe.enable_xformers_memory_efficient_attention()
# Option B: PyTorch Scaled Dot Product Attention
# Automatic in PyTorch 2.0+ with diffusers, but verify:
import torch
print(f"SDPA available: {hasattr(torch.nn.functional, 'scaled_dot_product_attention')}")
# Force SDPA backend selection
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
Our PyTorch installation guide covers matching xformers to your CUDA version.
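To confirm which attention implementation diffusers actually attached, inspect `pipe.unet.attn_processors` (a dict mapping layer names to processor objects): `AttnProcessor2_0` indicates SDPA, `XFormersAttnProcessor` indicates xformers. A small hypothetical helper to summarise it:

```python
def attention_backends(attn_processors):
    """Summarise the attention processor classes attached to a UNet.

    Pass pipe.unet.attn_processors. Expect {"AttnProcessor2_0"} for SDPA,
    or {"XFormersAttnProcessor"} after enabling xformers.
    """
    return {type(proc).__name__ for proc in attn_processors.values()}

# Usage: print(attention_backends(pipe.unet.attn_processors))
```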
Fix 3: torch.compile for Persistent Speedup
PyTorch 2.0’s compile function generates optimised CUDA kernels for your specific model and GPU:
# Compile the UNet (main compute bottleneck)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# First generation is slow (compilation), subsequent are fast
pipe("warmup", num_inference_steps=1) # Triggers compilation
# Now benchmark
start = time.time()
pipe("a photo of a mountain", num_inference_steps=30)
torch.cuda.synchronize()
print(f"Compiled generation: {time.time()-start:.2f}s")
Expect 10-30% speedup after compilation. The initial compilation takes 1-3 minutes but the cached kernels persist for the session.
Fix 4: Use Faster Schedulers With Fewer Steps
Modern schedulers produce quality output in far fewer steps than the original DDPM:
from diffusers import LCMScheduler, DPMSolverMultistepScheduler
# DPM++ 2M Karras: good quality at 20 steps
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
pipe.scheduler.config,
algorithm_type="dpmsolver++",
use_karras_sigmas=True
)
pipe("landscape", num_inference_steps=20)
# LCM: usable output at 4-8 steps (with LCM-LoRA)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
pipe("landscape", num_inference_steps=6, guidance_scale=1.5)
Dropping from 50 to 20 steps with DPM++ gives 2.5x speedup with negligible quality loss. LCM at 6 steps provides near-realtime generation.
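The speedup from cutting steps is nearly linear, because each sampler step is one UNet forward pass and the fixed costs (text encoding, VAE decode) are comparatively small. A rough cost model with illustrative, not measured, numbers:

```python
def estimated_time_s(steps, per_step_ms=80.0, overhead_ms=200.0):
    """Estimate generation time: fixed overhead plus one UNet pass per step.

    per_step_ms and overhead_ms are illustrative placeholder values;
    substitute the per-step figure from your own benchmark.
    """
    return (overhead_ms + steps * per_step_ms) / 1000

for steps in (50, 20, 6):
    print(f"{steps:>2} steps: ~{estimated_time_s(steps):.2f}s")
```

With these numbers, 50 steps lands around 4.2s against 1.8s at 20 steps, which is where the roughly 2.5x figure comes from.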
Production Speed Targets
On an RTX 5090 with all optimisations: SD 1.5 at 512×512 in 1-2 seconds, SDXL at 1024×1024 in 4-8 seconds. On an RTX 6000 Pro 96 GB: similar speeds with headroom for batching. For Stable Diffusion hosting at scale, ComfyUI pipelines with model caching outperform A1111 for batch workflows. Check the benchmarks for GPU-specific timings, our CUDA guide for driver optimisation, and the Docker GPU guide for containerised deployments. The tutorials section covers PyTorch compilation and profiling in depth.
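Batching is where cached ComfyUI-style pipelines pay off: the per-call fixed cost (prompt encoding, scheduler setup) is paid once per batch rather than once per image. A sketch of the throughput arithmetic, again with illustrative numbers:

```python
def images_per_minute(batch_size, per_image_s=1.5, overhead_s=0.3):
    """Throughput for one pipeline call producing `batch_size` images.

    Assumes per-image compute scales linearly and the fixed overhead is
    paid once per call. Numbers are illustrative, not measured.
    """
    call_time = overhead_s + batch_size * per_image_s
    return 60 * batch_size / call_time

for b in (1, 4, 8):
    print(f"batch={b}: {images_per_minute(b):.1f} img/min")
```

In practice the per-image time also drops at larger batches until VRAM or compute saturates, so real gains tend to exceed this model.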
Fast GPU Servers for Image Generation
GigaGPU RTX 5090 and RTX 6000 Pro servers deliver sub-5-second Stable Diffusion generation out of the box.