Symptom: VAE Decode Fails or Produces Garbage
Your Stable Diffusion pipeline generates latents successfully, but the VAE decode step crashes or outputs corrupted images. The errors take several forms on your GPU server:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.FloatTensor) should be the same
Or the pipeline completes without errors but the image has colour banding, washed-out patches, or a magenta tint. All of these point to VAE configuration problems.
Diagnose the VAE Issue
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Check VAE dtype and device
print(f"VAE dtype: {pipe.vae.dtype}")
print(f"VAE device: {pipe.vae.device}")

# Generate latents and test decode separately
latents = pipe("test", output_type="latent", num_inference_steps=5).images
print(f"Latent dtype: {latents.dtype}, shape: {latents.shape}")

# Manual VAE decode to isolate the error
with torch.no_grad():
    decoded = pipe.vae.decode(latents / pipe.vae.config.scaling_factor)
print(f"Decoded: min={decoded.sample.min():.3f}, max={decoded.sample.max():.3f}")
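If the decode runs but the output looks wrong, it helps to check the latents themselves before blaming the VAE. The helper below is a minimal sketch (the function name `latent_health` is our own, not a diffusers API) that summarises dtype, device, and any NaN/Inf values in a latent batch; you would pass it the `latents` tensor from the snippet above.

```python
import torch

def latent_health(latents: torch.Tensor) -> dict:
    """Summarise a latent batch: dtype, device, and any NaN/Inf values."""
    return {
        "dtype": str(latents.dtype),
        "device": str(latents.device),
        "nan": int(torch.isnan(latents).sum()),
        "inf": int(torch.isinf(latents).sum()),
        "min": float(latents.min()),
        "max": float(latents.max()),
    }

# Demonstrated on a dummy batch shaped like SD 1.5 latents (4 channels, 64x64):
report = latent_health(torch.randn(1, 4, 64, 64, dtype=torch.float16))
print(report)
```

A non-zero `nan` or `inf` count here means the problem is upstream of the VAE (usually FP16 overflow in the UNet); zeros point the finger at the decode step.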
Fix 1: Resolve dtype Mismatch
The most common VAE error is a precision mismatch between the UNet (FP16) and VAE (FP32):
# Option A: load both UNet and VAE in FP16
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# If the VAE has reverted to FP32, force it back explicitly
pipe.vae = pipe.vae.to(dtype=torch.float16)

# Option B: keep the VAE in FP32 for better decode quality
# (costs more VRAM; newer diffusers versions cast the latents
# to the VAE's dtype automatically before decoding)
pipe.vae = pipe.vae.to(dtype=torch.float32)
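If you are on an older diffusers version, or calling `vae.decode` yourself as in the diagnostic snippet, you have to do the cast manually. A minimal sketch of that alignment step (the helper name `align_for_decode` is ours, not a library function):

```python
import torch

def align_for_decode(latents: torch.Tensor, vae_dtype: torch.dtype,
                     scaling_factor: float) -> torch.Tensor:
    """Cast latents to the VAE's dtype and undo the latent scaling factor."""
    return latents.to(dtype=vae_dtype) / scaling_factor

# FP16 latents, FP32 VAE; 0.18215 is SD 1.x's scaling factor
x = torch.randn(1, 4, 64, 64, dtype=torch.float16)
y = align_for_decode(x, torch.float32, 0.18215)
print(y.dtype)  # torch.float32
```

In the pipeline you would call this as `pipe.vae.decode(align_for_decode(latents, pipe.vae.dtype, pipe.vae.config.scaling_factor))`, which fixes both error messages from the top of this article at once.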
Fix 2: Use the FP16-Fixed VAE for SDXL
SDXL’s default VAE is notorious for overflowing to NaN in FP16, which typically surfaces as an all-black or corrupted output image. A community-maintained fix exists:
from diffusers import AutoencoderKL, StableDiffusionXLPipeline
import torch

# Load the FP16-safe SDXL VAE
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix",
    torch_dtype=torch.float16
).to("cuda")

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16
).to("cuda")
Always use this VAE when running SDXL in FP16. The original VAE only works reliably in FP32, which doubles memory usage.
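If you cannot swap the VAE (for example, a bundled checkpoint), a defensive fallback is to detect NaNs after decode and retry in FP32. This is a sketch, not a diffusers API: `decode_or_upcast` is our own name, and `vae` is assumed to behave like a diffusers `AutoencoderKL` (exposing `.to(dtype=...)` and `.decode(x).sample`).

```python
import torch

def decode_or_upcast(vae, latents: torch.Tensor, scaling_factor: float) -> torch.Tensor:
    """Try decoding as-is; if the output contains NaNs, upcast the VAE
    and latents to FP32 and decode again."""
    with torch.no_grad():
        image = vae.decode(latents / scaling_factor).sample
        if torch.isnan(image).any():
            # FP16 overflow inside the VAE: retry the decode in FP32
            vae = vae.to(dtype=torch.float32)
            image = vae.decode((latents / scaling_factor).to(torch.float32)).sample
    return image
```

Usage with SDXL would look like `decode_or_upcast(pipe.vae, latents, pipe.vae.config.scaling_factor)` (SDXL's scaling factor is 0.13025). The FP32 retry doubles VAE memory for that one call, which is still far cheaper than running the whole pipeline in FP32.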
Fix 3: Load Custom VAE Correctly
When using a custom or third-party VAE (common with A1111 checkpoints), the loading path matters:
# From a standalone VAE file
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_single_file(
    "path/to/vae-ft-mse-840000.safetensors",
    torch_dtype=torch.float16
).to("cuda")

# Attach to an existing pipeline
pipe.vae = vae

# From a checkpoint that bundles a different VAE
pipe = StableDiffusionPipeline.from_single_file(
    "path/to/model.safetensors",
    torch_dtype=torch.float16,
    load_safety_checker=False
).to("cuda")

# Verify the loaded VAE config
print(pipe.vae.config)
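Reading the raw config dump is error-prone, so a small sanity check can catch the common mismatches automatically. The function below is a hypothetical helper of our own (not part of diffusers); it assumes the SD 1.x / SDXL family, where the VAE has 4 latent channels and a scaling factor of 0.18215 or 0.13025. You would call it as `check_vae_config(dict(pipe.vae.config))`.

```python
def check_vae_config(config: dict) -> list[str]:
    """Return a list of problems with a loaded VAE config (empty = looks OK).

    Assumes an SD 1.x / SDXL-style VAE; other model families
    (e.g. SD3) legitimately use different values.
    """
    problems = []
    if config.get("latent_channels") != 4:
        problems.append(f"unexpected latent_channels: {config.get('latent_channels')}")
    if config.get("scaling_factor") not in (0.18215, 0.13025):
        problems.append(f"unusual scaling_factor: {config.get('scaling_factor')}")
    return problems

# A healthy SD 1.5 VAE config passes cleanly:
print(check_vae_config({"latent_channels": 4, "scaling_factor": 0.18215}))  # []
```

A wrong `scaling_factor` is a classic cause of washed-out or over-saturated decodes when mixing a third-party VAE with the wrong pipeline.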
Fix 4: Enable VAE Tiling for High Resolution
At resolutions above 1024×1024, the VAE can run out of memory even when the UNet fits:
# Process high-res images in tiles
pipe.enable_vae_tiling()
pipe.enable_vae_slicing()
# Generate at high resolution
image = pipe("landscape", height=2048, width=2048).images[0]
VAE tiling splits the decode into patches, each fitting comfortably in VRAM. For Stable Diffusion hosting with complex VAE workflows, ComfyUI lets you wire VAE components individually. Check our PyTorch guide for compatible PyTorch versions, the CUDA installation guide for driver dependencies, and the benchmarks for VAE decode performance across GPUs. The tutorials section covers more debugging techniques.
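To build intuition for why tiling helps, the arithmetic below sketches roughly how many tiles a tiled decode breaks a large image into. The tile size, overlap, and the function itself are illustrative assumptions; diffusers chooses its own internal tile size from the VAE's `sample_size` and blends the overlaps.

```python
import math

def decode_tiles(height: int, width: int, tile: int = 512, overlap: int = 64) -> int:
    """Rough tile count for a tiled VAE decode at a given output resolution.

    Illustrative only: assumes square tiles of `tile` px with `overlap` px
    of blending between neighbours (diffusers' real numbers differ).
    """
    stride = tile - overlap
    ny = math.ceil(max(height - overlap, 1) / stride)
    nx = math.ceil(max(width - overlap, 1) / stride)
    return ny * nx

print(decode_tiles(2048, 2048))  # 25
print(decode_tiles(512, 512))    # 1
```

Each tile's activations fit in VRAM on their own, so peak memory stays roughly constant as resolution grows, at the cost of decoding 25 small patches instead of one huge one at 2048×2048.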