
Stable Diffusion VAE Decode Error Fix

Fix VAE decode errors in Stable Diffusion including NaN outputs, checkpoint mismatches, FP16 precision issues, and custom VAE loading problems on GPU servers.

Symptom: VAE Decode Fails or Produces Garbage

Your Stable Diffusion pipeline generates latents successfully, but the VAE decode step crashes or outputs corrupted images. The errors take several forms on your GPU server:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.FloatTensor) should be the same

Or the pipeline completes without errors but the image has colour banding, washed-out patches, or a magenta tint. All of these point to VAE configuration problems.

Diagnose the VAE Issue

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Check VAE dtype and device
print(f"VAE dtype: {pipe.vae.dtype}")
print(f"VAE device: {pipe.vae.device}")

# Generate latents and test decode separately
latents = pipe("test", output_type="latent", num_inference_steps=5).images
print(f"Latent dtype: {latents.dtype}, shape: {latents.shape}")

# Manual VAE decode to isolate the error
with torch.no_grad():
    decoded = pipe.vae.decode(latents / pipe.vae.config.scaling_factor)
    print(f"Decoded: min={decoded.sample.min():.3f}, max={decoded.sample.max():.3f}")
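If the decode runs but the output is black or magenta, the latents themselves may contain NaN or Inf values. A quick health check catches this before the decode (the `check_tensor_health` helper below is our own, not a diffusers API):

```python
import torch

def check_tensor_health(t, name="tensor"):
    # Count NaN/Inf values that would corrupt the VAE decode downstream
    nan_count = torch.isnan(t).sum().item()
    inf_count = torch.isinf(t).sum().item()
    print(f"{name}: dtype={t.dtype}, device={t.device}, "
          f"NaN={nan_count}, Inf={inf_count}")
    return nan_count == 0 and inf_count == 0

# Synthetic latent batch with the SD 1.5 latent shape at 512x512
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16)
print(check_tensor_health(latents, "latents"))
```

Run it on the latents from the diagnostic script above; any non-zero NaN or Inf count means the problem is upstream of the VAE.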

Fix 1: Resolve dtype Mismatch

The most common VAE error is a precision mismatch between the UNet (FP16) and VAE (FP32):

# Force both to the same precision
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# If VAE reverts to FP32, force it explicitly
pipe.vae = pipe.vae.to(dtype=torch.float16)

# Or keep VAE in FP32 and cast latents before decode
# (better quality, more VRAM)
pipe.vae = pipe.vae.to(dtype=torch.float32)
# The pipeline handles the cast automatically in newer diffusers versions
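On older diffusers versions that do not auto-cast, you have to upcast the latents yourself. A minimal sketch of the pattern (the `decode_with_fp32_vae` helper is ours, assuming a standard pipeline with a `.vae` attribute and the usual `scaling_factor` config entry):

```python
import torch

def decode_with_fp32_vae(pipe, latents):
    # Match the latents' dtype to the VAE's (e.g. float16 -> float32)
    latents = latents.to(dtype=pipe.vae.dtype)
    with torch.no_grad():
        # Undo the scaling applied at encode time, then decode
        image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    return image
```

This mirrors what recent pipelines do internally; after upgrading diffusers, the explicit cast becomes unnecessary.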

Fix 2: Use the FP16-Fixed VAE for SDXL

SDXL’s default VAE is notorious for producing NaN values in FP16. A community-maintained fix exists:

from diffusers import AutoencoderKL, StableDiffusionXLPipeline
import torch

# Load the FP16-safe SDXL VAE
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix",
    torch_dtype=torch.float16
).to("cuda")

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16
).to("cuda")

Always use this VAE when running SDXL in FP16. The original VAE only works reliably in FP32, which doubles memory usage.
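A symptom-level check for the NaN problem: an all-NaN decode usually surfaces as a solid black image after post-processing clamps the values, so it is easier to catch at the tensor stage (the `has_nan_output` helper is illustrative, not part of diffusers):

```python
import torch

def has_nan_output(decoded):
    # True if any value in the decoded batch is NaN -- the classic
    # signature of the stock SDXL VAE overflowing in FP16
    return bool(torch.isnan(decoded).any())

# A deliberately broken batch to show the detection
broken = torch.full((1, 3, 64, 64), float("nan"), dtype=torch.float16)
print(has_nan_output(broken))  # True: swap in the fp16-fix VAE
```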

Fix 3: Load Custom VAE Correctly

When using a custom or third-party VAE (common with A1111 checkpoints), the loading path matters:

# From a standalone VAE file
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_single_file(
    "path/to/vae-ft-mse-840000.safetensors",
    torch_dtype=torch.float16
).to("cuda")

# Attach to existing pipeline
pipe.vae = vae

# From a checkpoint that bundles a different VAE
pipe = StableDiffusionPipeline.from_single_file(
    "path/to/model.safetensors",
    torch_dtype=torch.float16,
    load_safety_checker=False
).to("cuda")

# Verify the loaded VAE config
print(pipe.vae.config)
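One config field worth checking is `scaling_factor`: SD 1.x/2.x VAEs use 0.18215 while SDXL VAEs use 0.13025, and attaching a VAE from the wrong family gives washed-out or over-contrasted images. A quick guard (the helper and family names are our own convention):

```python
# Known VAE scaling factors per model family
EXPECTED_SCALING = {"sd15": 0.18215, "sdxl": 0.13025}

def vae_matches_family(scaling_factor, family):
    # Compare the loaded VAE's scaling_factor against the expected value
    return abs(scaling_factor - EXPECTED_SCALING[family]) < 1e-6

# e.g. vae_matches_family(pipe.vae.config.scaling_factor, "sd15")
print(vae_matches_family(0.18215, "sd15"))  # True
```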

Fix 4: Enable VAE Tiling for High Resolution

At resolutions above 1024×1024, the VAE can run out of memory even when the UNet fits:

# Process high-res images in tiles
pipe.enable_vae_tiling()
pipe.enable_vae_slicing()

# Generate at high resolution
image = pipe("landscape", height=2048, width=2048).images[0]

VAE tiling splits the decode into patches, each small enough to fit comfortably in VRAM.

For Stable Diffusion hosting with complex VAE workflows, ComfyUI lets you wire VAE components individually. Check our PyTorch guide for compatible PyTorch versions, the CUDA installation guide for driver dependencies, and the benchmarks for VAE decode performance across GPUs. The tutorials section covers more debugging techniques.
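Conceptually, tiling trades one big decode for many small ones. A toy sketch of the idea on plain tensors (real VAE tiling in diffusers also blends overlapping tiles to hide seams, which this omits):

```python
import torch

def tiled_apply(x, fn, tile=32):
    # Apply fn to each spatial tile independently, so peak memory
    # scales with the tile size rather than the full image
    _, _, h, w = x.shape
    out = torch.empty_like(x)
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            out[:, :, i:i + tile, j:j + tile] = fn(x[:, :, i:i + tile, j:j + tile])
    return out

x = torch.randn(1, 4, 128, 128)
y = tiled_apply(x, lambda t: t * 2.0)
print(torch.allclose(y, x * 2.0))  # True
```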

GPU Servers for Stable Diffusion

GigaGPU dedicated servers with pre-installed CUDA and high-VRAM GPUs for reliable image generation pipelines.

Browse GPU Servers
