
Running Stable Video Diffusion on a GPU Server

How to deploy Stable Video Diffusion on a GPU server: VRAM budget, per-GPU generation times and a clean setup walkthrough.

Stable Video Diffusion (SVD) from Stability AI is still the most widely deployed open-weight image-to-video model. SVD-XT takes a single conditioning image and produces a 25-frame clip at 1024×576. VRAM is tight but workable on 16 GB consumer GPUs, and comfortable at 24 GB and above. This guide covers setup on a dedicated GPU server, per-GPU generation times, and the gotchas that will waste your day if you miss them.

VRAM budget

SVD-XT is a ~1.5B parameter UNet with an attached temporal VAE. FP16 weights are around 2.9 GB. The real VRAM cost is activations and the VAE decode step, where the 25-frame batch briefly pushes memory above 12 GB. In practice you want 14-16 GB available.

| Stage | FP16 VRAM | FP8 VRAM |
|---|---|---|
| UNet weights | 2.9 GB | 1.5 GB |
| UNet activations (25 frames) | 7.5 GB | 5.0 GB |
| VAE decode peak | 3.2 GB | 3.2 GB |
| CLIP + overhead | 1.4 GB | 1.4 GB |
| Total peak | ~12 GB | ~8 GB |
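The peak is lower than the column sum because the denoise loop and the VAE decode don't run at the same time: activations are freed before decode begins. A minimal sketch of that arithmetic, using the table's figures (assumptions from this guide, not fresh measurements):

```python
# Rough peak-VRAM estimator. The UNet denoise phase and the VAE decode phase
# don't coexist, so the peak is the larger phase, not the sum of all rows.
def peak_vram_gb(weights=2.9, activations=7.5, vae_decode=3.2, overhead=1.4):
    denoise_phase = weights + activations + overhead  # UNet running, VAE idle
    decode_phase = weights + vae_decode + overhead    # activations freed
    return max(denoise_phase, decode_phase)

print(round(peak_vram_gb(), 1))                              # FP16: 11.8
print(round(peak_vram_gb(weights=1.5, activations=5.0), 1))  # FP8:  7.9
```

That is where the ~12 GB FP16 and ~8 GB FP8 totals come from, and why the guide recommends 14-16 GB of headroom.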

Which GPUs fit

| GPU | VRAM | 25-frame SVD-XT | Notes |
|---|---|---|---|
| RTX 3060 12GB | 12 GB | Tight, tiled VAE | Works with vae_tiling |
| RTX 4060 Ti 16GB | 16 GB | Fits | Comfortable |
| RTX 5060 Ti 16GB | 16 GB | Fits, FP8 option | Best value |
| RTX 5080 16GB | 16 GB | Comfortable | Fastest 16 GB option |
| RTX 3090 24GB | 24 GB | Fits batch of 2 | Legacy workhorse |
| RTX 5090 32GB | 32 GB | Ideal | Batch 3-4 possible |
| RTX 6000 Pro 96GB | 96 GB | Batch 10+ | Studio workflows |
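As a quick cross-check against the table, here is a hypothetical fit test using the ~12 GB FP16 / ~8 GB FP8 peaks from the VRAM budget above; the 2 GB of headroom is an assumption, not a measured figure:

```python
# Hypothetical "does it fit" check, assuming the ~12 GB FP16 / ~8 GB FP8
# peak figures from the VRAM budget section plus a 2 GB safety margin.
def fits(vram_gb: float, fp8: bool = False, headroom_gb: float = 2.0) -> bool:
    peak = 8.0 if fp8 else 12.0
    return vram_gb >= peak + headroom_gb

print(fits(16))            # True  - 16 GB cards fit in FP16
print(fits(12))            # False - matches the "tight" 3060 row
print(fits(12, fp8=True))  # True  - FP8 rescues 12 GB cards
```

Note that 12 GB cards fail the plain FP16 check, which is exactly why the 3060 row calls for vae_tiling (or FP8) rather than a stock run.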

Install and first clip

We recommend the diffusers pipeline over the reference Stability repo; it is easier to maintain. This walkthrough assumes Python 3.11, CUDA 12.4 and PyTorch 2.6.

pip install torch==2.6.0 diffusers==0.30 transformers accelerate safetensors

# svd_clip.py
import torch
from PIL import Image
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    'stabilityai/stable-video-diffusion-img2vid-xt',
    torch_dtype=torch.float16, variant='fp16')
# With CPU offload enabled, diffusers moves modules to the GPU on demand;
# do not also call .to('cuda'), which would defeat the offload.
pipe.enable_model_cpu_offload()

img = Image.open('input.png').convert('RGB').resize((1024, 576))
frames = pipe(img, num_frames=25, decode_chunk_size=8).frames[0]
export_to_video(frames, 'clip.mp4', fps=7)

decode_chunk_size=8 is the single most important flag: the VAE decodes the 25 frames in chunks of eight instead of all at once, which prevents OOM during decode on 12-16 GB cards. On a 5090 you can raise it to 25 for a full-batch decode.
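If you run the same script across different cards, a small helper can pick the chunk size for you. This is not a diffusers API, just a sketch; the thresholds are assumptions derived from the VRAM budget in this guide and worth tuning on your own hardware:

```python
# Hypothetical helper: choose decode_chunk_size from free VRAM.
# Thresholds are assumptions based on this guide's budgets, not a diffusers API.
def pick_decode_chunk_size(free_vram_gb: float, num_frames: int = 25) -> int:
    if free_vram_gb >= 28:   # 5090-class: decode the whole clip at once
        return num_frames
    if free_vram_gb >= 20:   # 24 GB cards
        return 14
    if free_vram_gb >= 14:   # 16 GB cards
        return 8
    return 4                 # 12 GB cards; pair with VAE tiling

# Pass the result as decode_chunk_size=... in the pipeline call.
print(pick_decode_chunk_size(30))  # 25
print(pick_decode_chunk_size(15))  # 8
```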

Per-GPU generation times

| GPU | 25-frame clip (s) | Clips/hour |
|---|---|---|
| RTX 3060 12GB | 95 | 38 |
| RTX 4060 Ti 16GB | 62 | 58 |
| RTX 5060 Ti 16GB | 48 | 75 |
| RTX 3090 24GB | 41 | 87 |
| RTX 5080 16GB | 32 | 112 |
| RTX 5090 32GB | 22 | 163 |
| RTX 6000 Pro 96GB | 19 | 189 |

The 5090 delivers roughly 2x the throughput of a 3090 on the same job while using about 2.4x less energy per clip. The 5060 Ti punches above its price bracket for short-form content studios; see our image-generation studio guide.

Quality and throughput tips

  • Use motion_bucket_id between 100 and 180 for product-style motion; go higher for action.
  • Keep fps=7 for the default SVD-XT look; interpolate with RIFE for 24 fps delivery.
  • enable_model_cpu_offload() adds ~4 s of latency but reliably fits on 12 GB.
  • FP8 weights (via TorchAO) cut VRAM by roughly 30% on Blackwell GPUs.
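The motion settings above can be bundled into per-style presets. motion_bucket_id, fps, noise_aug_strength, num_frames and decode_chunk_size are real StableVideoDiffusionPipeline call arguments; the preset values themselves are assumptions drawn from the tips in this list:

```python
# Sketch: per-style call kwargs for the SVD-XT pipeline. The argument names
# are real pipeline parameters; the preset values are assumptions from this
# guide's tips (100-180 for product-style motion, higher for action).
PRESETS = {
    'product': dict(motion_bucket_id=127, fps=7, noise_aug_strength=0.02),
    'action':  dict(motion_bucket_id=200, fps=7, noise_aug_strength=0.05),
}

def svd_kwargs(style: str, num_frames: int = 25, decode_chunk_size: int = 8):
    kw = dict(PRESETS[style])  # copy so presets stay untouched
    kw.update(num_frames=num_frames, decode_chunk_size=decode_chunk_size)
    return kw

# Usage: frames = pipe(img, **svd_kwargs('product')).frames[0]
print(svd_kwargs('product'))
```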

Rent a GPU server for SVD

RTX 5060 Ti 16GB to RTX 6000 Pro 96GB, on-demand. UK dedicated hosting.

Browse GPU Servers

Alternatives to SVD

CogVideoX-5B fits on 24 GB and produces longer clips. HunyuanVideo is much higher quality but needs 30+ GB VRAM; see our HunyuanVideo VRAM guide. Mochi-1 sits between the two for a 12-second output.

See also: SDXL on 5060 Ti, Flux Schnell benchmark, Best GPU for SDXL, upgrading to 5090.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
