Stable Video Diffusion (SVD) from Stability AI is still the most widely deployed open-weight image-to-video model. SVD-XT takes a single conditioning image and produces a 25-frame clip at 1024×576. VRAM requirements are tight but workable on 16 GB consumer GPUs, and comfortable on 24 GB and above. This guide covers setup on a dedicated GPU server, per-GPU generation times, and the gotchas that will waste your day if you miss them.
Contents
- VRAM budget
- Which GPUs fit
- Install and first clip
- Per-GPU generation times
- Quality and throughput tips
- Alternatives to SVD
VRAM budget
SVD-XT is a ~1.5B parameter UNet with an attached temporal VAE. FP16 weights are around 2.9 GB. The real VRAM cost is activations and the VAE decode step, where the 25-frame batch briefly pushes memory to around 12 GB at peak. In practice you want 14-16 GB available for headroom.
| Stage | FP16 VRAM | FP8 VRAM |
|---|---|---|
| UNet weights | 2.9 GB | 1.5 GB |
| UNet activations (25 frames) | 7.5 GB | 5.0 GB |
| VAE decode peak | 3.2 GB | 3.2 GB |
| CLIP + overhead | 1.4 GB | 1.4 GB |
| Total peak | ~12 GB | ~8 GB |
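The total row is a peak, not a column sum: the sampler's activations are released before the VAE decode runs, so the two largest consumers never coexist. A quick sketch of the arithmetic, assuming exactly that phase behaviour and using the FP16 column above:

# Peak VRAM = max over phases, not the sum of all rows (FP16 numbers, GB)
unet_weights, unet_acts = 2.9, 7.5
vae_decode, overhead = 3.2, 1.4
denoise = unet_weights + unet_acts + overhead   # ~11.8 GB while sampling
decode = unet_weights + vae_decode + overhead   # ~7.5 GB while decoding
print(f'peak: ~{max(denoise, decode):.0f} GB')  # ~12 GB, matching the table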
Which GPUs fit
| GPU | VRAM | 25-frame SVD-XT | Notes |
|---|---|---|---|
| RTX 3060 12GB | 12 GB | Tight, tiled VAE | Works with vae_tiling |
| RTX 4060 Ti 16GB | 16 GB | Fits | Comfortable |
| RTX 5060 Ti 16GB | 16 GB | Fits, FP8 option | Best value |
| RTX 5080 16GB | 16 GB | Comfortable | Fastest 16 GB option |
| RTX 3090 24GB | 24 GB | Fits batch of 2 | Legacy workhorse |
| RTX 5090 32GB | 32 GB | Ideal | Batch 3-4 possible |
| RTX 6000 Pro 96GB | 96 GB | Batch 10+ | Studio workflows |
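Before committing to a configuration, it is worth checking what the card actually reports and picking settings accordingly. A minimal sketch; the thresholds are rough guidance distilled from the table above, not hard limits:

import torch

# Rough tiers based on the fit column above (thresholds are our own)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
if vram_gb < 14:
    print('12 GB class: enable_model_cpu_offload + small decode_chunk_size')
elif vram_gb < 20:
    print('16 GB class: FP16 fits; keep decode_chunk_size=8')
else:
    print(f'{vram_gb:.0f} GB class: larger decode chunks or batched clips')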
Install and first clip
We recommend the diffusers pipeline; it is more maintainable than the reference Stability repo. Python 3.11, CUDA 12.4, PyTorch 2.6.
pip install torch==2.6.0 diffusers==0.30 transformers accelerate safetensors
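A quick sanity check that the intended stack landed:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"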
python -c "
from diffusers import StableVideoDiffusionPipeline
import torch
from PIL import Image

# enable_model_cpu_offload manages device placement itself,
# so do not also call .to('cuda')
p = StableVideoDiffusionPipeline.from_pretrained(
    'stabilityai/stable-video-diffusion-img2vid-xt',
    torch_dtype=torch.float16, variant='fp16')
p.enable_model_cpu_offload()

img = Image.open('input.png').convert('RGB').resize((1024, 576))
frames = p(img, num_frames=25, decode_chunk_size=8).frames[0]

# frames is a list of PIL images; save them for muxing with ffmpeg
for i, frame in enumerate(frames):
    frame.save(f'frame_{i:03d}.png')
"
decode_chunk_size=8 is the single most important argument: it prevents VAE OOM on 12-16 GB cards. On a 5090 you can raise it to 25 for a full-batch decode.
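For illustration, the same call at both ends of the range (p and img as in the install step); note that very small chunks can introduce slight flicker at chunk boundaries:

# 12-16 GB cards: small chunks minimise peak VAE decode memory
frames = p(img, num_frames=25, decode_chunk_size=2).frames[0]

# 24 GB+ cards: decode the whole 25-frame batch in one pass
frames = p(img, num_frames=25, decode_chunk_size=25).frames[0]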
Per-GPU generation times
| GPU | 25-frame clip (s) | Clips/hour |
|---|---|---|
| RTX 3060 12GB | 95 | 38 |
| RTX 4060 Ti 16GB | 62 | 58 |
| RTX 5060 Ti 16GB | 48 | 75 |
| RTX 3090 24GB | 41 | 87 |
| RTX 5080 16GB | 32 | 112 |
| RTX 5090 32GB | 22 | 163 |
| RTX 6000 Pro 96GB | 19 | 189 |
The 5090 delivers roughly 2x the throughput of a 3090 on the same job while using about 2.4x less energy per clip. The 5060 Ti punches above its price bracket for short-form content studios; see our image-generation studio guide.
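To reproduce these timings on your own card, a minimal harness around the pipeline call from the install step (p and img as defined there):

import time
import torch

torch.cuda.synchronize()                      # flush queued GPU work first
t0 = time.perf_counter()
frames = p(img, num_frames=25, decode_chunk_size=8).frames[0]
torch.cuda.synchronize()                      # wait for the GPU to finish
print(f'{time.perf_counter() - t0:.1f} s per 25-frame clip')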
Quality and throughput tips
- Use motion_bucket_id between 100 and 180 for product-style motion; go higher for action (see the snippet after this list).
- Keep fps=7 for the default SVD-XT look; interpolate with RIFE for 24 fps delivery.
- enable_model_cpu_offload() adds ~4 s of latency but reliably fits on 12 GB.
- FP8 weights (via Torch AO) cut VRAM by 30% on Blackwell.
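As referenced above, a sketch of the first two knobs together; motion_bucket_id and fps are standard arguments of the diffusers SVD pipeline (p and img as in the install step), and the commented FP8 lines assume the torchao quantization API:

# Higher motion_bucket_id = more motion; fps=7 matches the model's
# training frame spacing, so interpolate to 24 fps after generation
frames = p(img, num_frames=25, decode_chunk_size=8,
           motion_bucket_id=180, fps=7).frames[0]

# Optional FP8 weight-only quantization on Blackwell (assumes torchao):
# from torchao.quantization import quantize_, float8_weight_only
# quantize_(p.unet, float8_weight_only())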
Rent a GPU server for SVD
RTX 5060 Ti 16GB to RTX 6000 Pro 96GB, on-demand. UK dedicated hosting.
Alternatives to SVD
CogVideoX-5B fits on 24 GB and produces longer clips. HunyuanVideo is much higher quality but needs 30+ GB VRAM; see our HunyuanVideo VRAM guide. Mochi-1 sits between the two for a 12-second output.
See also: SDXL on 5060 Ti, Flux Schnell benchmark, Best GPU for SDXL, upgrading to 5090.