Stable Video Diffusion (SVD) is Stability AI’s image-to-video model. Feed it a still image and it generates a few seconds of motion (about four seconds with the SVD-XT default). On our dedicated GPU hosting it is the most established self-hosted image-to-video option.
VRAM
~11 GB for SVD-XT at FP16. Fits a 16 GB+ card comfortably. With model CPU offload, a 12 GB card works.
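That sizing decision can be sketched as a small helper. The thresholds mirror the figures above (~11 GB of weights at FP16, plus headroom for activations); choose_placement and its margins are illustrative assumptions, not a diffusers API:

```python
def choose_placement(vram_gb: float, model_gb: float = 11.0) -> str:
    # SVD-XT weighs roughly 11 GB at FP16 (figure from this guide).
    # Leave headroom for activations before committing to full-GPU placement.
    if vram_gb >= model_gb + 5:
        return "cuda"           # 16 GB+: load the whole pipeline on the GPU
    if vram_gb >= 12:
        return "cpu_offload"    # 12 GB cards: pipe.enable_model_cpu_offload()
    return "insufficient"
```

On the "cpu_offload" path, call pipe.enable_model_cpu_offload() instead of pipe.to("cuda"); diffusers then moves each submodel to the GPU only while it runs, trading speed for memory.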
Deployment
import torch
from diffusers import StableVideoDiffusionPipeline
from PIL import Image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
).to("cuda")

# SVD-XT expects a 1024x576 (16:9) conditioning image
image = Image.open("input.png").convert("RGB").resize((1024, 576))
frames = pipe(image, num_frames=25, num_inference_steps=25).frames[0]
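The plain resize above stretches inputs that are not already 16:9. One way to avoid that is to center-crop to 16:9 first; crop_box_16_9 below is an illustrative helper, not part of diffusers or PIL:

```python
def crop_box_16_9(w: int, h: int) -> tuple:
    # Return (left, top, right, bottom) of the largest centered 16:9 crop.
    # Integer arithmetic avoids float-comparison edge cases at exactly 16:9.
    if w * 9 > h * 16:          # wider than 16:9: trim the sides
        new_w = h * 16 // 9
        left = (w - new_w) // 2
        return (left, 0, left + new_w, h)
    new_h = w * 9 // 16         # taller than (or equal to) 16:9: trim top/bottom
    top = (h - new_h) // 2
    return (0, top, w, top + new_h)
```

Usage with the pipeline input: image = image.crop(crop_box_16_9(*image.size)).resize((1024, 576)).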
Variants
- SVD: 14 frames at 576×1024, ~2s clip
- SVD-XT: 25 frames, ~4s clip, recommended default
- SVD 1.1: a fine-tune of SVD-XT with more consistent motion
Limits
SVD is strictly image-to-video: there is no text prompt input, so you control the scene entirely through the input image. Motion intensity is steered by motion_bucket_id (0-255, diffusers default 127); higher values produce more motion. Fine-grained control is limited, as the model decides what moves.
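If you expose motion strength to users, a 0.0-1.0 slider maps naturally onto that scale. motion_bucket is a hypothetical convenience wrapper (only motion_bucket_id itself is a real pipeline parameter):

```python
def motion_bucket(strength: float) -> int:
    # Map a 0.0-1.0 "how much motion" slider onto the pipeline's
    # 0-255 motion_bucket_id scale; 127 is the diffusers default.
    return max(0, min(255, round(strength * 255)))
```

Pass the result at call time, e.g. pipe(image, motion_bucket_id=motion_bucket(0.7), num_frames=25).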
For text-to-video needs see LTX Video or CogVideoX.
Image-to-Video Hosting
SVD comes preconfigured on our UK dedicated GPU servers; any 16 GB+ tier will run it.