Stable Video Diffusion (SVD) takes a still image and animates it into a short clip, and the RTX 4090 24GB is the only consumer-class GPU that runs the model end-to-end without aggressive offloading. SVD-XT renders 25 frames at 1024 by 576 in roughly 25 seconds at FP16, dropping to 18 seconds when the U-Net is FP8-quantised, with peak VRAM landing at about 14 GB and 11 GB respectively. This benchmark documents per-clip latency, the activation memory profile, batched throughput limits, FP8 fidelity trade-offs and the operational gotchas teams hit when they move from a Jupyter notebook to a customer-facing rendering service on UK dedicated GPU hosting. If you need only the headline numbers, jump to the latency and VRAM tables; the methodology section explains how each figure was produced so you can reproduce them.
Contents
- Why SVD is hard on consumer GPUs
- Methodology and test rig
- VRAM map per precision
- Per-clip latency
- Batched throughput limits
- Resolution and frame-count sweep
- Reference configuration
- Production gotchas
Why SVD is hard on consumer GPUs
Image diffusion models like SDXL or FLUX render a single 2D latent of shape roughly 128 by 128 by 4. SVD’s temporal U-Net renders a 4D latent of shape (frames, channels, height, width), which means activation memory scales linearly with the number of frames. A 25-frame 1024 by 576 clip carries roughly 6.5 GB of activations alone in FP16, on top of the U-Net weights, the CLIP image encoder, the temporal attention layers and the VAE that decodes the final latent volume. That is why a 16 GB card cannot fit SVD-XT comfortably even with offload, and why the 4090’s 24 GB headroom is the entry point for native batched rendering. Compare with the 4090 spec breakdown and the GDDR6X bandwidth analysis for the underlying memory geometry.
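The linear scaling can be sketched with some back-of-envelope arithmetic. The helper names below are ours, the VAE downsample factor of 8 and 4 latent channels are the standard SD latent geometry, and the 6.5 GB activation figure for 25 frames is the measured number from this benchmark; everything derived from it is a linear extrapolation, not a measurement:

```python
# SVD latent volume and a linear-in-frames activation estimate,
# anchored to the measured 6.5 GB / 25-frame figure from this benchmark.

def latent_shape(frames, height, width, channels=4, vae_downsample=8):
    """4D latent rendered by SVD's temporal U-Net: (frames, C, H/8, W/8)."""
    return (frames, channels, height // vae_downsample, width // vae_downsample)

def activation_estimate_gb(frames, measured_gb=6.5, measured_frames=25):
    """Rough activation estimate, scaling linearly with frame count."""
    return round(measured_gb * frames / measured_frames, 1)

print(latent_shape(25, 576, 1024))      # (25, 4, 72, 128)
print(activation_estimate_gb(49))       # ~12.7 GB for the 49-frame run
```

This is why the 49-frame extended setting sits so close to the 24 GB ceiling: roughly doubling the frame count roughly doubles the activation footprint on top of the fixed weight and VAE costs.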
SVD also stresses VRAM bandwidth more than image diffusion. The temporal attention block reads the entire frame stack on every layer to compute frame-to-frame consistency, and the 4090’s 1008 GB/s of GDDR6X is what keeps that step fast. A 3090 with similar VRAM but lower bandwidth runs the same clip about 1.5x slower, despite identical weights.
Methodology and test rig
All numbers come from a stock 4090 Founders Edition (450 W TDP, no power cap) on the standard test rig: Ryzen 9 7950X, 64 GB DDR5-5600, Samsung 990 Pro 2 TB Gen 4 NVMe, Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6. Software stack is Diffusers 0.30, PyTorch 2.5, FlashAttention 2.6, xformers 0.0.28. We use the default StableVideoDiffusionPipeline in float16 with sdpa attention, motion bucket 127, augmentation strength 0.02. FP8 numbers use the optimum-quanto path that quantises the U-Net only, leaving CLIP and VAE in FP16. Each run is preceded by three warmup clips to populate caches; reported timings are the median of ten subsequent runs. Power is sampled via NVML at 100 ms cadence; observed average draw during denoising sits at 405-420 W, briefly peaking at 440 W during VAE decode.
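The warmup-then-median protocol above can be sketched in a few lines; `render_clip` here is a stand-in for whatever pipeline call you are timing, not part of the diffusers API:

```python
import statistics
import time

def benchmark(render_clip, warmup=3, runs=10):
    """Warm up, then return the median wall-clock seconds of `runs` renders,
    mirroring the measurement protocol used in this benchmark."""
    for _ in range(warmup):
        render_clip()  # populate caches, trigger any lazy compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        render_clip()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Using the median rather than the mean keeps a single thermally throttled or interrupted run from skewing the reported figure.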
```python
# Reference SVD-XT launch (FP16)
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")
pipe.unet.enable_forward_chunking(chunk_size=1)

image = load_image("conditioning.png")  # any 1024x576 still
frames = pipe(image, num_frames=25, num_inference_steps=25,
              motion_bucket_id=127, noise_aug_strength=0.02).frames[0]
```
VRAM map per precision
This is the breakdown for a single 25-frame 1024 by 576 SVD-XT clip at peak (during temporal attention plus the first VAE decode tile). FP8 saves roughly 3 GB on the U-Net and slightly trims activations because intermediate tensors round-trip through smaller buffers.
| Component | FP16 | FP8 U-Net |
|---|---|---|
| U-Net + temporal layers | 3.1 GB | 1.6 GB |
| CLIP image encoder | 1.5 GB | 1.5 GB |
| VAE decode (25-frame tile) | 2.0 GB | 2.0 GB |
| Activations (25 frames, 1024×576) | 6.5 GB | 5.0 GB |
| CUDA scratch + workspace | 0.9 GB | 0.9 GB |
| Peak resident | 14.0 GB | 11.0 GB |
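A quick way to sanity-check this table, and to budget your own configuration, is to sum the components and compare against the card. The figures below are hard-coded from the table purely for illustration:

```python
# Component VRAM in GB, copied from the table above.
FP16 = {"unet": 3.1, "clip": 1.5, "vae": 2.0, "activations": 6.5, "scratch": 0.9}
FP8  = {"unet": 1.6, "clip": 1.5, "vae": 2.0, "activations": 5.0, "scratch": 0.9}

def peak_and_headroom(components, card_gb=24.0):
    """Peak resident VRAM and remaining headroom on the card."""
    peak = round(sum(components.values()), 1)
    return peak, round(card_gb - peak, 1)

print(peak_and_headroom(FP16))  # (14.0, 10.0)
print(peak_and_headroom(FP8))   # (11.0, 13.0)
```

The 10 GB of FP16 headroom is what makes single-clip rendering comfortable; it is the batched configurations later in this post that actually consume it.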
Per-clip latency
Wall-clock end-to-end including image preprocessing and VAE decode. Numbers are the median of ten runs after warmup.
| Model | Frames | Resolution | Steps | FP16 latency | FP8 latency |
|---|---|---|---|---|---|
| SVD (14-frame) | 14 | 1024 x 576 | 25 | 14 s | 10 s |
| SVD-XT | 25 | 1024 x 576 | 25 | 25 s | 18 s |
| SVD-XT 1.1 | 25 | 1024 x 576 | 30 | 32 s | 23 s |
| SVD-XT | 25 | 1280 x 720 | 25 | 38 s | 27 s |
| SVD-XT (49-frame extended) | 49 | 1024 x 576 | 25 | 52 s | 37 s |
FP8 reduces wall time by 28-29% across the board with a perceptual quality difference that is visible only in side-by-side A/B comparison of fine textures (skin pores, fabric weave). For most marketing and storytelling use cases FP8 is the right default; reserve FP16 for hero shots or licensed content where you cannot tolerate any fidelity drop.
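The per-row saving and the sustained throughput figures quoted elsewhere in this post can be recomputed directly from the table; the latencies below are copied from it:

```python
# (FP16 s, FP8 s) per row of the latency table above.
runs = {
    "SVD 14f":     (14, 10),
    "SVD-XT":      (25, 18),
    "SVD-XT 1.1":  (32, 23),
    "SVD-XT 720p": (38, 27),
    "SVD-XT 49f":  (52, 37),
}

def fp8_saving_pct(fp16_s, fp8_s):
    """Percentage wall-time reduction from FP8 quantisation."""
    return round(100 * (1 - fp8_s / fp16_s), 1)

def clips_per_hour(latency_s):
    """Back-to-back sequential throughput, ignoring queueing overhead."""
    return int(3600 // latency_s)

print({k: fp8_saving_pct(*v) for k, v in runs.items()})
print(clips_per_hour(18))  # 200, the figure used in the verdict section
```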
VAE decode share
The VAE is responsible for roughly 18-22% of total latency on a 25-frame clip and is essentially serial across frames. Tiling the decode (8 frames per tile) trims another 5% off wall time and drops VAE peak by 700 MB at the cost of a small amount of per-tile overhead.
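The 8-frames-per-tile schedule amounts to chunking the frame stack so that only one tile's decode activations are resident at a time. A minimal sketch of that index arithmetic (the helper is ours, not a diffusers API):

```python
def decode_tiles(num_frames, tile=8):
    """(start, end) frame-index pairs covering the stack, `tile` frames each."""
    return [(i, min(i + tile, num_frames)) for i in range(0, num_frames, tile)]

print(decode_tiles(25))  # [(0, 8), (8, 16), (16, 24), (24, 25)]
```

A 25-frame clip therefore decodes in four passes, which is where the slight wall-time cost of the final short tile comes from.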
Batched throughput limits
SVD activations are large enough that batching beyond two clips OOMs a 24 GB card unless you drop frame count or resolution. The table below holds resolution at 1024 by 576 and varies clips per batch.
| Batch (clips) | Latency (FP16) | s/clip | VRAM | Notes |
|---|---|---|---|---|
| 1 | 25 s | 25.0 | 14.0 GB | Reference |
| 2 | 44 s | 22.0 | 21.5 GB | Tight; OK with cap |
| 3 | OOM | – | – | Even with offload |
| 2 (FP8 U-Net) | 34 s | 17.0 | 17.5 GB | Sweet spot |
| 4 (FP8 + 14 frames) | 49 s | 12.3 | 20.5 GB | For preview pipelines |
Batch 2 in FP8 is the realistic operational maximum on a 24 GB card; you save roughly 6% per clip versus single-shot FP8 (17 s versus 18 s) at the cost of slightly higher latency variance. For previews and thumbnails, dropping to 14 frames lets you batch 4 at FP8, producing short draft animations (14 frames, about two seconds at 7 fps) at roughly 12 seconds per clip.
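For a scheduler, it is convenient to encode the batch table as data rather than prose. The structure below is a hypothetical helper built from the measurements above; anything outside the tested configurations falls back to single-clip rendering:

```python
# (precision, frames) -> (max batch, total latency s, peak VRAM GB),
# taken from the batched-throughput table above.
BATCH_TABLE = {
    ("fp16", 25): (2, 44, 21.5),
    ("fp8", 25): (2, 34, 17.5),
    ("fp8", 14): (4, 49, 20.5),
}

def seconds_per_clip(precision, frames):
    """Amortised per-clip latency at the largest tested batch, else None."""
    batch, total_s, _ = BATCH_TABLE.get((precision, frames), (1, None, None))
    return None if total_s is None else total_s / batch

print(seconds_per_clip("fp8", 25))  # 17.0
print(seconds_per_clip("fp8", 14))  # 12.25
```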
Resolution and frame-count sweep
Diffusers does not enforce a fixed resolution for SVD; quality degrades visibly above 1280 by 720 because the model was trained on the 1024 by 576 tile. We include the higher resolutions as a reference for users who want to upscale-then-render.
| Frames | Resolution | FP16 latency | VRAM |
|---|---|---|---|
| 14 | 1024 x 576 | 14 s | 10.0 GB |
| 25 | 1024 x 576 | 25 s | 14.0 GB |
| 25 | 1280 x 720 | 38 s | 18.5 GB |
| 49 | 1024 x 576 | 52 s | 22.0 GB (tight) |
| 25 | 1536 x 864 | – | OOM at FP16 |
For the 49-frame extended setting you need guard rails in the spirit of vLLM's --gpu-memory-utilization flag: pre-allocate nothing else on the card, and ensure CUDA scratch has not been bloated by an earlier process. We have seen the 49-frame run OOM intermittently when a co-resident image generator left fragmentation behind.
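One standard mitigation for that fragmentation-driven intermittent OOM is opting into PyTorch's expandable allocator segments. This is a generic `PYTORCH_CUDA_ALLOC_CONF` option, not something SVD-specific, and it must be set before torch is imported:

```python
import os

# Allow the CUDA caching allocator to grow segments instead of leaving
# unusable fragments between differently sized allocations.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # must come after the environment variable is set
```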
Reference configuration
The recipe we ship to customers for production SVD on a single 4090 looks like this. The key flags are enable_forward_chunking on the U-Net (frees ~700 MB), VAE tiling at 8 frames per tile, and disabling the safety checker for backend rendering (it costs 350 ms and ~600 MB).
```python
# Production SVD-XT FP8 launch with sane defaults
import torch
from diffusers import StableVideoDiffusionPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")
quantize(pipe.unet, weights=qfloat8)
freeze(pipe.unet)  # bake in the quantised weights
pipe.unet.enable_forward_chunking(chunk_size=1)
pipe.vae.enable_tiling(tile_sample_min_size=512)
# Reference: 25 frames, 25 steps, 1024x576, motion bucket 127
```
For broader diffusion context, see the Stable Diffusion setup guide, the ComfyUI setup, and the FLUX setup guide. Throughput cross-references live in the SDXL benchmark and FLUX dev benchmark.
Production gotchas
- Don't enable `enable_model_cpu_offload` by default. The 4090 has the VRAM to keep all SVD components resident. CPU offload adds 4-7 seconds per clip due to PCIe round trips and is only worth it if you are co-hosting another large model.
- Watch VAE decode peak. Peak VRAM is reached during the VAE pass, not denoising. If `nvidia-smi` shows headroom mid-render, do not assume it will hold; the last 10% of the run spikes hardest.
- Motion bucket 127 is a sane default, not a recipe. Higher values (180-220) increase motion magnitude but also increase artefact rate, especially face wobble. Tune per content type and test with a held-out validation set.
- FP8 quantisation needs Ada or newer. The numbers above use the native FP8 path on Ada. On a 3090 you must fall back to INT8, which loses 1-2 PSNR points on temporal coherence.
- Power capping helps thermal stability. A 380 W cap (set via `nvidia-smi -pl 380`) costs ~5% on per-clip latency and drops sustained junction temp by 6-8 °C, which matters when the chassis sits at 30 °C ambient. See the power draw post and the thermal performance writeup.
- Watch the Hugging Face cache on first launch. The full SVD-XT FP16 weights are ~9.5 GB; first download takes 90-180 seconds on a Gigabit link. Stage to persistent NVMe under `HF_HOME` or your container restarts will be slow.
- Concurrency requires careful scheduling. Two simultaneous SVD-XT runs starting in the same CUDA context fragment the allocator and can OOM the second job near VAE decode. Serialise in a single-worker queue or run two replicas in separate processes with isolated CUDA contexts.
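The single-worker queue pattern from the concurrency bullet can be sketched with the standard library; `render_fn` stands in for the actual pipeline call, and the helper names are ours:

```python
import queue
import threading

def start_render_worker(render_fn):
    """One worker thread owns the GPU; callers block until their clip is done."""
    jobs = queue.Queue()

    def worker():
        while True:
            image, done = jobs.get()
            if image is None:  # shutdown sentinel
                return
            done.put(render_fn(image))

    threading.Thread(target=worker, daemon=True).start()

    def submit(image):
        done = queue.Queue(maxsize=1)
        jobs.put((image, done))
        return done.get()  # blocks until this clip is rendered

    def shutdown():
        jobs.put((None, None))

    return submit, shutdown
```

Because every render funnels through one thread, two requests can never hit the allocator concurrently, which removes the fragmentation-near-VAE-decode failure mode at the cost of serial latency.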
Verdict: when to pick the 4090 for SVD
If you need single-card SVD-XT inference with native FP16 fit and reproducible 18-25 second per-clip latency, the 4090 is currently the only consumer choice that is comfortable rather than borderline. A single 4090 produces roughly 200 SVD-XT FP8 clips per hour, or 4,800 per day at 24/7 utilisation; that comfortably backs a marketing-scale animation product or a creative studio's short-form pipeline. For larger frame counts (49+) or 720p output, you are better off renting a 5090 or 6000 Pro and living with the higher cost; see the 4090 vs 5090 decision and the 4090 vs 6000 Pro piece. For longer, character-driven animation, AnimateDiff plus SDXL is a more economical pipeline at this VRAM tier and is documented in our broader image generation studio guide.
Generate video clips on a single 4090
SVD-XT in 18-25 seconds, 14 GB peak. UK dedicated hosting.
Order the RTX 4090 24GB

See also: Stable Diffusion setup, ComfyUI setup, FLUX setup, FLUX dev benchmark, FLUX schnell benchmark, SDXL benchmark, image generation studio, best GPU for Stable Diffusion.