CogVideoX 5B from THUDM is a 5-billion-parameter text-to-video model that produces 6-second clips at 720×480, with noticeably better quality than its smaller 2B sibling. On our dedicated GPU hosting it needs a 24 GB+ card for comfortable inference.
VRAM
~22 GB at FP16 for the 5B base model; ~12 GB at FP8. CPU offload lets it run on smaller cards at a significant speed cost.
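As a rough sanity check on those numbers: the transformer weights alone account for only part of the footprint (the T5 text encoder, VAE, and activations make up the rest). A back-of-envelope sketch, with the per-parameter byte counts as the only inputs:

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: 1e9 params x bytes, ignoring overhead."""
    return params_billions * bytes_per_param

# The 5B transformer weights alone:
print(weights_gb(5, 2))  # FP16/BF16 -> 10.0 GB
print(weights_gb(5, 1))  # FP8       -> 5.0 GB
```

The gap between 10 GB of weights and ~22 GB observed in practice is roughly what the text encoder and inference activations add.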
Deployment
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
).to("cuda")

video = pipe(
    prompt="A timelapse of clouds forming over a mountain valley",
    num_inference_steps=50,  # full schedule; best quality
    num_frames=49,           # 49 frames at 8 fps, a ~6-second clip
    guidance_scale=6,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
A full 6-second clip at 720×480 requires the full 50-step schedule for best quality. Reducing to 30 steps saves time at a small quality cost.
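The frame math behind the 6-second figure, assuming CogVideoX's 8 fps output and counting the intervals between frames (first frame at t=0):

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    # Duration spanned by the frames, first frame at t=0.
    return (num_frames - 1) / fps

print(clip_seconds(49, 8))  # 6.0 seconds
```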
Performance
| GPU | 6 s clip @ 50 steps |
|---|---|
| RTX 3090 (24 GB) | ~8–10 min |
| RTX 5090 (32 GB) | ~4–5 min |
| RTX 6000 Pro (96 GB) | ~3–4 min |
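To put the table in perspective, a quick conversion of those wall-clock times into slower-than-realtime factors for a 6-second clip (using the midpoint of each range, which is an assumption):

```python
# Midpoints of the ranges in the table above, in minutes (assumed).
gen_minutes = {"RTX 3090": 9.0, "RTX 5090": 4.5, "RTX 6000 Pro": 3.5}
CLIP_SECONDS = 6

for gpu, minutes in gen_minutes.items():
    factor = minutes * 60 / CLIP_SECONDS
    print(f"{gpu}: ~{factor:.0f}x slower than realtime")
```

Even the fastest card here is well over an order of magnitude away from realtime generation.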
Image-to-Video
CogVideoX ships an image-to-video variant (CogVideoX-5b-I2V) that takes a starting frame plus a motion prompt. Output coherence is noticeably better than pure text-to-video. Recommended for product workflows.
Self-Hosted Video Model Hosting
Run CogVideoX on UK dedicated GPU servers with 24 GB+ VRAM.
Browse GPU Servers
See LTX Video (a faster alternative) and Hunyuan Video.