Tencent’s HunyuanVideo is the most capable open-weight text-to-video model of 2025 by some distance. It is also the most demanding. At 13B parameters with a DiT backbone and a 3D VAE, it does not fit on any 16 GB consumer GPU without aggressive offload, and even then inference is painful. This guide lays out the VRAM budget honestly, lists which GPUs actually run it, and gives sensible alternatives if you are stuck at 16 or 24 GB. For the hardware, we stock MI300X, RTX 6000 Pro and H100 on dedicated GPU hosting.
Contents
- VRAM budget at FP16 and FP8
- GPUs that actually run it
- Generation time at 540p and 720p
- Offload and low-VRAM options
- Alternatives under 24 GB
- Quick start
VRAM budget
HunyuanVideo has three heavy components: the DiT transformer (~13B params), the 3D VAE, and a LLaMA-based text encoder. Peak memory is during the final VAE decode of the full latent video, which scales with frame count and resolution.
| Component | FP16 | FP8 | INT4 |
|---|---|---|---|
| DiT weights | 26 GB | 13 GB | 7 GB |
| Text encoder (LLaMA) | 14 GB | 7 GB | 3.5 GB |
| 3D VAE decode peak (129 frames @ 540p) | 12 GB | 12 GB | 12 GB |
| Activations + KV | 8 GB | 6 GB | 4 GB |
| Total peak (all resident) | ~60 GB | ~38 GB | ~27 GB |
| With CPU offload | ~30 GB | ~22 GB | ~16 GB |
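The weight rows above are simple parameter-count arithmetic: a model with P billion parameters needs roughly P × (bytes per parameter) GB for weights alone. A quick sanity check against the table (the 13B DiT and ~7B text-encoder sizes come from the table; INT4 carries some quantisation overhead, hence 7 GB rather than 6.5 GB):

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in decimal GB: 1e9 params x N bytes/param = N GB."""
    return params_billion * bytes_per_param

print(weight_gb(13, 2.0))  # FP16 DiT -> 26.0
print(weight_gb(13, 1.0))  # FP8 DiT  -> 13.0
print(weight_gb(13, 0.5))  # INT4 DiT -> 6.5 (table shows ~7 GB with overhead)
print(weight_gb(7, 2.0))   # FP16 LLaMA-class text encoder -> 14.0
```

Note that the VAE decode row does not shrink with quantisation: the VAE stays in higher precision, and its peak is dominated by the decoded frame tensor, not by its weights.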
GPUs that actually run it
| GPU | VRAM | Runs HunyuanVideo? | Caveats |
|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | No (even with offload) | Use CogVideoX instead |
| RTX 4090 / 5080 | 24 / 16 GB | 4090 with INT4+offload, 5080 no | Painful, ~4x slower |
| RTX 3090 / 5090 | 24 / 32 GB | 5090 yes (FP8+offload), 3090 INT4 only | 5090 is the minimum comfortable consumer option |
| RTX 6000 Pro 96GB | 96 GB | Yes, full FP16 | Fastest single-card option |
| H100 80GB | 80 GB | Yes, FP16 | Data-centre cost |
| MI300X 192GB | 192 GB | Yes, batch multiple jobs | Requires ROCm build |
Generation time at 540p and 720p
HunyuanVideo’s reference setting generates 129 frames (approximately 5 seconds at 24 fps). The “4-minute clip at 540p” benchmark frequently quoted refers to wall-clock generation time, not output length.
| GPU | 540p 129-frame gen time | 720p 129-frame |
|---|---|---|
| RTX 5090 32GB (FP8+offload) | ~8 min | ~14 min |
| RTX 6000 Pro 96GB | ~4 min | ~7 min |
| H100 80GB | ~3 min | ~5.5 min |
| MI300X 192GB | ~3.5 min | ~6 min |
| 2x H100 NVLink | ~1.8 min | ~3.2 min |
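Another way to read this table: 129 frames at 24 fps is about 5.4 seconds of footage, so even the fastest single cards are tens of times slower than real time. A rough conversion, using the 540p column above:

```python
clip_seconds = 129 / 24  # ~5.4 s of output video per run

# 540p wall-clock minutes from the table above
gen_minutes_540p = {"RTX 5090": 8, "RTX 6000 Pro": 4, "H100": 3, "MI300X": 3.5}
for gpu, minutes in gen_minutes_540p.items():
    slowdown = minutes * 60 / clip_seconds
    print(f"{gpu}: {slowdown:.0f}x slower than real time")
```

Even the H100 sits around 33x real time, which is why batch throughput (MI300X) or multi-GPU (2x H100) matters for anything production-shaped.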
Offload and low-VRAM options
The community HunyuanVideoGP fork exposes aggressive CPU offload and block-wise quantisation. On a 24 GB RTX 3090 it will technically run at ~45 minutes per 540p clip in INT4, which is usually unworkable for production but fine for experimentation.
```
pip install diffusers==0.32 accelerate bitsandbytes
```

```python
import torch
from diffusers import HunyuanVideoPipeline

# Load in FP16 and lean on offload + tiling; for INT4, quantise the
# transformer with diffusers' BitsAndBytesConfig before building the pipeline.
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()  # stream modules through the GPU one at a time
pipe.vae.enable_tiling()              # tile the 3D VAE decode to cap peak VRAM
pipe.transformer.to(memory_format=torch.channels_last)
```
Alternatives under 24 GB
- CogVideoX-5B: 12 GB VRAM, 6-second clips, decent quality.
- Stable Video Diffusion XT: 12-14 GB, 25 frames only; see our SVD guide.
- Mochi-1: 480p, 22 GB VRAM, higher quality than SVD.
- Wan 2.1 1.3B: lightweight, fits on 8 GB, short clips.
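The list above collapses to a VRAM lookup. A sketch of that rule of thumb (`pick_model` is a hypothetical helper, not a library API; thresholds are the minimums quoted above):

```python
def pick_model(vram_gb: float) -> str:
    # Thresholds follow the alternatives list; HunyuanVideo in INT4 with
    # offload fits at ~16 GB but is impractically slow below 24 GB.
    if vram_gb >= 24:
        return "HunyuanVideo (INT4 + offload, slow)"
    if vram_gb >= 22:
        return "Mochi-1 (480p)"
    if vram_gb >= 12:
        return "CogVideoX-5B or SVD-XT"
    if vram_gb >= 8:
        return "Wan 2.1 1.3B"
    return "nothing comfortable"

print(pick_model(16))  # CogVideoX-5B or SVD-XT
print(pick_model(32))  # HunyuanVideo (INT4 + offload, slow)
```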
Need 80+ GB of VRAM for HunyuanVideo?
RTX 6000 Pro 96GB, H100 80GB and MI300X available. UK dedicated hosting.
Quick start on RTX 6000 Pro
```
git clone https://github.com/Tencent/HunyuanVideo
cd HunyuanVideo
pip install -r requirements.txt
python sample_video.py --video-size 544 960 --video-length 129 \
    --infer-steps 50 --prompt "an origami fox walking through snow" \
    --save-path ./out
```
See also: upgrading to RTX 6000 Pro, 5060 Ti to 5090, SVD on a GPU server, Best GPU for SDXL.