
XTTS-v2 VRAM Requirements

Complete VRAM breakdown for Coqui's XTTS-v2 text-to-speech model covering all precision levels, voice cloning overhead, GPU recommendations, and comparison with other TTS models.

XTTS-v2 Overview

XTTS-v2 from Coqui is an advanced text-to-speech model with built-in voice cloning from short audio samples. It supports 17 languages and produces natural-sounding speech with emotion and prosody control. At approximately 470M parameters, it sits between lightweight models like Kokoro and heavyweight models like Bark. If you self-host XTTS-v2 on a dedicated GPU server, understanding its VRAM profile is critical for co-hosting it with other models.

VRAM Requirements by Precision

Precision | Model Weights | Generation Overhead | Total VRAM
--- | --- | --- | ---
FP32 | ~1.9 GB | ~1.5 GB | ~3.4 GB
FP16 / BF16 | ~1.0 GB | ~1.0 GB | ~2.0 GB
INT8 | ~0.5 GB | ~0.8 GB | ~1.3 GB

XTTS-v2 at FP16 uses approximately 2 GB of VRAM during generation. This includes the GPT-2 style autoregressive decoder, the HiFi-GAN vocoder, and the speaker embedding encoder. The generation overhead accounts for intermediate tensors during the autoregressive decoding loop.
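The table's totals follow directly from parameter count times bytes per parameter, plus generation overhead. A minimal sketch, assuming the article's ~470M parameter figure (the helper name and overhead values are illustrative, taken from the table above):

```python
def estimate_vram_gb(params_m: float, bytes_per_param: float, overhead_gb: float) -> float:
    """Rough VRAM estimate: weights (params x bytes/param) plus generation overhead."""
    weights_gb = params_m * 1e6 * bytes_per_param / 1e9
    return round(weights_gb + overhead_gb, 1)

# XTTS-v2 at ~470M parameters, overhead figures from the table above
print(estimate_vram_gb(470, 4, 1.5))  # → 3.4 (FP32)
print(estimate_vram_gb(470, 2, 1.0))  # → 1.9 (FP16; the table rounds to ~2 GB)
print(estimate_vram_gb(470, 1, 0.8))  # → 1.3 (INT8)
```

Real allocations will be somewhat higher due to CUDA context and allocator fragmentation, so treat these as lower bounds when planning headroom.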

Voice Cloning VRAM Overhead

XTTS-v2’s voice cloning feature extracts speaker embeddings from a reference audio clip (6+ seconds recommended). This adds a small VRAM spike during embedding extraction but does not increase steady-state VRAM during generation.

Operation | Additional VRAM (FP16) | Duration
--- | --- | ---
Speaker embedding extraction | ~0.3 GB temporary | ~0.5 s
Generation with cloned voice | ~0 GB (embedding cached) | N/A
Multiple voice cache (10 voices) | ~0.01 GB | N/A

Speaker embeddings are tiny (a few KB each) and can be pre-computed and cached. Running XTTS-v2 with 10+ cached voices adds negligible VRAM. For speed comparisons, see the TTS latency benchmarks.
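The pre-compute-and-cache pattern can be sketched as below. `VoiceCache` and the stub extractor are hypothetical names for illustration; the real extraction call in Coqui's TTS library is the model's conditioning-latent method, which incurs the one-off ~0.3 GB spike noted above:

```python
class VoiceCache:
    """Cache speaker embeddings so each reference clip is processed only once."""

    def __init__(self, extractor):
        self._extractor = extractor  # callable: reference wav path -> embedding
        self._cache = {}

    def get(self, voice_name: str, reference_wav: str):
        if voice_name not in self._cache:
            # The temporary VRAM spike happens here, once per voice
            self._cache[voice_name] = self._extractor(reference_wav)
        return self._cache[voice_name]

# Stub extractor standing in for the real model call; counts invocations.
calls = []
def fake_extractor(wav_path):
    calls.append(wav_path)
    return [0.1, 0.2]  # placeholder embedding (real ones are a few KB)

cache = VoiceCache(fake_extractor)
cache.get("alice", "alice_6s.wav")
cache.get("alice", "alice_6s.wav")  # served from cache, no second extraction
assert len(calls) == 1
```

Because cached embeddings live in system RAM until generation, a catalogue of dozens of voices costs effectively nothing in VRAM.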

GPU Recommendations

GPU | VRAM | XTTS-v2 Capability | Real-Time Factor
--- | --- | --- | ---
RTX 3050 | 6 GB | FP16 + voice cloning, 4 GB free | ~3x
RTX 4060 | 8 GB | FP16 + co-hosting, 6 GB free | ~5x
RTX 4060 Ti | 16 GB | FP16 + multi-model, 14 GB free | ~6x
RTX 3090 | 24 GB | FP16 + full pipeline, 22 GB free | ~8x

XTTS-v2 runs comfortably on any GPU with 6+ GB of VRAM. The RTX 4060 is the sweet spot, leaving 6 GB free for a co-hosted LLM or Whisper model.
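The "free VRAM" column is simple subtraction, which you can reuse when budgeting a co-hosted stack. A minimal sketch (the helper name is illustrative, and it ignores CUDA context overhead and fragmentation):

```python
XTTS_FP16_GB = 2.0  # steady-state footprint from the precision table above

def free_vram_gb(gpu_vram_gb: float, model_footprints_gb: list) -> float:
    """VRAM left after loading the given models (ignores driver/context overhead)."""
    return gpu_vram_gb - sum(model_footprints_gb)

# RTX 4060 (8 GB) running XTTS-v2 alone
print(free_vram_gb(8, [XTTS_FP16_GB]))  # → 6.0 GB free for a co-hosted model
```

In practice, leave roughly 0.5-1 GB of slack for the CUDA context before committing the remainder to another model.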

Comparison with Bark and Kokoro

Feature | XTTS-v2 | Bark | Kokoro
--- | --- | --- | ---
FP16 VRAM | ~2 GB | ~6 GB | ~0.4 GB
Voice Cloning | Yes (6s sample) | Limited (preset voices) | No
Languages | 17 | 13 | Limited
Speed (RTF) | 3-8x | 0.8-1.5x | 20-33x
Non-Speech Audio | No | Yes | No

XTTS-v2 is the best choice when you need voice cloning and multilingual support. For raw speed, choose Kokoro. For creative audio including music and effects, choose Bark. See our Kokoro VRAM guide for the lightweight alternative.
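The selection logic above can be encoded as a simple decision rule. This is purely illustrative (the function name and priority ordering are assumptions, not an official recommendation engine):

```python
def pick_tts_model(need_cloning: bool, need_nonspeech: bool, latency_critical: bool) -> str:
    """Encode the comparison table as a decision rule (illustrative only)."""
    if need_nonspeech:
        return "Bark"      # only option with music and sound effects
    if need_cloning:
        return "XTTS-v2"   # 6-second-sample cloning, 17 languages
    if latency_critical:
        return "Kokoro"    # 20-33x real-time at ~0.4 GB VRAM
    return "XTTS-v2"       # balanced default

print(pick_tts_model(need_cloning=True, need_nonspeech=False, latency_critical=False))  # → XTTS-v2
```

Note the ordering: non-speech audio is checked first because Bark is the only model that supports it at all, whereas speed and cloning are trade-offs.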

Deployment Recommendations

Deploy XTTS-v2 for applications requiring personalised voices: audiobook narration, virtual assistants with branded voices, or multilingual content creation. Pair LLaMA 3 for script generation with XTTS-v2 for synthesis; on a single RTX 4060, both models run simultaneously.

Use the GPU comparisons tool to evaluate hardware. Estimate costs with the cost calculator. Browse all deployment guides in the model guides section.

Host XTTS-v2 on Dedicated GPUs

Run XTTS-v2 with voice cloning on dedicated GPU servers. Clone any voice from a 6-second sample with no API limits.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
