XTTS-v2 Overview
XTTS-v2 from Coqui is an advanced text-to-speech model with built-in voice cloning from short audio samples. It supports 17 languages and produces natural-sounding speech with emotion and prosody control. At approximately 470M parameters, it sits between lightweight models like Kokoro and heavyweight models like Bark. When self-hosting XTTS-v2 on a dedicated GPU server, understanding its VRAM profile is critical for co-hosting it with other models.
VRAM Requirements by Precision
| Precision | Model Weights | Generation Overhead | Total VRAM |
|---|---|---|---|
| FP32 | ~1.9 GB | ~1.5 GB | ~3.4 GB |
| FP16 / BF16 | ~1.0 GB | ~1.0 GB | ~2.0 GB |
| INT8 | ~0.5 GB | ~0.8 GB | ~1.3 GB |
XTTS-v2 at FP16 uses approximately 2 GB of VRAM during generation. This includes the GPT-2 style autoregressive decoder, the HiFi-GAN vocoder, and the speaker embedding encoder. The generation overhead accounts for intermediate tensors during the autoregressive decoding loop.
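As a sanity check on the table above, the weight portion of the footprint follows directly from the parameter count and bytes per parameter (the generation-overhead column is empirical and cannot be derived this way; the small gap versus the table reflects buffers and other non-weight state):

```python
def weight_vram_gb(params: int, bytes_per_param: float) -> float:
    """Estimate model-weight VRAM from parameter count and precision."""
    return params * bytes_per_param / 1024**3

PARAMS = 470_000_000  # approximate XTTS-v2 parameter count

print(round(weight_vram_gb(PARAMS, 4), 2))  # FP32: 1.75 GB
print(round(weight_vram_gb(PARAMS, 2), 2))  # FP16: 0.88 GB
print(round(weight_vram_gb(PARAMS, 1), 2))  # INT8: 0.44 GB
```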
Voice Cloning VRAM Overhead
XTTS-v2’s voice cloning feature extracts speaker embeddings from a reference audio clip (6+ seconds recommended). This adds a small VRAM spike during embedding extraction but does not increase steady-state VRAM during generation.
| Operation | Additional VRAM (FP16) | Duration |
|---|---|---|
| Speaker embedding extraction | ~0.3 GB temporary | ~0.5s |
| Generation with cloned voice | ~0 GB (embedding cached) | N/A |
| Multiple voice cache (10 voices) | ~0.01 GB | N/A |
Speaker embeddings are tiny (a few KB each) and can be pre-computed and cached. Running XTTS-v2 with 10+ cached voices adds negligible VRAM. For speed comparisons, see the TTS latency benchmarks.
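Because embeddings are so small, the caching pattern is trivial. A minimal sketch, where `extract_embedding` is a hypothetical stand-in for XTTS-v2's speaker encoder (not a real Coqui API call):

```python
from functools import lru_cache

def extract_embedding(reference_wav: str) -> bytes:
    # Hypothetical placeholder: the real extractor runs the speaker encoder
    # on ~6 s of reference audio, with a ~0.3 GB temporary VRAM spike.
    return b"\x00" * 4096  # real embeddings are only a few KB

@lru_cache(maxsize=32)
def cached_embedding(reference_wav: str) -> bytes:
    # The first call per voice pays the ~0.5 s extraction cost; repeat calls
    # hit the cache and add no steady-state VRAM beyond the few-KB vector.
    return extract_embedding(reference_wav)
```

With this pattern, ten cached voices occupy kilobytes of host memory, consistent with the ~0.01 GB figure in the table.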
GPU Recommendations
| GPU | VRAM | XTTS-v2 Capability | Real-Time Factor |
|---|---|---|---|
| RTX 3050 | 6 GB | FP16 + voice cloning, 4 GB free | ~3x |
| RTX 4060 | 8 GB | FP16 + co-hosting, 6 GB free | ~5x |
| RTX 4060 Ti | 16 GB | FP16 + multi-model, 14 GB free | ~6x |
| RTX 3090 | 24 GB | FP16 + full pipeline, 22 GB free | ~8x |
XTTS-v2 runs comfortably on any GPU with 6+ GB of VRAM. The RTX 4060 is the sweet spot, leaving 6 GB free for a co-hosted LLM or Whisper model.
Comparison with Bark and Kokoro
| Feature | XTTS-v2 | Bark | Kokoro |
|---|---|---|---|
| FP16 VRAM | ~2 GB | ~6 GB | ~0.4 GB |
| Voice Cloning | Yes (6s sample) | Limited (preset voices) | No |
| Languages | 17 | 13 | Limited |
| Speed (RTF) | 3-8x | 0.8-1.5x | 20-33x |
| Non-Speech Audio | No | Yes | No |
XTTS-v2 is the best choice when you need voice cloning and multilingual support. For raw speed, choose Kokoro. For creative audio including music and effects, choose Bark. See our Kokoro VRAM guide for the lightweight alternative.
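The real-time factor (RTF) column translates directly into wall-clock latency: an RTF of 5x means five seconds of audio are synthesized per second of compute. Using representative mid-range figures from the table above:

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to synthesize a clip at a given real-time factor."""
    return audio_seconds / rtf

# 60 s of narration at mid-range RTF values:
print(round(synthesis_seconds(60, 5), 1))   # XTTS-v2 (~5x): 12.0 s
print(round(synthesis_seconds(60, 1.0), 1)) # Bark (~1x): 60.0 s
print(round(synthesis_seconds(60, 25), 1))  # Kokoro (~25x): 2.4 s
```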
Deployment Recommendations
Deploy XTTS-v2 for applications requiring personalised voices: audiobook narration, virtual assistants with branded voices, or multilingual content creation. Pair it with a quantized LLaMA 3 model for script generation, with XTTS-v2 handling synthesis. On a single RTX 4060, both can run simultaneously: XTTS-v2's ~2 GB leaves room for a 4-bit quantized LLaMA 3 8B (~5 GB) in the 8 GB budget.
Use the GPU comparisons tool to evaluate hardware. Estimate costs with the cost calculator. Browse all deployment guides in the model guides section.
Host XTTS-v2 on Dedicated GPUs
Run XTTS-v2 with voice cloning on dedicated GPU servers. Clone any voice from a 6-second sample with no API limits.
Browse GPU Servers