XTTS-v2 Overview
XTTS-v2 from Coqui is an advanced text-to-speech model with built-in voice cloning from short audio samples. It supports 17 languages and produces natural-sounding speech with emotion and prosody control. At approximately 470M parameters, it sits between lightweight models like Kokoro and heavyweight models like Bark. When self-hosting XTTS-v2 on a dedicated GPU server, understanding its VRAM profile is critical for co-hosting it with other models.
VRAM Requirements by Precision
| Precision | Model Weights | Generation Overhead | Total VRAM |
|---|---|---|---|
| FP32 | ~1.9 GB | ~1.5 GB | ~3.4 GB |
| FP16 / BF16 | ~1.0 GB | ~1.0 GB | ~2.0 GB |
| INT8 | ~0.5 GB | ~0.8 GB | ~1.3 GB |
XTTS-v2 at FP16 uses approximately 2 GB of VRAM during generation. This includes the GPT-2 style autoregressive decoder, the HiFi-GAN vocoder, and the speaker embedding encoder. The generation overhead accounts for intermediate tensors during the autoregressive decoding loop.
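As a sanity check on the table above, the weight portion of the footprint follows directly from the parameter count and bytes per parameter (the generation-overhead column is empirical and cannot be derived this way; the small gap versus the table reflects buffers and other non-weight state):

```python
def weight_vram_gb(params: int, bytes_per_param: float) -> float:
    """Estimate model-weight VRAM from parameter count and precision."""
    return params * bytes_per_param / 1024**3

PARAMS = 470_000_000  # approximate XTTS-v2 parameter count

print(round(weight_vram_gb(PARAMS, 4), 2))  # FP32: 1.75 GB
print(round(weight_vram_gb(PARAMS, 2), 2))  # FP16: 0.88 GB
print(round(weight_vram_gb(PARAMS, 1), 2))  # INT8: 0.44 GB
```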
Voice Cloning VRAM Overhead
XTTS-v2’s voice cloning feature extracts speaker embeddings from a reference audio clip (6+ seconds recommended). This adds a small VRAM spike during embedding extraction but does not increase steady-state VRAM during generation.
| Operation | Additional VRAM (FP16) | Duration |
|---|---|---|
| Speaker embedding extraction | ~0.3 GB temporary | ~0.5s |
| Generation with cloned voice | ~0 GB (embedding cached) | N/A |
| Multiple voice cache (10 voices) | ~0.01 GB | N/A |
Speaker embeddings are tiny (a few KB each) and can be pre-computed and cached. Running XTTS-v2 with 10+ cached voices adds negligible VRAM. For speed comparisons, see the TTS latency benchmarks.
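Because embeddings are so small, the caching pattern is trivial. A minimal sketch, where `extract_embedding` is a hypothetical stand-in for XTTS-v2's speaker encoder (not a real Coqui API call):

```python
from functools import lru_cache

def extract_embedding(reference_wav: str) -> bytes:
    # Hypothetical placeholder: the real extractor runs the speaker encoder
    # on ~6 s of reference audio, with a ~0.3 GB temporary VRAM spike.
    return b"\x00" * 4096  # real embeddings are only a few KB

@lru_cache(maxsize=32)
def cached_embedding(reference_wav: str) -> bytes:
    # The first call per voice pays the ~0.5 s extraction cost; repeat calls
    # hit the cache and add no steady-state VRAM beyond the few-KB vector.
    return extract_embedding(reference_wav)
```

With this pattern, ten cached voices occupy kilobytes of host memory, consistent with the ~0.01 GB figure in the table.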
GPU Recommendations
| GPU | VRAM | XTTS-v2 Capability | Real-Time Factor |
|---|---|---|---|
| RTX 3050 | 6 GB | FP16 + voice cloning, 4 GB free | ~3x |
| RTX 4060 | 8 GB | FP16 + co-hosting, 6 GB free | ~5x |
| RTX 4060 Ti | 16 GB | FP16 + multi-model, 14 GB free | ~6x |
| RTX 3090 | 24 GB | FP16 + full pipeline, 22 GB free | ~8x |
XTTS-v2 runs comfortably on any GPU with 6+ GB of VRAM. The RTX 4060 is the sweet spot, leaving 6 GB free for a co-hosted LLM or Whisper model.
Comparison with Bark and Kokoro
| Feature | XTTS-v2 | Bark | Kokoro |
|---|---|---|---|
| FP16 VRAM | ~2 GB | ~6 GB | ~0.4 GB |
| Voice Cloning | Yes (6s sample) | Limited (preset voices) | No |
| Languages | 17 | 13 | Limited |
| Speed (RTF) | 3-8x | 0.8-1.5x | 20-33x |
| Non-Speech Audio | No | Yes | No |
XTTS-v2 is the best choice when you need voice cloning and multilingual support. For raw speed, choose Kokoro. For creative audio including music and effects, choose Bark. See our Kokoro VRAM guide for the lightweight alternative.
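The real-time factor (RTF) column translates directly into wall-clock latency: an RTF of 5x means five seconds of audio are synthesized per second of compute. Using representative mid-range figures from the table above:

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to synthesize a clip at a given real-time factor."""
    return audio_seconds / rtf

# 60 s of narration at mid-range RTF values:
print(round(synthesis_seconds(60, 5), 1))   # XTTS-v2 (~5x): 12.0 s
print(round(synthesis_seconds(60, 1.0), 1)) # Bark (~1x): 60.0 s
print(round(synthesis_seconds(60, 25), 1))  # Kokoro (~25x): 2.4 s
```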
Deployment Recommendations
Deploy XTTS-v2 for applications requiring personalised voices: audiobook narration, virtual assistants with branded voices, or multilingual content creation. Pair it with a quantized LLaMA 3 model for script generation, with XTTS-v2 handling synthesis. On a single RTX 4060, both can run simultaneously: XTTS-v2's ~2 GB leaves room for a 4-bit quantized LLaMA 3 8B (~5 GB) in the 8 GB budget.
Use the GPU comparisons tool to evaluate hardware. Estimate costs with the cost calculator. Browse all deployment guides in the model guides section.
Host XTTS-v2 on Dedicated GPUs
Run XTTS-v2 with voice cloning on dedicated GPU servers. Clone any voice from a 6-second sample with no API limits.
Browse GPU Servers