Coqui XTTS v2 is one of the leading open TTS models for multilingual voice cloning. Here are the numbers we measured on the RTX 5060 Ti 16GB on our hosting:
Setup
- Coqui TTS 0.22
- Model: XTTS v2 (multilingual, 17 languages)
- Sample rate: 24 kHz, mel 80-band
- FP16 inference, CUDA 12.6
XTTS v2 Throughput (Batch 1)
| Length (output audio) | Gen time | RTF |
|---|---|---|
| 5 sec | 0.85 s | 0.17 |
| 10 sec | 1.25 s | 0.125 |
| 20 sec | 2.20 s | 0.110 |
| 60 sec | 6.10 s | 0.102 |
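Real-time factor (RTF) is generation time divided by the duration of the audio produced. An RTF below 0.2 means audio is generated roughly 5-10x faster than it plays back, which is fast enough for interactive voice assistants.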
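As a quick check of the arithmetic, the sketch below recomputes RTF and the speed-up over real time from the batch-1 table. The numbers are copied from the table; the function and variable names are illustrative only.

```python
# (audio seconds, generation seconds) copied from the batch-1 table above
measurements = [(5, 0.85), (10, 1.25), (20, 2.20), (60, 6.10)]


def rtf(audio_s: float, gen_s: float) -> float:
    """Real-time factor: generation time divided by audio duration."""
    return gen_s / audio_s


def speedup(audio_s: float, gen_s: float) -> float:
    """How many times faster than playback the audio is generated."""
    return audio_s / gen_s


for audio_s, gen_s in measurements:
    print(
        f"{audio_s:>2} s audio: RTF {rtf(audio_s, gen_s):.3f}, "
        f"{speedup(audio_s, gen_s):.1f}x faster than real time"
    )
```

The speed-up grows with clip length, from just under 6x for a 5-second clip to just under 10x for a 60-second clip.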
Batch 4
| Length | Total time (4 items) | Per-item |
|---|---|---|
| 5 sec each | 2.2 s | 0.55 s |
| 10 sec each | 3.4 s | 0.85 s |
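Batching 4 items cuts per-item time by roughly 32-35% compared with batch 1 (0.85 s to 0.55 s for 5-second clips, 1.25 s to 0.85 s for 10-second clips). VRAM peaks at about 6 GB.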
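The per-item figures and the saving over batch 1 follow directly from the two tables. A minimal sketch of that calculation, using only the measured values above (the dictionary names are illustrative):

```python
BATCH_SIZE = 4

# Clip length in seconds -> generation time in seconds, from the tables above
batch1_time = {5: 0.85, 10: 1.25}       # one item at a time
batch4_total_time = {5: 2.2, 10: 3.4}   # total for 4 items

per_item = {}
savings = {}
for length_s, total_s in batch4_total_time.items():
    per_item[length_s] = total_s / BATCH_SIZE
    # Fraction of per-item time saved relative to batch 1
    savings[length_s] = 1 - per_item[length_s] / batch1_time[length_s]
    print(
        f"{length_s} s clips: {per_item[length_s]:.2f} s per item, "
        f"{savings[length_s]:.0%} less than batch 1"
    )
```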
Voice Cloning Latency
Provide a 6-second reference clip, then generate new speech in the cloned voice:
- Speaker encoding (one-time): ~300 ms
- Generation: same as uncloned speech (RTF ~0.1)
For persistent cloned voices, cache the speaker embedding in memory to skip the 300 ms on subsequent calls.
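One way to do that caching is a small in-memory wrapper around whatever function computes the speaker conditioning. The sketch below is library-agnostic. `SpeakerCache` and its `encode` callable are hypothetical names introduced here, not part of Coqui TTS. With Coqui's XTTS model class the callable would probably wrap `model.get_conditioning_latents(audio_path=[path])`, but treat that as an assumption and check it against your installed version.

```python
from typing import Any, Callable, Dict


class SpeakerCache:
    """Keep speaker conditioning in memory so encoding runs once per voice.

    `encode` is any callable that maps a reference clip path to the data
    the TTS model needs for cloning (hypothetical; supply your own).
    """

    def __init__(self, encode: Callable[[str], Any]) -> None:
        self._encode = encode
        self._cache: Dict[str, Any] = {}
        self.encode_calls = 0  # how many times the slow path actually ran

    def get(self, reference_wav: str) -> Any:
        # Only pay the encoding cost on the first request for this clip
        if reference_wav not in self._cache:
            self.encode_calls += 1
            self._cache[reference_wav] = self._encode(reference_wav)
        return self._cache[reference_wav]


# Usage with a stand-in encoder; replace it with the real encoding call
cache = SpeakerCache(encode=lambda path: f"embedding-for-{path}")
first = cache.get("alice.wav")   # runs the encoder
second = cache.get("alice.wav")  # served from memory
print(cache.encode_calls)
```

Keying on the file path is the simplest choice. If reference clips can be overwritten in place, keying on a hash of the file contents would be safer.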
Coqui TTS on Blackwell 16GB
RTF 0.1, voice cloning ready. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Bark TTS, Whisper benchmark, voice pipeline, voice assistant, podcast tools.