ElevenLabs leads on raw voice naturalness. XTTS v2 has caught up on many dimensions and is now the strongest open-weight TTS model with voice cloning.
ElevenLabs wins on prosody, emotional range, and language coverage. XTTS v2 wins on cost (self-hosted = effectively free per character once the server is paid for) and data control. For commercial voice products at scale, XTTS v2 is the right pick once your ElevenLabs bill exceeds the monthly cost of a dedicated GPU server (see the break-even under Cost).
Quality
- Naturalness: ElevenLabs slightly ahead
- Prosody / emotion: ElevenLabs ahead
- Voice cloning fidelity: comparable
- Multilingual: XTTS covers 17 languages, ElevenLabs ~30
- Latency: similar; XTTS self-hosted in your own region saves 80-150 ms of network round-trip time (RTT)
Cost
- ElevenLabs: ~$0.30 per 1K characters at scale
- XTTS v2 self-hosted (RTX 5060 Ti server at £119/mo, fixed): effectively £0.0001 per 1K characters at moderate utilisation
- Break-even: ~570K characters/month; above that volume the fixed server fee is cheaper than the ElevenLabs bill
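The break-even arithmetic can be sketched as below. This is a minimal illustration, not a pricing tool: the function names are hypothetical, the GBP to USD rate of 1.27 is an assumption that does not come from the figures above, and the model treats the server fee as the only self-hosting cost. The exact break-even moves with the exchange rate and with any extra costs you include.

```python
def break_even_chars(server_gbp_per_month: float,
                     usd_per_gbp: float,
                     api_usd_per_1k_chars: float) -> float:
    """Monthly character volume at which the API bill equals the fixed server fee."""
    # Convert the fixed server fee to USD so both sides use one currency.
    server_usd = server_gbp_per_month * usd_per_gbp
    # API cost grows linearly with volume; solve api_cost(chars) == server_usd.
    return server_usd / api_usd_per_1k_chars * 1000.0


def api_cost_usd(chars_per_month: float, api_usd_per_1k_chars: float) -> float:
    """Monthly API bill in USD for a given character volume."""
    return chars_per_month / 1000.0 * api_usd_per_1k_chars


if __name__ == "__main__":
    ASSUMED_USD_PER_GBP = 1.27  # assumption: not taken from the comparison itself
    chars = break_even_chars(119.0, ASSUMED_USD_PER_GBP, 0.30)
    print(f"Break-even: {chars:,.0f} characters/month")
    print(f"API bill at 570K characters: ${api_cost_usd(570_000, 0.30):,.2f}")
```

Under the assumed rate the break-even comes out at roughly half a million characters per month, the same order of magnitude as the figure quoted in the list. Swap in your own exchange rate and server price to see how far it shifts.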
Verdict
- <500K characters/mo, no privacy concerns: ElevenLabs
- >500K characters/mo: XTTS v2 self-hosted
- Voice cloning required, commercial use: XTTS v2
- Highest quality emotional TTS: ElevenLabs
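The verdict can be written down as a small decision helper. This is only a sketch: the function name, parameter names, and especially the order in which the rules are checked are assumptions, since the list does not say which rule wins when several apply (for example, high volume combined with a need for top-end emotional quality).

```python
def recommend_tts(chars_per_month: int,
                  privacy_sensitive: bool = False,
                  needs_commercial_cloning: bool = False,
                  needs_top_emotional_quality: bool = False,
                  volume_threshold: int = 500_000) -> str:
    """Return "ElevenLabs" or "XTTS v2" following the verdict rules.

    Rule priority is an assumption made for illustration only.
    """
    # Assumed top priority: if emotional quality is the deciding factor,
    # volume and cost are treated as secondary.
    if needs_top_emotional_quality:
        return "ElevenLabs"
    # Commercial voice cloning or privacy concerns point to self-hosting.
    if needs_commercial_cloning or privacy_sensitive:
        return "XTTS v2"
    # Otherwise the decision is purely about monthly volume.
    if chars_per_month > volume_threshold:
        return "XTTS v2"
    return "ElevenLabs"


if __name__ == "__main__":
    print(recommend_tts(200_000))
    print(recommend_tts(2_000_000))
    print(recommend_tts(200_000, needs_commercial_cloning=True))
```

Reordering the checks changes the outcome for mixed requirements, so adjust the priority to match what matters most for your product.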
Bottom line
Self-hosted XTTS v2 wins on cost above modest volume. ElevenLabs wins on quality at the very top end. See the Coqui voice assistant guide.