Text-to-speech is one of the most expensive AI services per character on the market: ElevenLabs, PlayHT and WellSaid all bill in the high tens of dollars per million characters. Hosting your own TTS stack on the RTX 5060 Ti 16GB via UK dedicated GPU hosting runs Coqui XTTS v2 at RTF 0.1 – a 5-second clip synthesised in 0.85 seconds – with voice cloning from a 6-second reference on one Blackwell card.
Model line-up
Three TTS families cover essentially every production use case. All fit comfortably in 16 GB and can run as separate processes behind a router that selects a backend based on the requested voice, language and style.
| Model | VRAM | Licence | Strength |
|---|---|---|---|
| Coqui XTTS v2 | 2.1 GB | CPML (non-commercial base; commercial via API) | Zero-shot voice cloning, 17 languages |
| Parler-TTS large | 4.8 GB | Apache 2.0 | Description-controlled voices |
| MeloTTS | 0.8 GB | MIT | Fast multilingual, low VRAM |
| Piper (CPU fallback) | – | MIT | Ultra-low latency, local voices |
| StyleTTS 2 | 1.6 GB | MIT | Expressive English, diffusion |
Speed and real-time factor
| Model | RTF | 5-sec clip | 30-sec clip |
|---|---|---|---|
| XTTS v2 | 0.10 | 0.85 s | 3.1 s |
| MeloTTS | 0.05 | 0.25 s | 1.5 s |
| Parler-TTS large | 0.35 | 1.8 s | 10.4 s |
| StyleTTS 2 | 0.08 | 0.42 s | 2.3 s |
See our Coqui TTS benchmark for the full profile. RTF well below 1.0 means you generate faster than playback, which is the precondition for barge-in-capable voice agents.
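RTF is simply compute time divided by audio duration. Note from the table that short clips carry fixed per-request overhead, so their effective RTF sits above the steady-state figure a long clip converges to:

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock synthesis time / output audio duration."""
    return synthesis_seconds / audio_seconds

# XTTS v2 figures from the table above:
print(round(rtf(3.1, 30.0), 2))   # 0.1  -> 30-s clip hits the steady-state RTF
print(round(rtf(0.85, 5.0), 2))   # 0.17 -> 5-s clip pays fixed startup overhead
```

Any value below 1.0 means audio is produced faster than it plays back, which is what lets a streaming agent interrupt and resynthesise mid-utterance.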
Voice cloning
XTTS v2 clones from a single six-second reference clip and preserves speaker identity across 17 languages. On one 5060 Ti the clone-and-synthesise latency for a 10-second output is under 2 seconds, making real-time brand-voice generation feasible for interactive apps. Store reference embeddings per tenant, not raw audio, to minimise data-protection exposure.
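A per-tenant embedding store along the lines suggested above might look like the sketch below. `extract_embedding` is a stand-in for XTTS's speaker-conditioning step (here faked with a hash so the example is self-contained); the point is that only the derived vector is persisted, never the raw reference audio.

```python
import hashlib

def extract_embedding(wav_bytes: bytes) -> list[float]:
    # Placeholder: real code would run the model's speaker encoder here.
    digest = hashlib.sha256(wav_bytes).digest()
    return [b / 255 for b in digest[:8]]

class TenantVoiceStore:
    """Keeps one voice embedding per tenant; raw audio is discarded."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}

    def register(self, tenant_id: str, wav_bytes: bytes) -> None:
        # Only the embedding is stored; wav_bytes never touches disk.
        self._store[tenant_id] = extract_embedding(wav_bytes)

    def embedding(self, tenant_id: str) -> list[float]:
        return self._store[tenant_id]
```

Discarding the reference clip after encoding shrinks the data-protection surface to a non-reversible vector rather than a biometric recording.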
Concurrency
For streaming chatbot-style apps where generation only needs to stay ahead of playback, RTF determines concurrency. XTTS v2 at RTF 0.1 supports roughly 10 concurrent streams before playback starves; MeloTTS at RTF 0.05 supports 20. Under pure batch (podcast generation, audiobook rendering), one card processes around 36,000 seconds of audio per hour.
| Workload | XTTS v2 | MeloTTS |
|---|---|---|
| Concurrent streaming voices | 10 | 20 |
| Batch audio-hours/day | 240 | 480 |
| Per-voice switching overhead | ~80 ms | ~20 ms |
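The concurrency and batch figures above fall straight out of the RTF. A quick sanity check (rounding, since RTF values are measurements, not exact constants):

```python
def max_streams(rtf: float) -> int:
    """Streams one card can keep ahead of real-time playback: 1 / RTF."""
    return round(1 / rtf)

def batch_audio_seconds_per_hour(rtf: float) -> int:
    """Pure-batch throughput: one wall-clock hour of compute."""
    return round(3600 / rtf)

print(max_streams(0.10))                    # 10 (XTTS v2)
print(max_streams(0.05))                    # 20 (MeloTTS)
print(batch_audio_seconds_per_hour(0.10))   # 36000
```

Multiplying out, 36,000 audio-seconds per hour is 10 audio-hours per wall-clock hour, or the 240 audio-hours/day in the table. In practice per-voice switching overhead and text-length variance shave a stream or two off the theoretical ceiling.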
Cost vs ElevenLabs
| Volume | ElevenLabs | Self-hosted 5060 Ti |
|---|---|---|
| 1M chars/month | $220 (£173) | Fixed monthly |
| 10M chars/month | $990 (£779) | Fixed monthly |
| 100M chars/month | $6,600 (£5,190) | Fixed monthly |
| 500M chars/month (audiobooks) | $30,000+ (£23,600+) | Fixed monthly |
Break-even against ElevenLabs Creator tier lands around 3M characters/month (roughly 50 hours of narration); above that, self-hosting is cheaper, private and removes per-character metering from the product architecture.
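The break-even arithmetic is a one-liner. The per-million-character rate comes from the table above; the $660 fixed monthly cost below is a hypothetical server price chosen to reproduce the ~3M-character break-even, not a quoted figure:

```python
def break_even_chars(fixed_monthly_usd: float, usd_per_million_chars: float) -> float:
    """Characters/month at which a fixed-cost server matches metered API spend."""
    return fixed_monthly_usd / usd_per_million_chars * 1_000_000

# Hypothetical $660/month server against the ~$220/M entry-tier rate:
print(break_even_chars(660, 220))  # 3000000.0
```

Past that point every additional character is free at the margin, which is what removes per-character metering from the product's unit economics.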
Private TTS API on Blackwell 16GB
XTTS v2 voice cloning at RTF 0.1. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: voice pipeline setup, Whisper API setup, STT API, startup MVP.