The TTS Landscape in 2025
Open-source text-to-speech has reached production quality. Models like Coqui XTTS-v2, Bark, and Kokoro TTS deliver natural-sounding speech with voice cloning, emotion control, and multilingual support. But TTS inference is latency-sensitive: users expect near-instant audio generation, which means your GPU choice directly affects user experience.
We benchmarked three leading TTS models across six GPUs available on GigaGPU dedicated servers to find the best hardware for every budget. Full interactive results are on our TTS latency benchmarks page.
TTS Latency Benchmarks by GPU
We generated a standardised 30-second speech clip (~75 words) and measured end-to-end generation time, including model inference and the vocoder pass. Lower latency means faster audio output. RTF (real-time factor) is generation time divided by audio duration, so an RTF of 0.07 means 30 seconds of speech takes about 2.1 seconds to generate.
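Here is a minimal sketch of this kind of measurement, assuming the Coqui TTS Python API; the prompt text and reference clip are placeholders:

```python
import time

import torch
from TTS.api import TTS  # Coqui TTS

# Load XTTS-v2 onto the GPU
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

text = "..."  # ~75 words, yielding roughly 30 seconds of speech
clip_seconds = 30.0

torch.cuda.synchronize()  # make sure the GPU is idle before timing
start = time.perf_counter()
tts.tts_to_file(text=text, speaker_wav="reference.wav",
                language="en", file_path="out.wav")
torch.cuda.synchronize()  # wait for all queued GPU work to finish
latency = time.perf_counter() - start

print(f"latency: {latency:.1f}s  RTF: {latency / clip_seconds:.2f}")
```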
Coqui XTTS-v2
| GPU | VRAM | Latency (30s clip) | RTF | Server $/hr |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 2.1 sec | 0.07 | $1.80 |
| RTX 4090 | 24 GB | 2.8 sec | 0.09 | $1.10 |
| RTX 6000 Pro | 48 GB | 3.2 sec | 0.11 | $1.30 |
| RTX 5080 | 16 GB | 3.6 sec | 0.12 | $0.85 |
| RTX 3090 | 24 GB | 5.4 sec | 0.18 | $0.45 |
| RTX 4060 | 8 GB | 11.7 sec | 0.39 | $0.20 |
Bark (Large, with Voice Cloning)
| GPU | Latency (30s clip) | RTF | Notes |
|---|---|---|---|
| RTX 5090 | 5.4 sec | 0.18 | Fastest option |
| RTX 4090 | 7.5 sec | 0.25 | Good for real-time |
| RTX 6000 Pro | 8.4 sec | 0.28 | Pro-grade |
| RTX 5080 | 9.9 sec | 0.33 | Fits, decent speed |
| RTX 3090 | 16.5 sec | 0.55 | Slower but works |
| RTX 4060 | OOM | — | Bark Large needs ~10 GB |
Kokoro TTS
| GPU | Latency (30s clip) | RTF | Notes |
|---|---|---|---|
| RTX 5090 | 0.8 sec | 0.03 | Near-instant |
| RTX 4090 | 1.1 sec | 0.04 | Excellent |
| RTX 6000 Pro | 1.3 sec | 0.04 | Excellent |
| RTX 5080 | 1.4 sec | 0.05 | Very fast |
| RTX 3090 | 2.1 sec | 0.07 | Still very good |
| RTX 4060 | 4.8 sec | 0.16 | Acceptable |
Kokoro is the lightest and fastest model, generating 30 seconds of speech in 2.1 seconds on the RTX 3090. Bark is the most demanding: the 3090 produces an RTF of 0.55, still faster than real-time playback but too slow for streaming use (more on thresholds below). XTTS sits in the middle, offering good quality with real-time-capable speeds on mid-range hardware.
Real-Time Factor Comparison
For voice AI applications, the critical threshold is RTF < 1.0 (generating audio faster than playback). For streaming TTS in a voice agent, you want RTF < 0.3 to leave room for STT, LLM processing, and network latency.
| GPU | XTTS-v2 RTF | Bark RTF | Kokoro RTF | Suitable for Streaming? |
|---|---|---|---|---|
| RTX 5090 | 0.07 | 0.18 | 0.03 | Yes (all models) |
| RTX 4090 | 0.09 | 0.25 | 0.04 | Yes (all models) |
| RTX 5080 | 0.12 | 0.33 | 0.05 | Yes (XTTS/Kokoro), marginal (Bark) |
| RTX 6000 Pro | 0.11 | 0.28 | 0.04 | Yes (all models) |
| RTX 3090 | 0.18 | 0.55 | 0.07 | Yes (XTTS/Kokoro), No (Bark streaming) |
| RTX 4060 | 0.39 | OOM | 0.16 | Kokoro only |
The RTX 3090 handles XTTS and Kokoro with comfortable real-time margins. Bark is the outlier: its autoregressive architecture makes it 3-4x slower than XTTS, pushing the 3090 past the streaming threshold. If Bark is your model of choice, you need at least an RTX 5080, and an RTX 4090 or 5090 for comfortable streaming margins. For Whisper integration, see our best GPU for Whisper guide.
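The streaming rule of thumb reduces to a comparison against that 0.3 budget. A toy check in Python, using RTF figures from the tables above:

```python
# Streaming feasibility check against the ~0.3 RTF budget discussed above.
STREAMING_BUDGET = 0.30

benchmarks = {  # (GPU, model): measured RTF from this post
    ("RTX 3090", "XTTS-v2"): 0.18,
    ("RTX 4090", "Bark"): 0.25,
    ("RTX 5080", "Bark"): 0.33,
    ("RTX 3090", "Bark"): 0.55,
}

for (gpu, model), rtf in benchmarks.items():
    verdict = "fits streaming budget" if rtf < STREAMING_BUDGET else "too slow"
    print(f"{gpu} + {model}: RTF {rtf:.2f} -> {verdict}")
```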
VRAM Requirements per Model
| TTS Model | VRAM (Inference) | Min GPU |
|---|---|---|
| Kokoro TTS | ~2 GB | RTX 4060 (8 GB) |
| Coqui XTTS-v2 | ~4 GB | RTX 4060 (8 GB) |
| Coqui XTTS-v2 + voice cloning | ~5 GB | RTX 4060 (8 GB) |
| Bark (Small) | ~5 GB | RTX 4060 (8 GB) |
| Bark (Large) | ~10 GB | RTX 3090 (24 GB) |
| Bark Large + speaker history | ~12 GB | RTX 3090 (24 GB) |
TTS models are relatively lightweight on VRAM compared to LLMs. The real question is what else you need on the same GPU. A voice agent pipeline typically runs Whisper + LLM + TTS on one card, and those combined VRAM needs push you toward 24 GB minimum.
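Before stacking models on one card, it is worth checking free VRAM at runtime. A small sketch using PyTorch; the per-component figures are the approximate values from the pipeline table later in this post:

```python
import torch

# Approximate footprints for a balanced stack (see the pipeline table below)
PIPELINE_GB = {
    "Whisper Large-v3": 5,
    "Llama 3 8B (4-bit)": 5,
    "XTTS-v2": 4,
}

free_b, total_b = torch.cuda.mem_get_info()  # bytes on the current device
needed_gb = sum(PIPELINE_GB.values())
free_gb = free_b / 1024**3

print(f"pipeline needs ~{needed_gb} GB; {free_gb:.1f} GB free of "
      f"{total_b / 1024**3:.1f} GB total")
if free_gb < needed_gb * 1.2:  # keep ~20% headroom for activations/KV cache
    print("warning: little headroom -- consider quantising the LLM")
```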
Cost Efficiency: Audio Hours per Dollar
For batch TTS workloads (audiobook generation, dataset creation, content dubbing), cost per hour of generated audio matters most.
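The figures below follow directly from the earlier tables: one GPU-hour produces 1/RTF hours of audio, divided by the hourly server price. For example:

```python
def audio_hours_per_dollar(rtf: float, price_per_hour: float) -> float:
    # One GPU-hour generates (1 / RTF) hours of audio; divide by hourly cost.
    return (1.0 / rtf) / price_per_hour

# RTX 3090 running XTTS-v2: RTF 0.18 at $0.45/hr
print(round(audio_hours_per_dollar(0.18, 0.45), 1))  # -> 12.3
```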
| GPU | XTTS Hours/$1 | Bark Hours/$1 | Kokoro Hours/$1 |
|---|---|---|---|
| RTX 3090 | 12.3 hrs | 4.0 hrs | 31.7 hrs |
| RTX 4060 | 12.8 hrs | OOM | 31.3 hrs |
| RTX 5080 | 9.8 hrs | 3.6 hrs | 23.5 hrs |
| RTX 4090 | 10.1 hrs | 3.6 hrs | 22.7 hrs |
| RTX 5090 | 7.9 hrs | 3.1 hrs | 18.5 hrs |
| RTX 6000 Pro | 7.0 hrs | 2.7 hrs | 19.2 hrs |
The RTX 4060 technically tops the XTTS column at 12.8 hours per dollar, but its 11.7-second latency limits it to batch work; the RTX 3090 is the practical value leader, generating 12.3 hours of XTTS audio and 4.0 hours of Bark audio per dollar while staying real-time capable. For Kokoro, the RTX 4060 matches the 3090 because Kokoro’s small size does not benefit from the extra bandwidth. This mirrors the patterns in our cheapest GPU for AI inference rankings.
GPU Requirements for Full Voice Agent Pipelines
A production voice agent runs three models simultaneously: speech-to-text, an LLM, and text-to-speech. Here is the combined VRAM footprint.
| Pipeline | STT Model | LLM | TTS Model | Total VRAM | Min GPU |
|---|---|---|---|---|---|
| Lightweight | Whisper Small (2 GB) | Phi-3 3.8B (8 GB) | Kokoro (2 GB) | ~12 GB | RTX 5080 (16 GB) |
| Balanced | Whisper Large-v3 (5 GB) | Llama 3 8B 4-bit (5 GB) | XTTS-v2 (4 GB) | ~14 GB | RTX 5080 (16 GB, tight) |
| High Quality | Whisper Large-v3 (5 GB) | Llama 3 8B FP16 (16 GB) | XTTS-v2 (4 GB) | ~25 GB | RTX 5090 (32 GB) |
| Best Quality | Whisper Large-v3 (5 GB) | Qwen 2.5 14B 4-bit (9 GB) | Bark Large (10 GB) | ~24 GB | RTX 3090 (24 GB, tight) |
The RTX 3090’s 24 GB of VRAM is the sweet spot for voice agent deployments: it fits every pipeline above except the High Quality build, which keeps the LLM at FP16. Our build a voice agent server tutorial walks through the full setup on a 3090. For the highest quality pipeline with a large LLM at full precision, step up to the RTX 5090 with 32 GB.
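To make the Balanced row concrete, here is a minimal, sequential sketch of one conversation turn, assuming faster-whisper, Transformers with bitsandbytes, and the Coqui TTS API. Model IDs and file paths are illustrative, and a production agent would stream each stage rather than run them back to back:

```python
from faster_whisper import WhisperModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from TTS.api import TTS

device = "cuda"

# STT: Whisper Large-v3 (~5 GB)
stt = WhisperModel("large-v3", device=device)

# LLM: Llama 3 8B in 4-bit via bitsandbytes (~5 GB)
llm_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(
    llm_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map=device,
)

# TTS: Coqui XTTS-v2 (~4 GB)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

def handle_turn(user_wav: str, speaker_wav: str, reply_wav: str) -> str:
    # 1. Transcribe the user's speech
    segments, _ = stt.transcribe(user_wav)
    user_text = " ".join(seg.text for seg in segments).strip()

    # 2. Generate a reply with the LLM
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": user_text}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(device)
    out = llm.generate(prompt, max_new_tokens=128)
    reply = tok.decode(out[0, prompt.shape[1]:], skip_special_tokens=True)

    # 3. Synthesise the reply in the cloned voice
    tts.tts_to_file(text=reply, speaker_wav=speaker_wav,
                    language="en", file_path=reply_wav)
    return reply
```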
GPU Recommendations
For Kokoro TTS (lightweight, fast):
- Budget: RTX 4060 (RTF 0.16, plenty fast)
- Best value: RTX 3090 (RTF 0.07, room for additional models)
For Coqui XTTS-v2 (best quality/speed balance):
- Best value: RTX 3090 (RTF 0.18, 12.3 audio hrs/$1)
- Best latency: RTX 5090 or RTX 4090 (RTF 0.07-0.09)
For Bark (highest naturalness, slowest):
- Minimum for streaming: RTX 5080 (RTF 0.33, marginal)
- Best for streaming: RTX 5090 or RTX 4090 (RTF 0.18-0.25)
- Batch generation: RTX 3090 (cheapest per audio hour)
For voice agent pipelines (STT + LLM + TTS):
- Best all-round: RTX 3090 (24 GB fits most combos)
- Premium: RTX 5090 (32 GB for FP16 LLMs + Bark)
Explore all speech model hosting options and compare GPUs in our GPU comparisons section. If you are also evaluating AMD hardware for voice AI, our AMD vs NVIDIA comparison explains why NVIDIA remains the safer bet for TTS workloads.
Launch a Voice AI Server
Deploy Coqui XTTS, Bark, or Kokoro on a dedicated GPU with pre-configured audio pipelines. Real-time TTS with zero per-character fees.
Browse GPU Servers