TTS Throughput Overview
Text-to-speech is a core component of voice agents, accessibility tools, and audio content platforms. When serving TTS at scale on a dedicated GPU server, you need to know how many requests per second each GPU can handle before latency becomes unacceptable. We benchmarked three popular open-source TTS models across six GPUs to provide concrete capacity planning data.
All tests ran on GigaGPU bare-metal servers. Each request synthesised approximately 30 words of English text (~3 seconds of output audio). We measured sustained requests per second and per-request latency at p50, p90, and p99 percentiles. For voice pipeline latency, see the voice agent latency benchmark.
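The measurement described above can be approximated with a small harness. This is a minimal sketch, not the harness used for these benchmarks: `synthesize` is a hypothetical callable standing in for whatever TTS client you use, requests are sent one at a time, and the nearest-rank percentile definition is an assumption, since the methodology does not say which definition was used.

```python
import math
import time
from typing import Callable, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(synthesize: Callable[[str], object], text: str, n_requests: int = 200) -> dict:
    """Send requests sequentially and summarise throughput and latency."""
    latencies_ms = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        synthesize(text)  # hypothetical call: assumed to block until the audio is ready
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    return {
        "requests_per_sec": n_requests / elapsed,
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

To reproduce the setup described here, you would pass a roughly 30-word English sentence as `text` and run long enough for the p99 figure to stabilise.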
Kokoro TTS Throughput by GPU
Kokoro is a lightweight, low-latency TTS model that prioritises speed — ideal for real-time voice agents.
| GPU | Requests/sec | p50 Latency | p90 Latency | p99 Latency |
|---|---|---|---|---|
| RTX 3050 (6 GB) | 5.2 | 185 ms | 210 ms | 240 ms |
| RTX 4060 (8 GB) | 10.5 | 92 ms | 105 ms | 120 ms |
| RTX 4060 Ti (16 GB) | 14.8 | 65 ms | 75 ms | 88 ms |
| RTX 3090 (24 GB) | 22.0 | 44 ms | 50 ms | 58 ms |
| RTX 5080 (16 GB) | 30.5 | 32 ms | 36 ms | 42 ms |
| RTX 5090 (32 GB) | 42.0 | 23 ms | 26 ms | 30 ms |
The RTX 5090 handles 42 Kokoro TTS requests per second with a p99 latency of 30 ms, which is effectively invisible in a voice pipeline. Even the RTX 4060 manages 10.5 req/s with a p99 latency of 120 ms, which is acceptable for most applications.
Bark TTS Throughput by GPU
Bark produces high-quality, expressive audio but requires significantly more compute than Kokoro.
| GPU | Requests/sec | p50 Latency | p90 Latency | p99 Latency |
|---|---|---|---|---|
| RTX 3050 (6 GB) | 0.18 | 5,200 ms | 5,600 ms | 6,100 ms |
| RTX 4060 (8 GB) | 0.40 | 2,400 ms | 2,650 ms | 2,900 ms |
| RTX 4060 Ti (16 GB) | 0.58 | 1,680 ms | 1,850 ms | 2,050 ms |
| RTX 3090 (24 GB) | 0.92 | 1,050 ms | 1,150 ms | 1,280 ms |
| RTX 5080 (16 GB) | 1.35 | 720 ms | 790 ms | 880 ms |
| RTX 5090 (32 GB) | 2.10 | 460 ms | 510 ms | 570 ms |
Bark is roughly 20-29x slower than Kokoro on the same GPU (for example, 0.92 vs 22.0 req/s on the RTX 3090). On the RTX 3090, it delivers under 1 request per second, which is usable for batch audio generation but too slow for real-time voice agents. The RTX 5090, at 2.1 req/s and 460 ms p50 latency, is borderline for interactive use.
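The size of the gap between the two models can be checked directly from the throughput columns of the Kokoro and Bark tables. The sketch below only divides the published requests/sec figures per GPU; the variable names are illustrative.

```python
# Requests/sec copied from the Kokoro and Bark tables above.
kokoro_rps = {
    "RTX 3050": 5.2, "RTX 4060": 10.5, "RTX 4060 Ti": 14.8,
    "RTX 3090": 22.0, "RTX 5080": 30.5, "RTX 5090": 42.0,
}
bark_rps = {
    "RTX 3050": 0.18, "RTX 4060": 0.40, "RTX 4060 Ti": 0.58,
    "RTX 3090": 0.92, "RTX 5080": 1.35, "RTX 5090": 2.10,
}

# How many times more requests per second Kokoro serves than Bark on each GPU.
slowdown = {gpu: kokoro_rps[gpu] / bark_rps[gpu] for gpu in kokoro_rps}

for gpu, factor in slowdown.items():
    print(f"{gpu}: Bark is {factor:.1f}x slower than Kokoro")
```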
XTTS v2 Throughput by GPU
XTTS v2 supports voice cloning and produces natural speech at moderate latency.
| GPU | Requests/sec | p50 Latency | p90 Latency | p99 Latency |
|---|---|---|---|---|
| RTX 3050 (6 GB) | 0.65 | 1,480 ms | 1,620 ms | 1,800 ms |
| RTX 4060 (8 GB) | 1.30 | 740 ms | 820 ms | 910 ms |
| RTX 4060 Ti (16 GB) | 1.85 | 525 ms | 580 ms | 645 ms |
| RTX 3090 (24 GB) | 2.80 | 345 ms | 380 ms | 425 ms |
| RTX 5080 (16 GB) | 3.90 | 248 ms | 275 ms | 305 ms |
| RTX 5090 (32 GB) | 5.60 | 172 ms | 190 ms | 215 ms |
XTTS v2 sits between Kokoro and Bark in both quality and speed. On the RTX 3090, at 2.8 req/s and 345 ms p50 latency, it is usable for near-real-time voice applications with voice cloning.
Per-Request Latency Comparison
Choosing between TTS models is fundamentally a latency-quality trade-off. Kokoro delivers p50 latency under 50 ms from the RTX 3090 upward and stays under 100 ms on the RTX 4060, making it ideal for real-time voice agents where speed matters more than expressiveness. XTTS v2 provides voice cloning at roughly 250-750 ms on the RTX 4060 through RTX 5080, suitable for personalised but not fully real-time use. Bark produces the most expressive audio but at roughly 0.5-5 seconds per request, limiting it mostly to batch and offline use.
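One way to apply this trade-off is to pick the most expressive model whose latency fits a given budget on a given GPU. The sketch below is illustrative only: `pick_model` is a hypothetical helper, the expressiveness ordering follows the article's own characterisation (Bark, then XTTS v2, then Kokoro), and using p50 rather than p99 as the budget metric is an assumption you may want to tighten.

```python
from typing import Optional

# p50 latency in milliseconds, copied from the three tables above.
P50_MS = {
    "Kokoro": {"RTX 3050": 185, "RTX 4060": 92, "RTX 4060 Ti": 65,
               "RTX 3090": 44, "RTX 5080": 32, "RTX 5090": 23},
    "XTTS v2": {"RTX 3050": 1480, "RTX 4060": 740, "RTX 4060 Ti": 525,
                "RTX 3090": 345, "RTX 5080": 248, "RTX 5090": 172},
    "Bark": {"RTX 3050": 5200, "RTX 4060": 2400, "RTX 4060 Ti": 1680,
             "RTX 3090": 1050, "RTX 5080": 720, "RTX 5090": 460},
}

# Most to least expressive, per the article's description of each model.
BY_EXPRESSIVENESS = ["Bark", "XTTS v2", "Kokoro"]


def pick_model(gpu: str, budget_ms: float) -> Optional[str]:
    """Return the most expressive model whose p50 latency fits the budget, or None."""
    for model in BY_EXPRESSIVENESS:
        if P50_MS[model][gpu] <= budget_ms:
            return model
    return None


print(pick_model("RTX 3090", 100))  # real-time voice agent budget
print(pick_model("RTX 3090", 500))  # near-real-time budget
```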
For voice agent pipelines, Kokoro is the default recommendation because the TTS stage needs to be nearly invisible. See the voice agent latency benchmark for full pipeline numbers. For capacity planning across all workload types, see the GPU capacity planning for AI SaaS guide. Use the LLM cost calculator to model total costs.
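For a first-pass capacity estimate, the per-GPU throughput figures can be turned into a GPU count for a target peak load. This is a rough sketch under stated assumptions: `gpus_needed` is a hypothetical helper, throughput is assumed to scale linearly across GPUs, and the 70% utilisation ceiling is an illustrative headroom figure, not something measured in these benchmarks.

```python
import math


def gpus_needed(peak_rps: float, per_gpu_rps: float, headroom: float = 0.7) -> int:
    """Estimate GPUs required to serve peak_rps.

    headroom is the fraction of benchmarked throughput you plan to use
    (assumed value; leaves slack for traffic bursts).
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable_rps = per_gpu_rps * headroom
    return math.ceil(peak_rps / usable_rps)


# Kokoro on the RTX 3090 sustains 22.0 req/s in the table above.
print(gpus_needed(peak_rps=50, per_gpu_rps=22.0))

# XTTS v2 on the RTX 5080 sustains 3.90 req/s.
print(gpus_needed(peak_rps=10, per_gpu_rps=3.90))
```

The request size matters: these throughput figures are for roughly 30-word inputs, so longer texts would lower the effective per-GPU rate.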
Conclusion
TTS throughput varies enormously by model: from 42 req/s (Kokoro on RTX 5090) to 0.18 req/s (Bark on RTX 3050). For real-time voice agents, Kokoro on an RTX 3090 (22 req/s, 44 ms p50 latency) is the value leader. For voice cloning workloads, XTTS v2 on the RTX 5080 delivers 3.9 req/s at 248 ms p50 latency. Browse all speech and audio benchmarks in the Benchmarks category at GigaGPU.