
How Many TTS Requests per Second per GPU?

Text-to-speech throughput benchmarks — requests per second across six GPUs for Kokoro, Bark, and XTTS v2, with p50/p90/p99 latency per request.

TTS Throughput Overview

Text-to-speech is a core component of voice agents, accessibility tools, and audio content platforms. When serving TTS at scale on a dedicated GPU server, you need to know how many requests per second each GPU can handle before latency becomes unacceptable. We benchmarked three popular open-source TTS models across six GPUs to provide concrete capacity planning data.

All tests ran on GigaGPU bare-metal servers. Each request synthesised approximately 30 words of English text (~3 seconds of output audio). We measured sustained requests per second and per-request latency at p50, p90, and p99 percentiles. For voice pipeline latency, see the voice agent latency benchmark.
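A measurement setup along these lines can be sketched in a few lines of Python. This is a minimal illustration, not our actual harness: `synthesize` is a stub standing in for a real TTS client call, and the percentiles use a simple nearest-rank method.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def synthesize(text: str) -> bytes:
    """Stand-in for a real TTS call; replace with your model or server client."""
    time.sleep(0.05)  # simulate ~50 ms of synthesis
    return b"audio"

def run_load_test(n_requests: int = 200, concurrency: int = 8) -> dict:
    """Fire n_requests concurrent TTS calls; report req/s and latency percentiles."""
    latencies: list[float] = []

    def timed_call(_):
        start = time.perf_counter()
        synthesize("Roughly thirty words of English text, about three seconds of audio.")
        latencies.append((time.perf_counter() - start) * 1000)  # ms

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(n_requests)))
    wall = time.perf_counter() - wall_start

    latencies.sort()

    def pct(p: float) -> float:
        # nearest-rank percentile over the sorted latency list
        return latencies[min(len(latencies) - 1, int(p / 100 * len(latencies)))]

    return {
        "req_per_sec": n_requests / wall,
        "p50_ms": pct(50),
        "p90_ms": pct(90),
        "p99_ms": pct(99),
    }
```

Sweep `concurrency` upward until req/s plateaus; the plateau is the sustained throughput figure, and the percentiles at that point are what the tables below report.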

Kokoro TTS Throughput by GPU

Kokoro is a lightweight, low-latency TTS model that prioritises speed — ideal for real-time voice agents.

| GPU | Requests/sec | p50 Latency | p90 Latency | p99 Latency |
| --- | --- | --- | --- | --- |
| RTX 3050 (6 GB) | 5.2 | 185 ms | 210 ms | 240 ms |
| RTX 4060 (8 GB) | 10.5 | 92 ms | 105 ms | 120 ms |
| RTX 4060 Ti (16 GB) | 14.8 | 65 ms | 75 ms | 88 ms |
| RTX 3090 (24 GB) | 22.0 | 44 ms | 50 ms | 58 ms |
| RTX 5080 (16 GB) | 30.5 | 32 ms | 36 ms | 42 ms |
| RTX 5090 (32 GB) | 42.0 | 23 ms | 26 ms | 30 ms |

The RTX 5090 handles 42 Kokoro TTS requests per second with sub-30 ms latency, effectively invisible in a voice pipeline. Even the RTX 4060 manages 10.5 req/s at 120 ms p99 latency, which is acceptable for most applications.
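Turning these figures into capacity plans is simple arithmetic. A minimal sketch, assuming you keep each GPU at roughly 70% of its benchmarked ceiling so latency stays near the p50 figures (the 0.7 headroom factor is our assumption, not part of the benchmark):

```python
import math

# Sustained Kokoro throughput per GPU from the table above (req/s)
KOKORO_RPS = {
    "RTX 3050": 5.2,
    "RTX 4060": 10.5,
    "RTX 4060 Ti": 14.8,
    "RTX 3090": 22.0,
    "RTX 5080": 30.5,
    "RTX 5090": 42.0,
}

def gpus_needed(target_rps: float, gpu: str, headroom: float = 0.7) -> int:
    """GPUs required to serve target_rps, running each at ~70% of its
    benchmarked ceiling to leave latency headroom."""
    usable = KOKORO_RPS[gpu] * headroom
    return math.ceil(target_rps / usable)
```

For example, serving 100 req/s of Kokoro traffic on RTX 3090s works out to seven cards (100 / (22.0 × 0.7) ≈ 6.5, rounded up), versus four RTX 5090s.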

Bark TTS Throughput by GPU

Bark produces high-quality, expressive audio but requires significantly more compute than Kokoro.

| GPU | Requests/sec | p50 Latency | p90 Latency | p99 Latency |
| --- | --- | --- | --- | --- |
| RTX 3050 (6 GB) | 0.18 | 5,200 ms | 5,600 ms | 6,100 ms |
| RTX 4060 (8 GB) | 0.40 | 2,400 ms | 2,650 ms | 2,900 ms |
| RTX 4060 Ti (16 GB) | 0.58 | 1,680 ms | 1,850 ms | 2,050 ms |
| RTX 3090 (24 GB) | 0.92 | 1,050 ms | 1,150 ms | 1,280 ms |
| RTX 5080 (16 GB) | 1.35 | 720 ms | 790 ms | 880 ms |
| RTX 5090 (32 GB) | 2.10 | 460 ms | 510 ms | 570 ms |

Bark is roughly 20-30x slower than Kokoro across these GPUs. On the RTX 3090, it delivers under 1 request per second, which is usable for batch audio generation but too slow for real-time voice agents. The RTX 5090 at 2.1 req/s is borderline for interactive use.
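Another way to read these numbers is the real-time factor: seconds of audio produced per second of compute. With roughly 3 seconds of audio per request (per the test setup above), a rough calculation from the Bark p50 figures:

```python
def real_time_factor(audio_seconds: float, p50_latency_ms: float) -> float:
    """Seconds of audio synthesised per second of compute.
    RTF below 1.0 means synthesis is slower than playback."""
    return audio_seconds / (p50_latency_ms / 1000)

# Bark p50 latencies from the table, ~3 s of audio per request
rtf_3050 = real_time_factor(3.0, 5200)  # below 1.0: slower than real time
rtf_5090 = real_time_factor(3.0, 460)   # roughly 6.5x real time
```

An RTF below 1.0, as on the RTX 3050, means the GPU cannot even keep up with playback of a single stream, ruling out streaming use entirely.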

XTTS v2 Throughput by GPU

XTTS v2 supports voice cloning and produces natural speech at moderate latency.

| GPU | Requests/sec | p50 Latency | p90 Latency | p99 Latency |
| --- | --- | --- | --- | --- |
| RTX 3050 (6 GB) | 0.65 | 1,480 ms | 1,620 ms | 1,800 ms |
| RTX 4060 (8 GB) | 1.30 | 740 ms | 820 ms | 910 ms |
| RTX 4060 Ti (16 GB) | 1.85 | 525 ms | 580 ms | 645 ms |
| RTX 3090 (24 GB) | 2.80 | 345 ms | 380 ms | 425 ms |
| RTX 5080 (16 GB) | 3.90 | 248 ms | 275 ms | 305 ms |
| RTX 5090 (32 GB) | 5.60 | 172 ms | 190 ms | 215 ms |

XTTS v2 sits between Kokoro and Bark in both quality and speed. On the RTX 3090 at 2.8 req/s and 345 ms latency, it is usable for near-real-time voice applications with voice cloning.

Per-Request Latency Comparison

Choosing between TTS models is fundamentally a latency-quality trade-off. Kokoro delivers sub-50 ms p50 latency on the RTX 3090 and above, and under 100 ms even on mid-range cards, making it ideal for real-time voice agents where speed matters more than expressiveness. XTTS v2 provides voice cloning at roughly 170-750 ms p50 depending on GPU, suitable for personalised but not fully real-time use. Bark produces the most expressive audio but at 0.5-5 second latency, limiting it to batch and offline use.

For voice agent pipelines, Kokoro is the default recommendation because the TTS stage needs to be nearly invisible. See the voice agent latency benchmark for full pipeline numbers. For capacity planning across all workload types, see the GPU capacity planning for AI SaaS guide. Use the LLM cost calculator to model total costs.

Conclusion

TTS throughput varies enormously by model: from 42 req/s (Kokoro on RTX 5090) to 0.18 req/s (Bark on RTX 3050). For real-time voice agents, Kokoro on an RTX 3090 (22 req/s, 44 ms latency) is the value leader. For voice cloning workloads, XTTS v2 on the RTX 5080 delivers good throughput at manageable latency. Browse all speech and audio benchmarks in the Benchmarks category at GigaGPU.
