TTS Benchmark Update Overview
Text-to-speech latency is the critical metric for voice agents and real-time applications. Users perceive delays above 500ms as sluggish, and voice conversations become unnatural above 1 second of synthesis delay. This April 2026 benchmark update captures the latest performance data for open-source TTS models on dedicated GPU servers.
All tests generate 10 seconds of audio from a 50-word English text prompt. For the interactive benchmark tool, visit the TTS latency benchmarks page.
Latency Results by Model and GPU
Total time to generate 10 seconds of audio, plus the real-time factor (RTF: generation time divided by audio duration, so lower is faster):
| Model | RTX 3090 | RTX 5090 | RTX 5090 | RTF (RTX 5090) |
|---|---|---|---|---|
| F5-TTS | 2.9 s | 1.8 s | 1.2 s | 0.18 |
| XTTS v2 | 3.8 s | 2.4 s | 1.6 s | 0.24 |
| StyleTTS 2 | 1.4 s | 0.9 s | 0.6 s | 0.09 |
| Bark | 8.5 s | 5.2 s | 3.5 s | 0.52 |
| Piper (CPU) | 0.15 s | 0.15 s | 0.15 s | 0.015 |
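Piper generates the clip in 0.15 s in every configuration because it runs on CPU, giving it the lowest latency at the cost of more mechanical-sounding output. StyleTTS 2 is the fastest GPU model: it drops below 1 second on the RTX 5090 (0.9 s) but takes 1.4 s on the RTX 3090.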
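The RTF column can be reproduced from the table: each value equals the generation time in the 1.8 s / 2.4 s / 0.9 s / 5.2 s column divided by the 10 seconds of audio produced. The sketch below shows that arithmetic; the dictionary and function names are illustrative and not part of any benchmark tooling.

```python
AUDIO_SECONDS = 10.0  # every test in this benchmark generates 10 s of audio

# Generation times (seconds) copied from the column the RTF figures match.
generation_seconds = {
    "F5-TTS": 1.8,
    "XTTS v2": 2.4,
    "StyleTTS 2": 0.9,
    "Bark": 5.2,
    "Piper (CPU)": 0.15,
}


def real_time_factor(generation_s: float, audio_s: float = AUDIO_SECONDS) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


rtf = {model: real_time_factor(t) for model, t in generation_seconds.items()}

for model, value in rtf.items():
    # 1 / RTF is how many seconds of audio are produced per second of compute.
    print(f"{model:12s} RTF={value:.3f}  ({1 / value:.1f}x real time)")
```

An RTF below 1.0 means the model synthesises audio faster than it plays back, which is the minimum requirement for gapless streaming.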
Streaming First-Chunk Latency
For real-time applications, time to first audio chunk matters more than total generation time. Streaming synthesis begins playback before the full audio is generated:
| Model | First Chunk (RTX 5090) | Streaming Supported |
|---|---|---|
| F5-TTS | 180 ms | Yes (chunked) |
| XTTS v2 | 250 ms | Yes (native) |
| StyleTTS 2 | 95 ms | Yes (sentence-level) |
| Bark | 420 ms | Yes (semantic tokens) |
| Piper | 12 ms | Yes (native) |
StyleTTS 2 delivers the best first-chunk latency among high-quality models. For voice agents targeting sub-200ms audio response, it is the recommended choice on an RTX 5090 or better.
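First-chunk latency can be measured with any engine that exposes audio as an iterator of chunks. The following is a minimal sketch of such a timer. `fake_stream` is a hypothetical stand-in for a real streaming synthesiser, not the API of any model listed above, and its timings are simulated rather than benchmark data.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_stream(text: str, chunks: int = 5, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical streaming TTS engine: yields placeholder audio chunks."""
    for _ in range(chunks):
        time.sleep(delay_s)      # simulate synthesis work for one chunk
        yield b"\x00" * 3200     # placeholder PCM bytes, not real audio


def measure_stream(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (first_chunk_s, total_s, chunk_count) for a chunk iterator."""
    start = time.perf_counter()
    first_chunk_s = None
    count = 0
    for _chunk in stream:
        if first_chunk_s is None:
            # Time until the first audio could start playing.
            first_chunk_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    if first_chunk_s is None:
        raise ValueError("stream produced no audio")
    return first_chunk_s, total_s, count


first, total, n = measure_stream(fake_stream("Hello from the benchmark."))
print(f"first chunk: {first * 1000:.0f} ms, total: {total * 1000:.0f} ms, chunks: {n}")
```

Because the generator is lazy, the timer starts before any synthesis work happens, so the first-chunk figure includes the engine's startup cost for that request.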
Concurrent Synthesis Throughput
Simultaneous synthesis requests on an RTX 5090:
| Model | 1 Concurrent | 5 Concurrent | 10 Concurrent | VRAM at 10 |
|---|---|---|---|---|
| F5-TTS | 1.8 s | 3.2 s | 5.8 s | 12 GB |
| StyleTTS 2 | 0.9 s | 1.5 s | 2.8 s | 6 GB |
| XTTS v2 | 2.4 s | 4.1 s | 7.5 s | 14 GB |
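Latency grows sublinearly under concurrent load: going from 1 to 10 simultaneous requests raises latency by roughly 3x for all three models, not 10x. StyleTTS 2 stays under 3 seconds (2.8 s) at 10 concurrent sessions while using 6 GB of VRAM, making it suitable for multi-user voice applications on a single GPU.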
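The concurrency table can be turned into two derived figures: the slowdown relative to a single request, and the aggregate audio throughput. The sketch below assumes each figure is the per-request latency at that concurrency level and that every request produces the same 10-second clip. Both are readings of the table, not statements from the benchmark itself.

```python
AUDIO_SECONDS = 10.0  # assumed clip length per request, as in the other tests

# Latency in seconds at 1, 5 and 10 concurrent requests, from the table.
latency_s = {
    "F5-TTS": {1: 1.8, 5: 3.2, 10: 5.8},
    "StyleTTS 2": {1: 0.9, 5: 1.5, 10: 2.8},
    "XTTS v2": {1: 2.4, 5: 4.1, 10: 7.5},
}


def slowdown(latencies: dict, n: int) -> float:
    """How much slower a request is at concurrency n than when run alone."""
    return latencies[n] / latencies[1]


def aggregate_throughput(latencies: dict, n: int, audio_s: float = AUDIO_SECONDS) -> float:
    """Seconds of audio produced per wall-clock second across n requests.

    Assumes all n requests complete within latencies[n].
    """
    return n * audio_s / latencies[n]


for model, lat in latency_s.items():
    print(
        f"{model:10s} slowdown at 10: {slowdown(lat, 10):.2f}x, "
        f"throughput at 10: {aggregate_throughput(lat, 10):.1f} audio-s/s"
    )
```

Under these assumptions, a 10x increase in load costs about 3x in latency for each model, which implies roughly 3x more audio produced per second of GPU time than serving requests one at a time.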
Voice Agent Round-Trip Impact
In a complete voice agent pipeline (STT + LLM + TTS), TTS adds the final latency component. Using Whisper for STT and LLaMA 70B for reasoning on an RTX 5090, the TTS stage contributes 15-25% of total round-trip time. See the voice agent round-trip latency benchmark for full pipeline measurements.
Minimising TTS latency has an outsized impact on user experience because it is the last stage before the user hears the response. The best TTS models guide covers quality-latency trade-offs for each model.
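The 15-25% share can be inverted to estimate what total round trip a given TTS latency implies. This is back-of-the-envelope arithmetic, not a measured result. Using StyleTTS 2's 95 ms first-chunk figure as the TTS contribution is an assumption, since the text does not say which TTS latency the share refers to.

```python
def implied_round_trip_ms(tts_ms: float, tts_share: float) -> float:
    """Total round trip implied if TTS takes tts_ms and is tts_share of the total."""
    if not 0 < tts_share <= 1:
        raise ValueError("tts_share must be in (0, 1]")
    return tts_ms / tts_share


def non_tts_budget_ms(tts_ms: float, tts_share: float) -> float:
    """Time left for the STT and LLM stages under the same assumption."""
    return implied_round_trip_ms(tts_ms, tts_share) - tts_ms


TTS_MS = 95.0  # assumption: StyleTTS 2 first-chunk latency from the streaming table

for share in (0.15, 0.25):
    total = implied_round_trip_ms(TTS_MS, share)
    rest = non_tts_budget_ms(TTS_MS, share)
    print(f"TTS share {share:.0%}: total ~{total:.0f} ms, STT + LLM ~{rest:.0f} ms")
```

The same helpers work with any measured TTS latency and share, so they can be re-run against the full pipeline measurements.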
Hardware Recommendations
For dedicated TTS serving, an RTX 3090 handles any model comfortably. For voice agent stacks that share the GPU with an LLM, an RTX 5090 provides enough VRAM and throughput for TTS alongside a 13-27B model. For full-stack voice agents with a 70B LLM, consider a dual GPU setup or an RTX 6000 Pro. Review the voice agent infrastructure cost breakdown and the cheapest GPU guide for budget configurations.
Visit the benchmarks section for additional TTS performance data as new models are released.