TTS Latency Benchmarks
Open Source Text-to-Speech Latency by Model & GPU — Real Hardware, Real Numbers
Compare first-audio latency and real-time factor (RTF) for Kokoro, XTTS-v2, Bark, F5-TTS, Chatterbox, Piper and more across GigaGPU’s dedicated GPU lineup. All tests run on UK bare metal servers with no shared resources.
Why TTS Latency Matters
For voice agents, IVR systems, audiobook pipelines and real-time narration, latency is the most important metric — it determines how quickly your users hear a response after sending text. A 50ms difference in time-to-first-audio can mean the difference between a natural conversation and an awkward pause.
We benchmark every open source TTS model on the same hardware, under the same conditions, so you can make an informed GPU choice for your workload. Every test runs on a GigaGPU dedicated GPU server — single-tenant bare metal, NVMe storage, and a dedicated GPU card. No virtualisation overhead, no noisy neighbours.
Below you’ll find first-audio latency (time from API call to first audio chunk), real-time factor (RTF), and throughput data for the most popular open source TTS models across our full GPU range.
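First-audio latency is measured client-side: start a timer at the request, stop it when the first streamed audio chunk arrives. A minimal sketch, assuming a hypothetical `synthesize_stream()` generator that yields raw audio chunks (substitute your TTS client's streaming call):

```python
import time

def first_audio_latency_ms(synthesize_stream, text):
    """Milliseconds from request to the first audio chunk of a streaming TTS call."""
    start = time.perf_counter()
    for chunk in synthesize_stream(text):  # the first yielded chunk stops the clock
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream produced no audio")

# Example with a stand-in generator (replace with a real TTS client):
def fake_stream(text):
    time.sleep(0.05)      # simulate ~50 ms of model latency
    yield b"\x00" * 1024  # first PCM chunk

latency = first_audio_latency_ms(fake_stream, "Hello world")
```

Using `time.perf_counter()` rather than `time.time()` avoids clock adjustments skewing short measurements.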
All benchmarks run on dedicated single-tenant hardware — no shared GPUs, no throttling, no variance from other workloads.
Key Findings
The headline takeaways from our latest TTS latency benchmark round, covering the most deployed open source text-to-speech models.
Kokoro Is the Speed Champion
At just 82M parameters, Kokoro achieves RTF 0.03 on GPU — a 10-second clip synthesised in ~0.3 seconds. It consistently posts the lowest first-audio latency across every GPU tier, making it the default choice for latency-sensitive voice agents.
XTTS-v2 Trades Speed for Cloning
XTTS-v2’s voice cloning capability adds latency — expect 150–400ms first-audio depending on GPU. For applications where voice identity matters more than raw speed, it remains the leading open source option with 17-language support.
GPU Generation Matters Most
Blackwell 2.0 GPUs (RTX 5090, RTX 5080) deliver 30–40% lower latency than Ampere (RTX 3090) on the same model. For production voice agents targeting sub-200ms first-audio, the RTX 5090 is the clear winner.
RTX 3090 Remains Best Value
The RTX 3090’s 24GB VRAM and 936 GB/s bandwidth still deliver production-grade latency for most TTS models at the lowest cost per synthesised hour. It comfortably runs every model in this benchmark.
First-Audio Latency by Model & GPU
Time from API request to the first audio chunk being returned, in milliseconds. Lower is better. Tested with a standard 25-word English input sentence.
| Model | RTX 3050 6 GB | RTX 4060 8 GB | RTX 4060 Ti 16 GB | RTX 3090 24 GB | RTX 5090 32 GB | RTX 6000 PRO 96 GB |
|---|---|---|---|---|---|---|
| Kokoro (82M) | | | | | | |
| Piper (ONNX) | | | | | | |
| MeloTTS | | | | | | |
| Chatterbox (0.5B) | | | | | | |
| F5-TTS (336M) | | | | | | |
| XTTS-v2 | OOM | | | | | |
| Bark | OOM | OOM | | | | |
| Dia (1.6B) | OOM | OOM | | | | |
OOM = Out of Memory — model does not fit in available VRAM at FP16. Latency measured as median across 50 runs after warm-up. Input: 25-word English sentence. Streaming mode where supported.
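The per-cell protocol above is simple: discard warm-up runs (CUDA initialisation, kernel caching), then report the median of repeated timings. A sketch of that protocol, with `measure_once` standing in for a single synthesis call:

```python
import statistics
import time

def benchmark(measure_once, warmup=5, runs=50):
    """Median latency in ms over `runs` timed calls, after `warmup` discarded calls."""
    for _ in range(warmup):
        measure_once()  # warm-up runs are timed nowhere; they prime caches and CUDA state
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        measure_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Example with a dummy 1 ms workload:
median_ms = benchmark(lambda: time.sleep(0.001), warmup=2, runs=10)
```

The median is used rather than the mean so that occasional GC pauses or scheduler hiccups do not skew the reported figure.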
Real-Time Factor (RTF) by GPU
RTF measures synthesis speed relative to audio duration. RTF 0.05 means a 10-second clip is generated in 0.5 seconds. Lower RTF = faster. Values below 1.0 are faster than real-time.
| Model | Params | RTX 4060 Ti (16 GB) | RTX 3090 (24 GB) | RTX 5090 (32 GB) | RTX 6000 PRO (96 GB) |
|---|---|---|---|---|---|
| Kokoro | 82M | | | | |
| Piper | ~20M | | | | |
| MeloTTS | ~110M | | | | |
| Chatterbox | 0.5B | | | | |
| F5-TTS | 336M | | | | |
| XTTS-v2 | ~450M | | | | |
| Bark | ~600M | | | | |
| Dia | 1.6B | | | | |
RTF = Real-Time Factor. RTF 0.10 means 10 seconds of audio is generated in 1 second. All values measured at FP16 precision, single-stream inference, PyTorch backend.
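RTF is simply synthesis wall-clock time divided by the duration of the audio produced. A minimal helper, checked against the RTF 0.10 example in the note above:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF below 1.0 means synthesis is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

rtf = real_time_factor(1.0, 10.0)  # 10 s of audio generated in 1 s -> RTF 0.10
```

The same helper reproduces the Kokoro headline figure: `real_time_factor(0.3, 10.0)` gives RTF 0.03.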
Models Tested
The open source TTS models included in this benchmark round. Each model is tested at FP16 on identical hardware configurations.
VRAM Requirements by Model
How much GPU memory each TTS model needs at FP16 for single-stream inference — and which GigaGPU servers can run it.
| Model | VRAM (FP16) | Minimum GPU | Recommended GPU | Voice Agent Stack? |
|---|---|---|---|---|
| Kokoro (82M) | ~0.5 GB | RTX 3050 (6 GB) | RTX 4060 Ti (16 GB) | Yes — fits with LLM + ASR |
| Piper (ONNX) | ~0.3 GB | RTX 3050 (6 GB) | RTX 4060 (8 GB) | Yes — ultra-lightweight |
| MeloTTS | ~1 GB | RTX 3050 (6 GB) | RTX 4060 Ti (16 GB) | Yes — fits comfortably |
| Chatterbox (0.5B) | ~2–3 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Yes — with 24 GB+ GPU |
| F5-TTS (336M) | ~2 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Yes — with 24 GB+ GPU |
| XTTS-v2 | ~4–6 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Tight — 24 GB minimum |
| Bark | ~8–12 GB | RTX 4060 Ti (16 GB) | RTX 3090 (24 GB) | No — too heavy for stacking |
| Dia (1.6B) | ~10–14 GB | RTX 4060 Ti (16 GB) | RTX 5090 (32 GB) | 32 GB+ recommended |
Voice Agent Stack = ASR (Faster-Whisper ~3–4 GB) + LLM (7B Q4 ~6–8 GB) + TTS model running simultaneously on the same GPU. VRAM figures are for the TTS model alone.
GPU Recommendations for TTS Workloads
Based on our benchmark results, here are the best GPU choices for different TTS deployment scenarios.
8GB runs Kokoro, Piper, and MeloTTS with room to spare. A strong starting point for lightweight TTS APIs, internal narration tools, or adding TTS to an existing application on a tight budget.
16GB handles every lightweight TTS model in this benchmark. Ideal for prototyping voice agents, testing Kokoro or Piper, and development environments where you don’t need production concurrency.
The RTX 3090 runs every TTS model in this benchmark at production-ready latency. 24GB fits a full voice agent stack (Whisper + 7B LLM + TTS). Best price-to-performance for most deployments.
Blackwell 2.0 delivers the lowest first-audio latency across every model. For production voice agents targeting sub-100ms time-to-first-audio with Kokoro or sub-150ms with XTTS-v2, this is the GPU.
RDNA 4 architecture with 32GB VRAM at a competitive price point. A strong AMD option for teams running multi-model speech stacks or large batch TTS generation jobs with ROCm support.
96GB of GDDR7 runs multiple TTS models simultaneously, or a full voice agent stack with a 70B LLM. Designed for enterprise workloads with high concurrent request counts.
Self-Hosted TTS vs API Pricing
At production volumes, self-hosted TTS on a dedicated GPU eliminates per-character and per-minute fees entirely.
Managed TTS APIs
Prices scale linearly with usage. At 10M+ characters/month, costs compound quickly. Audio data is processed on third-party infrastructure.
GigaGPU Self-Hosted
Unlimited characters. Unlimited audio. Fixed monthly price. All audio stays on your server — no data leaves your environment. See live pricing →
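The break-even point between per-character API pricing and a fixed-price server is easy to estimate. A sketch with illustrative numbers only (the per-character rate and server price below are placeholders, not quotes from any provider):

```python
def break_even_chars(api_price_per_million_chars, server_price_per_month):
    """Monthly character volume above which a fixed-price server is cheaper."""
    return server_price_per_month / api_price_per_million_chars * 1_000_000

# Illustrative only: 15 currency units per 1M characters vs a 300/month server.
threshold = break_even_chars(15.0, 300.0)  # 20,000,000 characters/month
```

Above the computed threshold, every additional character on the API side adds cost, while the self-hosted price stays flat.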
Benchmark Methodology
How we test — consistent hardware, consistent software, consistent conditions.
Test Environment
Frequently Asked Questions
Common questions about TTS latency, GPU selection, and self-hosted text-to-speech performance.
Available on all servers
- 1Gbps Port
- NVMe Storage
- 128GB DDR4/DDR5
- Any OS
- 99.9% Uptime
- Root/Admin Access
Our dedicated GPU servers provide full hardware resources and a dedicated GPU card, ensuring consistent benchmark-grade performance for your TTS workloads. Deploy Kokoro, XTTS-v2, Bark, Chatterbox, or any open source speech model on the same hardware we use for these benchmarks.
Get in Touch
Not sure which GPU is right for your TTS workload? Our team can help you choose the right configuration based on your model, concurrency needs, and latency targets.
Contact Sales →
Or explore Speech Model Hosting for deployment guides and model-specific setup instructions.
Run These Benchmarks on Your Own Server
Fixed monthly pricing. Dedicated GPU. UK data centre. Deploy the same hardware used in these benchmarks and start generating speech in under an hour.