TTS Latency Benchmarks

Open Source Text-to-Speech Latency by Model & GPU — Real Hardware, Real Numbers

Compare first-audio latency and real-time factor (RTF) for Kokoro, XTTS-v2, Bark, F5-TTS, Chatterbox, Piper and more across GigaGPU’s dedicated GPU lineup. All tests run on UK bare metal servers with no shared resources.

Why TTS Latency Matters

For voice agents, IVR systems, audiobook pipelines and real-time narration, latency is the most important metric — it determines how quickly your users hear a response after sending text. A 50ms difference in time-to-first-audio can mean the difference between a natural conversation and an awkward pause.

We benchmark every open source TTS model on the same hardware, under the same conditions, so you can make an informed GPU choice for your workload. Every test runs on a GigaGPU dedicated GPU server — single-tenant bare metal, NVMe storage, and a dedicated GPU card. No virtualisation overhead, no noisy neighbours.

Below you’ll find first-audio latency (time from API call to first audio chunk), real-time factor (RTF), and throughput data for the most popular open source TTS models across our full GPU range.
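
As a concrete illustration, time-to-first-audio is simply how long a streaming response takes to yield its first chunk. A minimal Python sketch, with a fake stream standing in for a real TTS endpoint (no specific SDK is assumed):

```python
import time
from typing import Iterator

def time_to_first_audio_ms(chunks: Iterator[bytes]) -> float:
    """Milliseconds from the call until the first audio chunk arrives.

    `chunks` is any iterator of audio byte chunks, e.g. a streaming
    TTS client response (a hypothetical interface, not a specific SDK).
    """
    start = time.perf_counter()
    next(chunks)  # blocks until the first chunk is produced
    return (time.perf_counter() - start) * 1000.0

# Stand-in for a real streaming endpoint: yields its first chunk after ~50 ms.
def fake_tts_stream() -> Iterator[bytes]:
    time.sleep(0.05)
    yield b"\x00" * 1024

latency_ms = time_to_first_audio_ms(fake_tts_stream())
```

The same timer wrapped around a real streaming client gives the first-audio numbers reported below.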

  • 8+ TTS models tested
  • 11 GPU configurations
  • Metrics: real-time factor (RTF) and first-audio latency (ms)
  • UK bare metal servers
  • 0% virtualisation overhead
  • FP16 test precision
  • 50 runs per benchmark

All benchmarks run on dedicated single-tenant hardware — no shared GPUs, no throttling, no variance from other workloads.

Key Findings

The headline takeaways from our latest TTS latency benchmark round, covering the most deployed open source text-to-speech models.

Kokoro Is the Speed Champion

At just 82M parameters, Kokoro achieves RTF 0.03 even on a mid-range GPU — a 10-second clip synthesised in ~0.3 seconds. It consistently posts the lowest first-audio latency across every GPU tier, making it the default choice for latency-sensitive voice agents.

XTTS-v2 Trades Speed for Cloning

XTTS-v2’s voice cloning capability adds latency — expect 150–400ms first-audio depending on GPU. For applications where voice identity matters more than raw speed, it remains the leading open source option with 17-language support.

GPU Generation Matters Most

Blackwell 2.0 GPUs (RTX 5090, RTX 5080) deliver 30–40% lower latency than Ampere (RTX 3090) on the same model. For production voice agents targeting sub-200ms first-audio, the RTX 5090 is the clear winner.

RTX 3090 Remains Best Value

The RTX 3090’s 24GB VRAM and 936 GB/s memory bandwidth still deliver production-grade latency for most TTS models at the lowest cost per synthesised hour. It comfortably runs every model in this benchmark.

First-Audio Latency by Model & GPU

Time from API request to the first audio chunk being returned, in milliseconds. Lower is better. Tested with a standard 25-word English input sentence.

  • < 200ms (real-time ready)
  • 200–500ms (near real-time)
  • > 500ms (batch / offline)
| Model | RTX 3050 (6 GB) | RTX 4060 (8 GB) | RTX 4060 Ti (16 GB) | RTX 3090 (24 GB) | RTX 5090 (32 GB) | RTX 6000 PRO (96 GB) |
|---|---|---|---|---|---|---|
| Kokoro (82M) | 85ms | 62ms | 48ms | 40ms | 28ms | 25ms |
| Piper (ONNX) | 95ms | 70ms | 55ms | 45ms | 32ms | 30ms |
| MeloTTS | 130ms | 95ms | 72ms | 58ms | 42ms | 38ms |
| Chatterbox (0.5B) | 280ms | 190ms | 145ms | 110ms | 78ms | 70ms |
| F5-TTS (336M) | 320ms | 220ms | 165ms | 130ms | 90ms | 82ms |
| XTTS-v2 | OOM | 380ms | 260ms | 190ms | 135ms | 120ms |
| Bark | OOM | OOM | 680ms | 420ms | 290ms | 250ms |
| Dia (1.6B) | OOM | OOM | 820ms | 490ms | 340ms | 295ms |

OOM = Out of Memory — the model does not fit in available VRAM at FP16. Latency is the median of 50 runs after warm-up. Input: 25-word English sentence. Streaming mode where supported.

  • Fastest overall: Kokoro
  • Best first-audio latency: 28ms (Kokoro on RTX 5090)
  • Best value GPU: RTX 3090
  • Fastest with voice cloning: XTTS-v2

Real-Time Factor (RTF) by GPU

RTF measures synthesis speed relative to audio duration. RTF 0.05 means a 10-second clip is generated in 0.5 seconds. Lower RTF = faster. Values below 1.0 are faster than real-time.
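
The arithmetic is simple enough to sketch as a helper. A minimal Python example (illustrative, not part of the benchmark harness):

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: wall-clock generation time / output audio duration."""
    return generation_seconds / audio_seconds

# RTF 0.05: a 10-second clip generated in 0.5 seconds of wall-clock time.
example_rtf = rtf(0.5, 10.0)

# Speed-up over real time is 1 / RTF (here roughly 20x faster than real-time).
speedup = 1.0 / example_rtf
```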

| Model | Params | RTX 4060 Ti (16 GB) | RTX 3090 (24 GB) | RTX 5090 (32 GB) | RTX 6000 PRO (96 GB) |
|---|---|---|---|---|---|
| Kokoro | 82M | 0.03 | 0.02 | 0.01 | 0.01 |
| Piper | ~20M | 0.04 | 0.03 | 0.02 | 0.02 |
| MeloTTS | ~110M | 0.06 | 0.04 | 0.03 | 0.03 |
| Chatterbox | 0.5B | 0.12 | 0.08 | 0.05 | 0.05 |
| F5-TTS | 336M | 0.14 | 0.10 | 0.07 | 0.06 |
| XTTS-v2 | ~450M | 0.30 | 0.18 | 0.12 | 0.10 |
| Bark | ~600M | 0.60 | 0.40 | 0.28 | 0.25 |
| Dia | 1.6B | 0.75 | 0.48 | 0.34 | 0.30 |

RTF = Real-Time Factor. RTF 0.10 means 10 seconds of audio is generated in 1 second. All values measured at FP16 precision, single-stream inference, PyTorch backend.

Models Tested

The open source TTS models included in this benchmark round. Each model is tested at FP16 on identical hardware configurations.

| Model | Developer / Architecture | Notes |
|---|---|---|
| Kokoro | StyleTTS 2 architecture | 82M params · 9 languages · Apache 2.0 |
| XTTS-v2 | Coqui AI (community) | ~450M params · 17 languages · voice cloning |
| Bark | Suno AI | ~600M params · non-speech audio · MIT |
| F5-TTS | Flow matching | 336M params · zero-shot cloning · Apache 2.0 |
| Chatterbox | Resemble AI | 0.5B params · voice cloning · Llama backbone |
| Piper | Rhasspy | ~20M params · 30+ languages · edge / CPU |
| MeloTTS | MyShell | ~110M params · multilingual · MIT |
| Dia | Nari Labs | 1.6B params · multi-speaker dialogue |
| Parler-TTS | Hugging Face | ~600M params · text-described voice · Apache 2.0 |
| Spark-TTS | SparkAudio | 0.5B params · LLM backbone · Apache 2.0 |

VRAM Requirements by Model

How much GPU memory each TTS model needs at FP16 for single-stream inference — and which GigaGPU servers can run it.

| Model | VRAM (FP16) | Minimum GPU | Recommended GPU | Voice Agent Stack? |
|---|---|---|---|---|
| Kokoro (82M) | ~0.5 GB | RTX 3050 (6 GB) | RTX 4060 Ti (16 GB) | Yes — fits with LLM + ASR |
| Piper (ONNX) | ~0.3 GB | RTX 3050 (6 GB) | RTX 4060 (8 GB) | Yes — ultra-lightweight |
| MeloTTS | ~1 GB | RTX 3050 (6 GB) | RTX 4060 Ti (16 GB) | Yes — fits comfortably |
| Chatterbox (0.5B) | ~2–3 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Yes — with 24 GB+ GPU |
| F5-TTS (336M) | ~2 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Yes — with 24 GB+ GPU |
| XTTS-v2 | ~4–6 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Tight — 24 GB minimum |
| Bark | ~8–12 GB | RTX 4060 Ti (16 GB) | RTX 3090 (24 GB) | No — too heavy for stacking |
| Dia (1.6B) | ~10–14 GB | RTX 4060 Ti (16 GB) | RTX 5090 (32 GB) | 32 GB+ recommended |

Voice Agent Stack = ASR (Faster-Whisper ~3–4 GB) + LLM (7B Q4 ~6–8 GB) + TTS model running simultaneously on the same GPU. VRAM figures are for the TTS model alone.
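
To sanity-check whether a stack fits, sum the per-component estimates and leave headroom for the CUDA context and activations. A rough Python sketch using the worst-case figures above (the 2 GB headroom value is an assumption, not a measured figure):

```python
# Rough single-GPU VRAM budget check for a voice agent stack, using the
# approximate worst-case figures quoted above.
STACK_GB = {
    "Faster-Whisper (ASR)": 4.0,  # ~3-4 GB
    "7B LLM @ Q4": 8.0,           # ~6-8 GB
    "Kokoro (TTS)": 0.5,          # ~0.5 GB
}

def fits(gpu_vram_gb: float, components: dict, headroom_gb: float = 2.0) -> bool:
    """True if the stack plus headroom fits in the GPU's VRAM."""
    return sum(components.values()) + headroom_gb <= gpu_vram_gb

fits_3090 = fits(24.0, STACK_GB)  # RTX 3090 (24 GB): fits comfortably
fits_4060 = fits(8.0, STACK_GB)   # RTX 4060 (8 GB): full stack does not fit
```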

GPU Recommendations for TTS Workloads

Based on our benchmark results, here are the best GPU choices for different TTS deployment scenarios.

RTX 4060
8 GB VRAM
Budget Entry

8GB runs Kokoro, Piper, and MeloTTS with room to spare. A strong starting point for lightweight TTS APIs, internal narration tools, or adding TTS to an existing application on a tight budget.

Kokoro Piper MeloTTS
Configure RTX 4060 →
RTX 4060 Ti
16 GB VRAM
Development & Testing

16GB handles every lightweight TTS model in this benchmark. Ideal for prototyping voice agents, testing Kokoro or Piper, and development environments where you don’t need production concurrency.

Kokoro Piper MeloTTS F5-TTS
Configure RTX 4060 Ti →
RTX 3090
24 GB VRAM
Best Value for Production

The RTX 3090 runs every TTS model in this benchmark at production-ready latency. 24GB fits a full voice agent stack (Whisper + 7B LLM + TTS). Best price-to-performance for most deployments.

All models Voice agent stack XTTS-v2 cloning
Configure RTX 3090 →
RTX 5090
32 GB VRAM
Lowest Latency

Blackwell 2.0 delivers the lowest first-audio latency across every model. For production voice agents targeting sub-100ms time-to-first-audio with Kokoro or sub-150ms with XTTS-v2, this is the GPU.

Realtime voice agents Sub-100ms TTS Dia 1.6B
Configure RTX 5090 →
Radeon AI Pro R9700
32 GB VRAM
32 GB AMD Alternative

RDNA 4 architecture with 32GB VRAM at a competitive price point. A strong AMD option for teams running multi-model speech stacks or large batch TTS generation jobs with ROCm support.

Multi-model stacks Batch TTS ROCm ready
Configure R9700 →
RTX 6000 PRO
96 GB VRAM
Multi-Model & High Concurrency

96GB of GDDR7 runs multiple TTS models simultaneously, or a full voice agent stack with a 70B LLM. Designed for enterprise workloads with high concurrent request counts.

Enterprise pipelines 70B LLM + TTS Multi-voice serving
Configure RTX 6000 PRO →

Self-Hosted TTS vs API Pricing

At production volumes, self-hosted TTS on a dedicated GPU eliminates per-character and per-minute fees entirely.

Managed TTS APIs

  • ElevenLabs (Scale tier): ~£80/mo for 2M chars
  • Google Cloud TTS (Neural): £12 per 1M chars
  • Amazon Polly (Neural): £12.80 per 1M chars
  • Azure Cognitive TTS: £12 per 1M chars

Prices scale linearly with usage. At 10M+ characters/month, costs compound quickly. Audio data is processed on third-party infrastructure.
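
The break-even point between a fixed-price server and a per-character API is a one-line calculation. A sketch with a hypothetical £150/mo server price (an assumption for illustration; see the live pricing page for actual GigaGPU rates):

```python
def breakeven_chars_per_month(server_gbp_per_month: float,
                              api_gbp_per_million_chars: float) -> float:
    """Monthly character volume above which a fixed-price server is cheaper."""
    return server_gbp_per_month / api_gbp_per_million_chars * 1_000_000

# Hypothetical £150/mo server vs a £12-per-million-character API:
chars = breakeven_chars_per_month(150, 12)  # 12.5M characters/month
```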

GigaGPU Self-Hosted

  • RTX 4060 Ti (Kokoro, Piper): fixed/mo
  • RTX 3090 (all models): fixed/mo
  • RTX 5090 (lowest latency): fixed/mo
  • RTX 6000 PRO (enterprise): fixed/mo

Unlimited characters. Unlimited audio. Fixed monthly price. All audio stays on your server — no data leaves your environment. See live pricing →

Benchmark Methodology

How we test — consistent hardware, consistent software, consistent conditions.

Test Environment

  • Hardware: GigaGPU dedicated bare metal (single-tenant)
  • CPU: AMD Ryzen 9 / 128 GB DDR5
  • Storage: NVMe SSD
  • OS: Ubuntu 22.04 LTS
  • Framework: PyTorch 2.x / CUDA 12.x
  • Precision: FP16
  • Inference mode: single-stream, streaming where supported
  • Input: 25-word English sentence (standard test prompt)
  • Warm-up: 10 runs discarded before measurement
  • Measurement: median of 50 runs
  • RTF calculation: wall-clock generation time / output audio duration
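
The warm-up-then-median procedure can be sketched as a small harness. This is an illustrative reconstruction, not the actual benchmark code; `synthesise` stands in for a real model call:

```python
import statistics
import time

WARMUP_RUNS = 10     # discarded: first runs pay for kernel/cache warm-up
MEASURED_RUNS = 50   # the reported figure is the median of these

def benchmark_ms(synthesise, text: str) -> float:
    """Median wall-clock latency (ms) of one TTS call, after warm-up."""
    for _ in range(WARMUP_RUNS):
        synthesise(text)
    samples = []
    for _ in range(MEASURED_RUNS):
        t0 = time.perf_counter()
        synthesise(text)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# Dummy model standing in for a real inference call (~1 ms each).
median_ms = benchmark_ms(lambda text: time.sleep(0.001), "standard test prompt")
```

Reporting the median rather than the mean keeps one slow outlier (a GC pause, a page fault) from skewing the figure.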

Frequently Asked Questions

Common questions about TTS latency, GPU selection, and self-hosted text-to-speech performance.

What is first-audio latency in TTS?
First-audio latency measures the time between sending a text input to the TTS model and receiving the first chunk of synthesised audio back. For voice agents and real-time applications, this is the most important metric — it determines how quickly your user hears a response. Lower first-audio latency means more natural conversations with fewer awkward pauses.
What is Real-Time Factor (RTF)?
RTF measures how fast a TTS model generates audio relative to the audio’s duration. An RTF of 0.10 means 10 seconds of audio is generated in 1 second. An RTF below 1.0 means the model is faster than real-time — essential for streaming applications. Kokoro achieves RTF 0.02 on an RTX 3090, meaning it generates audio roughly 50× faster than real-time.
Which TTS model has the lowest latency?
Kokoro (82M parameters) consistently achieves the lowest first-audio latency across every GPU tier in our benchmarks. On an RTX 5090, it delivers 28ms first-audio latency. Piper is a close second for pure speed, especially on lower-end hardware, though it offers less natural-sounding output.
Which GPU should I choose for a real-time voice agent?
For a full voice agent stack (ASR + LLM + TTS on one GPU), the RTX 3090 (24 GB) is the best value — it fits Faster-Whisper, a 7B LLM at Q4, and Kokoro TTS comfortably. If you need the absolute lowest latency, the RTX 5090 (32 GB) delivers Blackwell-generation speed with more VRAM headroom.
Can I run XTTS-v2 voice cloning in real-time?
Yes, but with caveats. XTTS-v2’s voice cloning adds latency compared to non-cloning models. On an RTX 3090, expect ~190ms first-audio latency — acceptable for most voice agent use cases. On an RTX 5090, this drops to ~135ms. For latency-critical applications, consider Chatterbox, which offers its own voice cloning at lower latency, or Kokoro if cloning is not required.
How much VRAM do I need for TTS?
Most TTS models are lightweight. Kokoro needs under 1 GB, MeloTTS needs about 1 GB, Chatterbox and F5-TTS need 2–3 GB, and XTTS-v2 needs 4–6 GB. The only VRAM-heavy models are Bark (8–12 GB) and Dia (10–14 GB). For a voice agent stack where TTS runs alongside ASR and an LLM, 24 GB is the practical minimum.
Is self-hosted TTS faster than cloud APIs?
Typically yes — you eliminate network round-trip latency entirely. A self-hosted Kokoro endpoint on a local RTX 5090 delivers 28ms first-audio latency. The same request to a cloud API adds 50–200ms of network latency on top of the model’s own inference time. For latency-sensitive applications, self-hosting is almost always faster.
How were these benchmarks conducted?
All tests run on GigaGPU dedicated bare metal servers with no virtualisation overhead. Each model is tested at FP16 precision using PyTorch with a standard 25-word English input sentence. We discard 10 warm-up runs and report the median of 50 consecutive measurements. The same test prompt and methodology are used across all GPUs and models for consistent comparison.

Available on all servers

  • 1Gbps Port
  • NVMe Storage
  • 128GB DDR4/DDR5
  • Any OS
  • 99.9% Uptime
  • Root/Admin Access

Our dedicated GPU servers provide full hardware resources and a dedicated GPU card, ensuring consistent benchmark-grade performance for your TTS workloads. Deploy Kokoro, XTTS-v2, Bark, Chatterbox, or any open source speech model on the same hardware we use for these benchmarks.

Get in Touch

Not sure which GPU is right for your TTS workload? Our team can help you choose the right configuration based on your model, concurrency needs, and latency targets.

Contact Sales →

Or explore Speech Model Hosting for deployment guides and model-specific setup instructions.

Run These Benchmarks on Your Own Server

Fixed monthly pricing. Dedicated GPU. UK data centre. Deploy the same hardware used in these benchmarks and start generating speech in under an hour.
