TTS Latency Benchmarks

Open Source Text-to-Speech Latency by Model & GPU — Real Hardware, Real Numbers

Compare first-audio latency and real-time factor (RTF) for Kokoro, XTTS-v2, Bark, F5-TTS, Chatterbox, Piper and more across GigaGPU’s dedicated GPU lineup. All tests run on UK bare metal servers with no shared resources.

Why TTS Latency Matters

For voice agents, IVR systems, audiobook pipelines and real-time narration, latency is the most important metric — it determines how quickly your users hear a response after sending text. A 50ms difference in time-to-first-audio can mean the difference between a natural conversation and an awkward pause.

We benchmark every open source TTS model on the same hardware, under the same conditions, so you can make an informed GPU choice for your workload. Every test runs on a GigaGPU dedicated GPU server — single-tenant bare metal, NVMe storage, and a dedicated GPU card. No virtualisation overhead, no noisy neighbours.

Below you’ll find first-audio latency (time from API call to first audio chunk), real-time factor (RTF), and throughput data for the most popular open source TTS models across our full GPU range.
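
As a concrete illustration, time-to-first-audio is simply how long a streaming response takes to yield its first chunk. A minimal Python sketch, with a fake stream standing in for a real TTS endpoint (no specific SDK is assumed):

```python
import time
from typing import Iterator

def time_to_first_audio_ms(chunks: Iterator[bytes]) -> float:
    """Milliseconds from the call until the first audio chunk arrives.

    `chunks` is any iterator of audio byte chunks, e.g. a streaming
    TTS client response (a hypothetical interface, not a specific SDK).
    """
    start = time.perf_counter()
    next(chunks)  # blocks until the first chunk is produced
    return (time.perf_counter() - start) * 1000.0

# Stand-in for a real streaming endpoint: yields its first chunk after ~50 ms.
def fake_tts_stream() -> Iterator[bytes]:
    time.sleep(0.05)
    yield b"\x00" * 1024

latency_ms = time_to_first_audio_ms(fake_tts_stream())
```

The same timer wrapped around a real streaming client gives the first-audio numbers reported below.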

  • 8+ TTS models tested
  • 11 GPU configurations
  • Metrics: real-time factor (RTF) and first-audio latency (ms)
  • UK bare metal servers
  • 0% virtualisation overhead
  • FP16 test precision
  • 50 runs per benchmark

All benchmarks run on dedicated single-tenant hardware — no shared GPUs, no throttling, no variance from other workloads.

Key Findings

The headline takeaways from our latest TTS latency benchmark round, covering the most deployed open source text-to-speech models.

Kokoro Is the Speed Champion

At just 82M parameters, Kokoro achieves RTF 0.03 even on a mid-range GPU — a 10-second clip synthesised in ~0.3 seconds. It consistently posts the lowest first-audio latency across every GPU tier, making it the default choice for latency-sensitive voice agents.

XTTS-v2 Trades Speed for Cloning

XTTS-v2’s voice cloning capability adds latency — expect 150–400ms first-audio depending on GPU. For applications where voice identity matters more than raw speed, it remains the leading open source option with 17-language support.

GPU Generation Matters Most

Blackwell 2.0 GPUs (RTX 5090, RTX 5080) deliver 30–40% lower latency than Ampere (RTX 3090) on the same model. For production voice agents targeting sub-200ms first-audio, the RTX 5090 is the clear winner.

RTX 3090 Remains Best Value

The RTX 3090’s 24GB VRAM and 936 GB/s memory bandwidth still deliver production-grade latency for most TTS models at the lowest cost per synthesised hour. It comfortably runs every model in this benchmark.

First-Audio Latency by Model & GPU

Time from API request to the first audio chunk being returned, in milliseconds. Lower is better. Tested with a standard 25-word English input sentence.

  • < 200ms (real-time ready)
  • 200–500ms (near real-time)
  • > 500ms (batch / offline)
| Model | RTX 3050 (6 GB) | RTX 4060 (8 GB) | RTX 4060 Ti (16 GB) | RTX 3090 (24 GB) | RTX 5090 (32 GB) | RTX 6000 PRO (96 GB) |
|---|---|---|---|---|---|---|
| Kokoro (82M) | 85ms | 62ms | 48ms | 40ms | 28ms | 25ms |
| Piper (ONNX) | 95ms | 70ms | 55ms | 45ms | 32ms | 30ms |
| MeloTTS | 130ms | 95ms | 72ms | 58ms | 42ms | 38ms |
| Chatterbox (0.5B) | 280ms | 190ms | 145ms | 110ms | 78ms | 70ms |
| F5-TTS (336M) | 320ms | 220ms | 165ms | 130ms | 90ms | 82ms |
| XTTS-v2 | OOM | 380ms | 260ms | 190ms | 135ms | 120ms |
| Bark | OOM | OOM | 680ms | 420ms | 290ms | 250ms |
| Dia (1.6B) | OOM | OOM | 820ms | 490ms | 340ms | 295ms |

OOM = Out of Memory — the model does not fit in available VRAM at FP16. Latency is the median of 50 runs after warm-up. Input: 25-word English sentence. Streaming mode where supported.

  • Fastest overall: Kokoro
  • Best first-audio latency: 28ms (Kokoro on RTX 5090)
  • Best value GPU: RTX 3090
  • Fastest with voice cloning: XTTS-v2

Real-Time Factor (RTF) by GPU

RTF measures synthesis speed relative to audio duration. RTF 0.05 means a 10-second clip is generated in 0.5 seconds. Lower RTF = faster. Values below 1.0 are faster than real-time.
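
The arithmetic is simple enough to sketch as a helper. A minimal Python example (illustrative, not part of the benchmark harness):

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: wall-clock generation time / output audio duration."""
    return generation_seconds / audio_seconds

# RTF 0.05: a 10-second clip generated in 0.5 seconds of wall-clock time.
example_rtf = rtf(0.5, 10.0)

# Speed-up over real time is 1 / RTF (here roughly 20x faster than real-time).
speedup = 1.0 / example_rtf
```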

| Model | Params | RTX 4060 Ti (16 GB) | RTX 3090 (24 GB) | RTX 5090 (32 GB) | RTX 6000 PRO (96 GB) |
|---|---|---|---|---|---|
| Kokoro | 82M | 0.03 | 0.02 | 0.01 | 0.01 |
| Piper | ~20M | 0.04 | 0.03 | 0.02 | 0.02 |
| MeloTTS | ~110M | 0.06 | 0.04 | 0.03 | 0.03 |
| Chatterbox | 0.5B | 0.12 | 0.08 | 0.05 | 0.05 |
| F5-TTS | 336M | 0.14 | 0.10 | 0.07 | 0.06 |
| XTTS-v2 | ~450M | 0.30 | 0.18 | 0.12 | 0.10 |
| Bark | ~600M | 0.60 | 0.40 | 0.28 | 0.25 |
| Dia | 1.6B | 0.75 | 0.48 | 0.34 | 0.30 |

RTF = Real-Time Factor. RTF 0.10 means 10 seconds of audio is generated in 1 second. All values measured at FP16 precision, single-stream inference, PyTorch backend.

Models Tested

The open source TTS models included in this benchmark round. Each model is tested at FP16 on identical hardware configurations.

| Model | Developer / Architecture | Notes |
|---|---|---|
| Kokoro | StyleTTS 2 architecture | 82M params · 9 languages · Apache 2.0 |
| XTTS-v2 | Coqui AI (community) | ~450M params · 17 languages · voice cloning |
| Bark | Suno AI | ~600M params · non-speech audio · MIT |
| F5-TTS | Flow matching | 336M params · zero-shot cloning · Apache 2.0 |
| Chatterbox | Resemble AI | 0.5B params · voice cloning · Llama backbone |
| Piper | Rhasspy | ~20M params · 30+ languages · edge / CPU |
| MeloTTS | MyShell | ~110M params · multilingual · MIT |
| Dia | Nari Labs | 1.6B params · multi-speaker dialogue |
| Parler-TTS | Hugging Face | ~600M params · text-described voice · Apache 2.0 |
| Spark-TTS | SparkAudio | 0.5B params · LLM backbone · Apache 2.0 |

VRAM Requirements by Model

How much GPU memory each TTS model needs at FP16 for single-stream inference — and which GigaGPU servers can run it.

| Model | VRAM (FP16) | Minimum GPU | Recommended GPU | Voice Agent Stack? |
|---|---|---|---|---|
| Kokoro (82M) | ~0.5 GB | RTX 3050 (6 GB) | RTX 4060 Ti (16 GB) | Yes — fits with LLM + ASR |
| Piper (ONNX) | ~0.3 GB | RTX 3050 (6 GB) | RTX 4060 (8 GB) | Yes — ultra-lightweight |
| MeloTTS | ~1 GB | RTX 3050 (6 GB) | RTX 4060 Ti (16 GB) | Yes — fits comfortably |
| Chatterbox (0.5B) | ~2–3 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Yes — with 24 GB+ GPU |
| F5-TTS (336M) | ~2 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Yes — with 24 GB+ GPU |
| XTTS-v2 | ~4–6 GB | RTX 4060 (8 GB) | RTX 3090 (24 GB) | Tight — 24 GB minimum |
| Bark | ~8–12 GB | RTX 4060 Ti (16 GB) | RTX 3090 (24 GB) | No — too heavy for stacking |
| Dia (1.6B) | ~10–14 GB | RTX 4060 Ti (16 GB) | RTX 5090 (32 GB) | 32 GB+ recommended |

Voice Agent Stack = ASR (Faster-Whisper ~3–4 GB) + LLM (7B Q4 ~6–8 GB) + TTS model running simultaneously on the same GPU. VRAM figures are for the TTS model alone.
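
To sanity-check whether a stack fits, sum the per-component estimates and leave headroom for the CUDA context and activations. A rough Python sketch using the worst-case figures above (the 2 GB headroom value is an assumption, not a measured figure):

```python
# Rough single-GPU VRAM budget check for a voice agent stack, using the
# approximate worst-case figures quoted above.
STACK_GB = {
    "Faster-Whisper (ASR)": 4.0,  # ~3-4 GB
    "7B LLM @ Q4": 8.0,           # ~6-8 GB
    "Kokoro (TTS)": 0.5,          # ~0.5 GB
}

def fits(gpu_vram_gb: float, components: dict, headroom_gb: float = 2.0) -> bool:
    """True if the stack plus headroom fits in the GPU's VRAM."""
    return sum(components.values()) + headroom_gb <= gpu_vram_gb

fits_3090 = fits(24.0, STACK_GB)  # RTX 3090 (24 GB): fits comfortably
fits_4060 = fits(8.0, STACK_GB)   # RTX 4060 (8 GB): full stack does not fit
```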

GPU Recommendations for TTS Workloads

Based on our benchmark results, here are the best GPU choices for different TTS deployment scenarios.

RTX 4060
8 GB VRAM
Budget Entry

8GB runs Kokoro, Piper, and MeloTTS with room to spare. A strong starting point for lightweight TTS APIs, internal narration tools, or adding TTS to an existing application on a tight budget.

Kokoro Piper MeloTTS
Configure RTX 4060 →
RTX 4060 Ti
16 GB VRAM
Development & Testing

16GB handles every lightweight TTS model in this benchmark. Ideal for prototyping voice agents, testing Kokoro or Piper, and development environments where you don’t need production concurrency.

Kokoro Piper MeloTTS F5-TTS
Configure RTX 4060 Ti →
RTX 3090
24 GB VRAM
Best Value for Production

The RTX 3090 runs every TTS model in this benchmark at production-ready latency. 24GB fits a full voice agent stack (Whisper + 7B LLM + TTS). Best price-to-performance for most deployments.

All models Voice agent stack XTTS-v2 cloning
Configure RTX 3090 →
RTX 5090
32 GB VRAM
Lowest Latency

Blackwell 2.0 delivers the lowest first-audio latency across every model. For production voice agents targeting sub-100ms time-to-first-audio with Kokoro or sub-150ms with XTTS-v2, this is the GPU.

Realtime voice agents Sub-100ms TTS Dia 1.6B
Configure RTX 5090 →
Radeon AI Pro R9700
32 GB VRAM
32 GB AMD Alternative

RDNA 4 architecture with 32GB VRAM at a competitive price point. A strong AMD option for teams running multi-model speech stacks or large batch TTS generation jobs with ROCm support.

Multi-model stacks Batch TTS ROCm ready
Configure R9700 →
RTX 6000 PRO
96 GB VRAM
Multi-Model & High Concurrency

96GB of GDDR7 runs multiple TTS models simultaneously, or a full voice agent stack with a 70B LLM. Designed for enterprise workloads with high concurrent request counts.

Enterprise pipelines 70B LLM + TTS Multi-voice serving
Configure RTX 6000 PRO →

Self-Hosted TTS vs API Pricing

At production volumes, self-hosted TTS on a dedicated GPU eliminates per-character and per-minute fees entirely.

Managed TTS APIs

  • ElevenLabs (Scale tier): ~£80/mo for 2M chars
  • Google Cloud TTS (Neural): £12 per 1M chars
  • Amazon Polly (Neural): £12.80 per 1M chars
  • Azure Cognitive TTS: £12 per 1M chars

Prices scale linearly with usage. At 10M+ characters/month, costs compound quickly. Audio data is processed on third-party infrastructure.
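
The break-even point between a fixed-price server and a per-character API is a one-line calculation. A sketch with a hypothetical £150/mo server price (an assumption for illustration; see the live pricing page for actual GigaGPU rates):

```python
def breakeven_chars_per_month(server_gbp_per_month: float,
                              api_gbp_per_million_chars: float) -> float:
    """Monthly character volume above which a fixed-price server is cheaper."""
    return server_gbp_per_month / api_gbp_per_million_chars * 1_000_000

# Hypothetical £150/mo server vs a £12-per-million-character API:
chars = breakeven_chars_per_month(150, 12)  # 12.5M characters/month
```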

GigaGPU Self-Hosted

  • RTX 4060 Ti (Kokoro, Piper): fixed/mo
  • RTX 3090 (all models): fixed/mo
  • RTX 5090 (lowest latency): fixed/mo
  • RTX 6000 PRO (enterprise): fixed/mo

Unlimited characters. Unlimited audio. Fixed monthly price. All audio stays on your server — no data leaves your environment. See live pricing →

Benchmark Methodology

How we test — consistent hardware, consistent software, consistent conditions.

Test Environment

  • Hardware: GigaGPU dedicated bare metal (single-tenant)
  • CPU: AMD Ryzen 9 / 128 GB DDR5
  • Storage: NVMe SSD
  • OS: Ubuntu 22.04 LTS
  • Framework: PyTorch 2.x / CUDA 12.x
  • Precision: FP16
  • Inference mode: single-stream, streaming where supported
  • Input: 25-word English sentence (standard test prompt)
  • Warm-up: 10 runs discarded before measurement
  • Measurement: median of 50 runs
  • RTF calculation: wall-clock generation time / output audio duration
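
The warm-up-then-median procedure can be sketched as a small harness. This is an illustrative reconstruction, not the actual benchmark code; `synthesise` stands in for a real model call:

```python
import statistics
import time

WARMUP_RUNS = 10     # discarded: first runs pay for kernel/cache warm-up
MEASURED_RUNS = 50   # the reported figure is the median of these

def benchmark_ms(synthesise, text: str) -> float:
    """Median wall-clock latency (ms) of one TTS call, after warm-up."""
    for _ in range(WARMUP_RUNS):
        synthesise(text)
    samples = []
    for _ in range(MEASURED_RUNS):
        t0 = time.perf_counter()
        synthesise(text)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# Dummy model standing in for a real inference call (~1 ms each).
median_ms = benchmark_ms(lambda text: time.sleep(0.001), "standard test prompt")
```

Reporting the median rather than the mean keeps one slow outlier (a GC pause, a page fault) from skewing the figure.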

Frequently Asked Questions

Common questions about TTS latency, GPU selection, and self-hosted text-to-speech performance.

What is first-audio latency in TTS?
First-audio latency measures the time between sending a text input to the TTS model and receiving the first chunk of synthesised audio back. For voice agents and real-time applications, this is the most important metric — it determines how quickly your user hears a response. Lower first-audio latency means more natural conversations with fewer awkward pauses.
What is Real-Time Factor (RTF)?
RTF measures how fast a TTS model generates audio relative to the audio’s duration. An RTF of 0.10 means 10 seconds of audio is generated in 1 second. An RTF below 1.0 means the model is faster than real-time — essential for streaming applications. Kokoro achieves RTF 0.02 on an RTX 3090, meaning it generates audio roughly 50× faster than real-time.
Which TTS model has the lowest latency?
Kokoro (82M parameters) consistently achieves the lowest first-audio latency across every GPU tier in our benchmarks. On an RTX 5090, it delivers 28ms first-audio latency. Piper is a close second for pure speed, especially on lower-end hardware, though it offers less natural-sounding output.
Which GPU should I choose for a real-time voice agent?
For a full voice agent stack (ASR + LLM + TTS on one GPU), the RTX 3090 (24 GB) is the best value — it fits Faster-Whisper, a 7B LLM at Q4, and Kokoro TTS comfortably. If you need the absolute lowest latency, the RTX 5090 (32 GB) delivers Blackwell-generation speed with more VRAM headroom.
Can I run XTTS-v2 voice cloning in real-time?
Yes, but with caveats. XTTS-v2’s voice cloning adds latency compared to non-cloning models. On an RTX 3090, expect ~190ms first-audio latency — acceptable for most voice agent use cases. On an RTX 5090, this drops to ~135ms. For latency-critical applications, consider Chatterbox, which offers its own voice cloning at lower latency, or Kokoro if cloning is not required.
How much VRAM do I need for TTS?
Most TTS models are lightweight. Kokoro needs under 1 GB, MeloTTS needs about 1 GB, Chatterbox and F5-TTS need 2–3 GB, and XTTS-v2 needs 4–6 GB. The only VRAM-heavy models are Bark (8–12 GB) and Dia (10–14 GB). For a voice agent stack where TTS runs alongside ASR and an LLM, 24 GB is the practical minimum.
Is self-hosted TTS faster than cloud APIs?
Typically yes — you eliminate network round-trip latency entirely. A self-hosted Kokoro endpoint on a local RTX 5090 delivers 28ms first-audio latency. The same request to a cloud API adds 50–200ms of network latency on top of the model’s own inference time. For latency-sensitive applications, self-hosting is almost always faster.
How were these benchmarks conducted?
All tests run on GigaGPU dedicated bare metal servers with no virtualisation overhead. Each model is tested at FP16 precision using PyTorch with a standard 25-word English input sentence. We discard 10 warm-up runs and report the median of 50 consecutive measurements. The same test prompt and methodology are used across all GPUs and models for consistent comparison.

Available on all servers

  • 1Gbps Port
  • NVMe Storage
  • 128GB DDR4/DDR5
  • Any OS
  • 99.9% Uptime
  • Root/Admin Access

Our dedicated GPU servers provide full hardware resources and a dedicated GPU card, ensuring consistent benchmark-grade performance for your TTS workloads. Deploy Kokoro, XTTS-v2, Bark, Chatterbox, or any open source speech model on the same hardware we use for these benchmarks.

Get in Touch

Not sure which GPU is right for your TTS workload? Our team can help you choose the right configuration based on your model, concurrency needs, and latency targets.

Contact Sales →

Or explore Speech Model Hosting for deployment guides and model-specific setup instructions.

Run These Benchmarks on Your Own Server

Fixed monthly pricing. Dedicated GPU. UK data centre. Deploy the same hardware used in these benchmarks and start generating speech in under an hour.
