Real-time voice agents need TTS that starts speaking before the LLM has finished generating the full response; streaming TTS is the pattern that makes this possible.
Use Kokoro for sub-100 ms time-to-first-audio and stream audio chunks to the client over a WebSocket. The pipeline: LLM token stream → sentence buffer → TTS chunk → WebSocket → client. End-to-end, first audio arrives in roughly 60-100 ms.
Architecture
- LLM produces token stream
- Sentence-end detector buffers until punctuation
- Buffered sentence sent to TTS
- TTS generates audio chunks (200-500 ms each)
- Chunks streamed to client over WebSocket
- Client plays incrementally
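The second step, sentence-end detection, can be sketched as a small generator that accumulates tokens until it sees end-of-sentence punctuation. This is a minimal illustration, not a production segmenter; the regex and the `sentences` helper are assumptions for this sketch.

```python
import re

# Matches ., !, or ? at the end of the buffer, optionally followed by a
# closing quote/bracket and trailing whitespace.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

def sentences(token_stream):
    """Buffer an incremental LLM token stream and yield complete sentences,
    so each sentence can be handed to the TTS engine as soon as it ends."""
    buf = ""
    for token in token_stream:
        buf += token
        if SENTENCE_END.search(buf):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing partial sentence
        yield buf.strip()

# Example: tokens as an LLM might emit them
tokens = ["Hel", "lo the", "re. ", "How are", " you?"]
print(list(sentences(tokens)))  # → ['Hello there.', 'How are you?']
```

A production version would also need to handle abbreviations ("Dr.", "e.g.") and decimal numbers, which this regex would wrongly treat as sentence ends.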
Best models for streaming
- Kokoro: ~60 ms to first chunk; the fastest
- XTTS v2: ~120 ms to first chunk; supports voice cloning
- Bark: ~250 ms to first chunk; the most natural-sounding
Verdict
For sub-100 ms TTS streaming, Kokoro is the only credible open-weight option. Pair it with WebSocket transport.
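The server-side delivery loop can be sketched with asyncio: forward each audio chunk to the client the moment the engine produces it, so playback starts on the first chunk rather than after full synthesis. Here `synthesize_chunks` is a stand-in for a real engine such as Kokoro, and `send` stands in for a WebSocket library's `send()` coroutine (e.g. from `websockets`); both names are assumptions for illustration, not real APIs.

```python
import asyncio

async def synthesize_chunks(sentence: str):
    """Stand-in TTS engine: yields audio chunks (~200-500 ms each) as generated."""
    for i in range(3):
        await asyncio.sleep(0)  # a real engine yields as soon as audio is ready
        yield f"{sentence}-chunk{i}".encode()

async def stream_sentence(sentence: str, send) -> None:
    """Forward each chunk over the socket immediately; the client can
    begin playback as soon as the first chunk arrives."""
    async for chunk in synthesize_chunks(sentence):
        await send(chunk)

# Illustrative run with an in-memory stand-in for the socket
received = []

async def fake_send(chunk: bytes) -> None:
    received.append(chunk)

asyncio.run(stream_sentence("hello", fake_send))
print(received)  # → [b'hello-chunk0', b'hello-chunk1', b'hello-chunk2']
```

The key design point is that synthesis and transport overlap: the second chunk is being generated while the first is already in flight, which is what keeps time-to-first-audio decoupled from total utterance length.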
Bottom line
Streaming TTS makes voice agents feel real-time. See voice agent latency.