
Self-Hosted TTS Streaming Architecture: Sub-100ms First Audio

How to architect a TTS streaming pipeline that delivers first audio in under 100ms — chunk generation, WebSocket streaming, and the gotchas.

Real-time voice agents need TTS that starts speaking before the LLM has finished generating the full response. Streaming TTS is the pattern that makes this possible.

TL;DR

Use Kokoro for sub-100ms first-audio. Stream chunks over WebSocket. Pipeline: LLM token stream → sentence buffer → TTS chunk → WebSocket → client. End-to-end first-audio: 60-100 ms.

Architecture

  1. LLM produces token stream
  2. Sentence-end detector buffers until punctuation
  3. Buffered sentence sent to TTS
  4. TTS generates audio chunks (200-500 ms each)
  5. Chunks streamed to client over WebSocket
  6. Client plays incrementally
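Steps 1-3 above hinge on the sentence-end detector: buffer LLM tokens until punctuation, then flush a complete sentence to TTS. A minimal sketch, with an illustrative regex (not from any particular library):

```python
import re

# Flush when the buffer ends in sentence-final punctuation.
SENTENCE_END = re.compile(r"[.!?]\s*$")

def sentence_chunks(token_stream):
    """Yield complete sentences from an incremental LLM token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():          # flush any trailing partial sentence
        yield buffer.strip()

# Tokens as an LLM might emit them:
tokens = ["Hello", " there", ".", " How", " are", " you", "?"]
print(list(sentence_chunks(tokens)))  # → ['Hello there.', 'How are you?']
```

In production you'd also want a max-buffer timeout so a long clause without punctuation doesn't stall the pipeline.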

Best models for streaming

  • Kokoro: ~60 ms first chunk, fastest
  • XTTS v2: ~120 ms first chunk, voice cloning
  • Bark: ~250 ms first chunk, naturalness

Verdict

For sub-100ms TTS streaming, Kokoro is the only credible open-weight option. Pair with WebSocket transport.
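Each 200-500 ms audio chunk maps naturally onto one binary WebSocket frame. A rough sketch of the chunk framing, assuming 16-bit mono PCM at 24 kHz (Kokoro's output rate; adjust the constants for your model):

```python
SAMPLE_RATE = 24_000          # samples per second (assumed Kokoro default)
BYTES_PER_SAMPLE = 2          # 16-bit PCM
CHUNK_MS = 250                # within the 200-500 ms range above
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 12000

def pcm_chunks(pcm: bytes):
    """Yield fixed-duration audio chunks; the final chunk may be shorter."""
    for i in range(0, len(pcm), CHUNK_BYTES):
        yield pcm[i:i + CHUNK_BYTES]

one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)   # 1 s of silence
print([len(c) for c in pcm_chunks(one_second)])      # four 12000-byte chunks
```

Send each chunk as a binary frame as soon as TTS emits it; the client can begin playback after the first frame rather than waiting for the whole sentence.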

Bottom line

Streaming TTS makes voice agents feel real-time. For the full latency budget, see our voice agent latency guide.
