Real-time voice agents need TTS that starts speaking before the LLM has finished generating the full response; streaming TTS is the pattern that makes this possible.
Use Kokoro for sub-100 ms time-to-first-audio and stream audio chunks to the client over a WebSocket. The pipeline: LLM token stream → sentence buffer → TTS chunk → WebSocket → client. End-to-end, first audio arrives in roughly 60-100 ms.
Architecture
- LLM produces token stream
- Sentence-end detector buffers until punctuation
- Buffered sentence sent to TTS
- TTS generates audio chunks (200-500 ms each)
- Chunks streamed to client over WebSocket
- Client plays incrementally
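The second step, sentence-end detection, can be sketched as a small generator that accumulates tokens until it sees end-of-sentence punctuation. This is a minimal illustration, not a production segmenter; the regex and the `sentences` helper are assumptions for this sketch.

```python
import re

# Matches ., !, or ? at the end of the buffer, optionally followed by a
# closing quote/bracket and trailing whitespace.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

def sentences(token_stream):
    """Buffer an incremental LLM token stream and yield complete sentences,
    so each sentence can be handed to the TTS engine as soon as it ends."""
    buf = ""
    for token in token_stream:
        buf += token
        if SENTENCE_END.search(buf):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing partial sentence
        yield buf.strip()

# Example: tokens as an LLM might emit them
tokens = ["Hel", "lo the", "re. ", "How are", " you?"]
print(list(sentences(tokens)))  # → ['Hello there.', 'How are you?']
```

A production version would also need to handle abbreviations ("Dr.", "e.g.") and decimal numbers, which this regex would wrongly treat as sentence ends.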
Best models for streaming
- Kokoro: ~60 ms to first chunk; the fastest
- XTTS v2: ~120 ms to first chunk; supports voice cloning
- Bark: ~250 ms to first chunk; the most natural-sounding
Verdict
For sub-100 ms TTS streaming, Kokoro is the only credible open-weight option. Pair it with WebSocket transport.
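The server-side delivery loop can be sketched with asyncio: forward each audio chunk to the client the moment the engine produces it, so playback starts on the first chunk rather than after full synthesis. Here `synthesize_chunks` is a stand-in for a real engine such as Kokoro, and `send` stands in for a WebSocket library's `send()` coroutine (e.g. from `websockets`); both names are assumptions for illustration, not real APIs.

```python
import asyncio

async def synthesize_chunks(sentence: str):
    """Stand-in TTS engine: yields audio chunks (~200-500 ms each) as generated."""
    for i in range(3):
        await asyncio.sleep(0)  # a real engine yields as soon as audio is ready
        yield f"{sentence}-chunk{i}".encode()

async def stream_sentence(sentence: str, send) -> None:
    """Forward each chunk over the socket immediately; the client can
    begin playback as soon as the first chunk arrives."""
    async for chunk in synthesize_chunks(sentence):
        await send(chunk)

# Illustrative run with an in-memory stand-in for the socket
received = []

async def fake_send(chunk: bytes) -> None:
    received.append(chunk)

asyncio.run(stream_sentence("hello", fake_send))
print(received)  # → [b'hello-chunk0', b'hello-chunk1', b'hello-chunk2']
```

The key design point is that synthesis and transport overlap: the second chunk is being generated while the first is already in flight, which is what keeps time-to-first-audio decoupled from total utterance length.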
Bottom line
Streaming TTS makes voice agents feel real-time. See voice agent latency.