Sub-second voice agents feel natural; sub-500 ms agents feel uncannily human. The difference is twelve specific optimisations across the pipeline.
Combine VAD-driven endpointing, streaming Whisper, an FP8 LLM with prefix caching, fast chunked TTS streaming (Kokoro), and a local LLM. Net result: ~480 ms end-to-end on a 5090.
The latency budget
| Stage | Naive | Optimised | Saving |
|---|---|---|---|
| VAD endpointing | 300 ms | 120 ms | 180 ms |
| Whisper STT | 500 ms | 180 ms | 320 ms |
| LLM TTFT | 300 ms | 120 ms | 180 ms |
| TTS first chunk | 400 ms | 60 ms | 340 ms |
| Total | 1.5 s | ~480 ms | ~1 s saved |
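As a sanity check, the per-stage budgets in the table really do sum to the claimed totals. A minimal script, with the values copied straight from the rows above:

```python
# Per-stage latency in milliseconds, taken from the budget table above.
naive = {"vad_endpointing": 300, "whisper_stt": 500,
         "llm_ttft": 300, "tts_first_chunk": 400}
optimised = {"vad_endpointing": 120, "whisper_stt": 180,
             "llm_ttft": 120, "tts_first_chunk": 60}

naive_total = sum(naive.values())          # 1500 ms = 1.5 s
optimised_total = sum(optimised.values())  # 480 ms
saved = naive_total - optimised_total      # 1020 ms, i.e. ~1 s saved
```

Note the totals assume the stages run back-to-back; in practice some stages overlap (STT runs during endpointing), which is part of how the optimised pipeline gets under 500 ms.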
Twelve specific optimisations
- Silero VAD with aggressive threshold — 120 ms endpointing
- faster-whisper INT8 — 4× faster than reference Whisper
- Whisper-Streaming with 200 ms chunks — overlap STT with VAD
- FP8 LLM weights — halves prefill time on Blackwell
- vLLM prefix caching — reuses system prompt KV
- FP8 KV cache — fits more context in pool
- Speculative decoding — 1.5× faster decoding
- Streaming TTS (Kokoro) — first audio chunk in 60 ms
- Pre-generate intro audio — first sentence cached
- Local LLM — eliminates cross-region RTT
- RTX 5090 32 GB — all models hot, no VRAM swap
- Pipecat over LiveKit — lower orchestration overhead
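The first optimisation, aggressive endpointing, amounts to a small state machine over per-frame VAD decisions. The sketch below is illustrative, not Silero's actual API: it assumes a VAD emitting one speech/silence boolean per 20 ms frame and declares end-of-turn after 120 ms (six frames) of consecutive silence, matching the budget table.

```python
# Hedged sketch: endpointing over 20 ms VAD frames. The frame size and
# the 120 ms threshold are assumptions matching the latency table, not
# values dictated by Silero VAD itself.
FRAME_MS = 20
ENDPOINT_MS = 120  # aggressive threshold; larger values feel laggier

def endpoint_index(vad_frames):
    """Return the index of the first frame of the trailing silence run
    that triggers end-of-turn, or None if the turn has not ended."""
    needed = ENDPOINT_MS // FRAME_MS  # 6 consecutive silent frames
    silent = 0
    for i, is_speech in enumerate(vad_frames):
        silent = 0 if is_speech else silent + 1
        if silent == needed:
            return i - needed + 1
    return None

# Ten frames of speech, then six of silence: endpoint fires at frame 10.
turn_end = endpoint_index([True] * 10 + [False] * 6)
```

The trade-off is false endpointing on mid-sentence pauses; production systems typically pair a short threshold like this with the LLM's ability to handle a barge-in if the user resumes speaking.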
Verdict
Sub-500 ms voice agents are achievable on dedicated GPU hardware in 2026. Most teams stop at sub-1 s because the final 500 ms is hard work: it takes all twelve optimisations working together across the pipeline.
Bottom line
For voice agents that feel human, latency budget matters more than model quality. See voice agent deployment.