Voice agents fail at the seams — STT to LLM, LLM to TTS, TTS to network, network to phone — even when each component is fine. Production deployment is mostly the orchestration layer.
For self-hosted voice: Whisper Large-v3 + Llama 3.1 8B FP8 + Kokoro TTS on a 5090, orchestrated by Pipecat or LiveKit Agents, fronted by Twilio Media Streams for telephony. Sub-1 s end-to-end latency is achievable.
Full stack
- STT: faster-whisper Large-v3, INT8
- VAD: Silero VAD (CPU)
- LLM: vLLM serving Llama 3.1 8B FP8
- TTS: Kokoro (or XTTS for voice cloning)
- Orchestrator: Pipecat or LiveKit Agents
- Telephony: Twilio Media Streams or LiveKit Cloud (or self-hosted FreeSWITCH for SIP)
- Hardware: RTX 5090 32 GB
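As a rough check that this stack fits on one card, a back-of-envelope VRAM budget (all figures are estimates, not measurements):

```python
# Rough single-call VRAM budget for the stack above, in GB.
# All figures are back-of-envelope estimates, not measured numbers.
STACK_GB = {
    "whisper_large_v3_int8": 1.6,  # ~1.55B params at ~1 byte/param
    "llama_3_1_8b_fp8": 8.5,       # ~8B params at 1 byte/param + overhead
    "kv_cache_one_call": 1.0,      # a few thousand tokens of context
    "kokoro_tts": 0.4,             # ~82M params plus runtime buffers
    "cuda_runtime_overhead": 2.0,  # contexts, workspaces, fragmentation
}

total = sum(STACK_GB.values())
print(f"estimated total: {total:.1f} GB, headroom on 32 GB: {32 - total:.1f} GB")
```

A single call fits with room to spare; the pressure comes from concurrency (see the verdict below).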
Orchestration: Pipecat or LiveKit
- Pipecat: Python-first, lightweight, easy to customise. Best for custom voice flows.
- LiveKit Agents: full WebRTC stack, mature media handling. Best for browser-based voice.
For phone agents (Twilio), Pipecat is more popular. For browser agents, LiveKit.
Telephony
Three options:
- Twilio Media Streams: easiest, US-hosted, ~$0.0085/min for inbound
- LiveKit Cloud: media + signaling managed
- Self-hosted SIP (FreeSWITCH / Asterisk): full control, more operational complexity
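With Twilio Media Streams, inbound audio arrives over a WebSocket as JSON messages whose `media.payload` field is base64-encoded 8 kHz G.711 μ-law. A sketch of decoding one frame to 16-bit PCM for the STT stage — μ-law expansion done in pure Python, since the stdlib `audioop` module was removed in Python 3.13:

```python
import base64
import json

def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF
    t = ((u & 0x0F) << 3) + 0x84
    t <<= (u & 0x70) >> 4
    return 0x84 - t if u & 0x80 else t - 0x84

def decode_media_frame(message: str) -> list[int]:
    """Parse one Twilio Media Streams 'media' message into PCM samples."""
    frame = json.loads(message)
    assert frame["event"] == "media"
    payload = base64.b64decode(frame["media"]["payload"])
    return [ulaw_to_pcm16(b) for b in payload]

# Example: a frame of mu-law silence (0xFF bytes) decodes to PCM zeros.
msg = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\xff\xff\xff").decode()},
})
print(decode_media_frame(msg))  # [0, 0, 0]
```

The outbound direction is the mirror image: μ-law-encode your TTS output, base64 it, and send it back as a `media` message on the same socket.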
Latency budget
Target: end-to-end <800 ms (user stops speaking → first audio out). On a 5090:
| Stage | Latency |
|---|---|
| Network jitter buffer | ~80 ms |
| VAD endpointing | ~120 ms |
| Whisper STT | ~180 ms |
| LLM TTFT | ~120 ms |
| TTS first chunk | ~80 ms |
| Network return | ~50 ms |
| Total | ~630 ms |
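The budget is just addition, but it is worth encoding so that a change to any stage immediately shows the new total against the 800 ms target:

```python
# Per-stage latency estimates from the table above, in milliseconds.
BUDGET_MS = {
    "network_jitter_buffer": 80,
    "vad_endpointing": 120,
    "whisper_stt": 180,
    "llm_ttft": 120,
    "tts_first_chunk": 80,
    "network_return": 50,
}
TARGET_MS = 800

total = sum(BUDGET_MS.values())
print(f"total: {total} ms, margin: {TARGET_MS - total} ms")  # total: 630 ms, margin: 170 ms
```

The 170 ms margin is what absorbs a slow LLM turn or a long utterance; if any single stage regresses by more than that, the agent starts to feel laggy.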
Verdict
Self-hosted voice agents at production quality need the 5090's 32 GB to fit the full stack on one card. For 16 concurrent calls, plan for 2× cards or step up to a 6000 Pro.
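The 2×-card recommendation falls out of rough KV-cache arithmetic, assuming Llama 3.1 8B's GQA shape (32 layers, 8 KV heads, head dim 128) and an FP16 KV cache; the weight and overhead figures are estimates:

```python
# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2   # Llama 3.1 8B, FP16 KV
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # bytes/token

calls, ctx_tokens = 16, 8192                        # 16 calls, 8k context each
kv_total_gb = calls * ctx_tokens * kv_per_token / 2**30

weights_gb = 8.5  # Llama 3.1 8B FP8, rough
other_gb = 4.0    # Whisper + Kokoro + CUDA overhead, rough
total = kv_total_gb + weights_gb + other_gb
print(f"KV cache: {kv_total_gb:.0f} GB, stack total: {total:.1f} GB on a 32 GB card")
```

At 16 calls × 8k context the KV cache alone is ~16 GB, leaving almost no headroom on a single 32 GB card once weights and the STT/TTS models are loaded — hence two cards, or one card with more memory.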
Bottom line
Voice is the workload where dedicated hardware in your region pays back the most — every cross-region hop adds 80+ ms of unavoidable latency. See Whisper hosting and Coqui XTTS deployment.