A complete voice assistant (mic in, speech out) on one RTX 5060 Ti 16GB via our hosting. No cloud APIs, no round-trip latency, UK data jurisdiction.
Pipeline
Mic audio -> VAD (silence trim)
-> Whisper large-v3-turbo (ASR)
-> Llama 3.1 8B FP8 (reasoning + reply)
-> XTTS v2 (TTS)
-> Speaker audio
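The four stages above can be sketched as a simple sequential loop. The helper functions here are illustrative stubs, not the real Whisper/Llama/XTTS APIs; in the deployed pipeline each would wrap the actual model call.

```python
# Hypothetical glue code for the mic -> VAD -> ASR -> LLM -> TTS pipeline.
# All four helpers are placeholders standing in for the real model calls.

def trim_silence(audio):
    """VAD stand-in: drop samples below a crude energy threshold."""
    return [s for s in audio if abs(s) > 0.01]

def transcribe(audio):
    """Stand-in for Whisper large-v3-turbo ASR."""
    return "turn off the lights"

def generate_reply(text):
    """Stand-in for Llama 3.1 8B generating the assistant reply."""
    return "Okay, turning off the lights."

def synthesize(text):
    """Stand-in for XTTS v2; returns fake PCM bytes."""
    return b"\x00" * len(text)

def assistant_turn(mic_audio):
    speech = trim_silence(mic_audio)          # VAD (silence trim)
    user_text = transcribe(speech)            # ASR
    reply_text = generate_reply(user_text)    # reasoning + reply
    return synthesize(reply_text)             # TTS -> speaker audio

audio_out = assistant_turn([0.0, 0.2, -0.3, 0.0])
```

The sequential structure is what makes the latency budget below additive: each stage starts only after the previous one finishes (the streaming-TTS option later relaxes this).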
VRAM Budget
| Component | VRAM |
|---|---|
| Whisper Turbo INT8 | 1.6 GB |
| Llama 3.1 8B FP8 + FP8 KV | ~10 GB (8k context) |
| XTTS v2 | ~3 GB |
| Headroom | ~1.4 GB |
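A quick sanity check that the three components in the table fit in 16 GB (figures taken from the table; the Llama and XTTS numbers are approximate):

```python
# VRAM budget check against the card's 16 GB, using the table's figures.
budget_gb = {
    "whisper_turbo_int8": 1.6,
    "llama_3_1_8b_fp8_kv8k": 10.0,  # ~10 GB incl. FP8 KV cache at 8k context
    "xtts_v2": 3.0,                  # ~3 GB
}
used = sum(budget_gb.values())
headroom = 16.0 - used
print(f"used={used:.1f} GB, headroom={headroom:.1f} GB")
```

The ~1.4 GB of headroom absorbs CUDA context overhead and fragmentation, which is why the context window is capped at 8k rather than pushed higher.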
Latency Budget
| Stage | Time (10s user utterance) |
|---|---|
| VAD detection | ~100 ms |
| Whisper Turbo transcribe | 180 ms |
| LLM TTFT (prefix-cached system prompt) | 80 ms |
| LLM decode 60 tokens | 540 ms |
| XTTS synthesize 6s audio | 900 ms |
| Total | ~1.8 s |
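Because the pipeline is sequential, the total is just the sum of the stages. A quick check of the table's arithmetic:

```python
# Per-stage latencies from the table, in milliseconds.
stage_ms = {
    "vad_detection": 100,
    "whisper_transcribe": 180,
    "llm_ttft": 80,          # time-to-first-token with prefix-cached system prompt
    "llm_decode_60tok": 540, # ~9 ms/token, i.e. ~111 tok/s decode
    "xtts_6s_audio": 900,
}
total_s = sum(stage_ms.values()) / 1000
print(f"total = {total_s:.1f} s")
```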
Under 2 seconds end-to-end, measured from the end of user speech to the start of reply audio. Close to human-conversational latency.
Optional Layers
- Wake-word detection: openWakeWord on CPU, zero GPU cost
- Streaming TTS: generate audio chunks as LLM streams, drop TTS latency to ~200 ms
- RAG-backed memory: inject retrieved facts as system prompt
- Voice cloning: XTTS with 6s reference clip for persona voice
This is a complete, production-grade voice assistant stack on a single GPU at a flat monthly cost.