A voice agent on a single dedicated GPU sounds ambitious for a 16 GB card. It works — barely — by picking the right small models. This page is the engineering playbook.
On a 5060 Ti 16 GB: Whisper Large-v3 (faster-whisper) + Llama 3.1 8B FP8 + Kokoro TTS = ~15 GB peak VRAM and ~630 ms end-to-end latency, enough for ~6 concurrent voice agents. Above that, upgrade to a 5090.
## Anatomy of a voice agent
Three models in series + a turn detector + an orchestrator:
- STT (speech-to-text): Whisper Large-v3, run via faster-whisper
- LLM: Llama 3.1 8B or Mistral 7B, quantized to FP8
- TTS (text-to-speech): Kokoro, XTTS, or Bark
- Voice activity detection (the turn detector): Silero VAD, CPU-only (see the sketch after this list)
- Orchestrator: Pipecat, LiveKit Agents, or custom
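Silero VAD is the one component that costs no VRAM at all. A minimal sanity check of speech detection on CPU, assuming a 16 kHz mono WAV on disk (the file name is a placeholder):

```python
# Run Silero VAD on CPU; no GPU memory is touched.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("utterance.wav", sampling_rate=16000)
# Returns [{'start': ..., 'end': ...}, ...] sample offsets for speech runs;
# the agent treats the end of the last run as the turn boundary.
print(get_speech_timestamps(wav, model, sampling_rate=16000))
```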
## VRAM budget on 16 GB
| Component | VRAM | Notes |
|---|---|---|
| Whisper Large-v3 (faster-whisper int8) | ~3 GB | CTranslate2 backend, 4x faster than reference |
| Llama 3.1 8B FP8 (vLLM) | ~8 GB | + KV cache budget |
| KV cache for ~6 concurrent users at 4K context | ~3 GB | FP8 KV cache |
| Kokoro TTS (82M params) | ~1 GB | Tiny, fast |
| Silero VAD | CPU only | 0 GB GPU |
| Peak VRAM | ~15 GB | Just fits — no headroom |
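The KV-cache row can be sanity-checked from Llama 3.1 8B's architecture (32 transformer layers, 8 KV heads under GQA, head dim 128). A back-of-envelope sketch:

```python
# KV-cache sizing for Llama 3.1 8B with an FP8 (1-byte) KV cache.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 1
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(bytes_per_token)          # 65536 -> 64 KiB per token

users, context = 6, 4096
total_gib = users * context * bytes_per_token / 2**30
print(f"{total_gib:.1f} GiB")   # ~1.5 GiB raw
```

The raw arithmetic comes to ~1.5 GiB, so the ~3 GB line in the table leaves roughly 2x headroom for longer contexts and allocator overhead.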
Tight on a 5060 Ti. If you want XTTS instead of Kokoro (~4 GB vs ~1 GB), you have to drop the LLM to INT4 or move up to a card with more VRAM, i.e. a 5090.
## Setup walkthrough
Three processes on the same host:
```bash
# Process 1: vLLM serving Llama 3.1 8B FP8
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 12 --max-model-len 8192 \
  --gpu-memory-utilization 0.55 --port 8000

# Process 2: faster-whisper as a REST service
python -m faster_whisper_server --model large-v3 --port 8001 \
  --device cuda --compute-type int8

# Process 3: Kokoro TTS REST service
python -m kokoro_tts_server --port 8002 --device cuda
```
The `--gpu-memory-utilization 0.55` flag on vLLM is critical: it caps vLLM at roughly 55% of the card, leaving room for Whisper and Kokoro. Without that bound, vLLM grabs nearly all the VRAM at startup and the other models OOM.
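Before sending traffic, it's worth confirming the split actually holds. A quick check with the NVML Python bindings (assumes `pip install nvidia-ml-py`):

```python
# Print total and per-process VRAM usage to confirm vLLM, Whisper, and
# Kokoro all fit under 16 GB.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(gpu)
print(f"total used: {mem.used / 2**30:.1f} GiB")

for p in pynvml.nvmlDeviceGetComputeRunningProcesses(gpu):
    used = (p.usedGpuMemory or 0) / 2**30  # can be None on some drivers
    print(f"pid {p.pid}: {used:.1f} GiB")
pynvml.nvmlShutdown()
```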
Wire them together with Pipecat. The STT/TTS service classes below are illustrative stand-ins; check the exact class names and import paths in the Pipecat release you're running:

```python
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.openai import OpenAILLMService
# Stand-in wrappers for the two local REST services started above.
from pipecat.services.whisper import WhisperSTTService
from pipecat.services.kokoro import KokoroTTSService

stt = WhisperSTTService(api_url="http://localhost:8001")
llm = OpenAILLMService(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible API
    api_key="unused",                     # vLLM ignores keys by default
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
)
tts = KokoroTTSService(api_url="http://localhost:8002")

# A real deployment adds an audio transport (e.g. DailyTransport) at
# each end of the pipeline; omitted here to keep the sketch minimal.
asyncio.run(PipelineRunner().run(PipelineTask(Pipeline([stt, llm, tts]))))
```
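Keeping the three models in separate processes is deliberate: each service can be restarted or upgraded independently, and a fault in one does not take the whole stack down with it.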
## Latency budget
Target end-to-end (user stops speaking → first audio): <800 ms. Budget breakdown:
| Stage | Latency on 5060 Ti | Notes |
|---|---|---|
| VAD endpoint detection | ~150 ms | CPU-bound; tunable |
| Whisper STT | ~250 ms | For ~3-second utterances |
| LLM TTFT | ~180 ms | Llama 3.1 8B FP8 with prefix caching |
| First TTS chunk | ~50 ms | Kokoro is fast |
| End-to-end | ~630 ms | Hits sub-800ms target |
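The ~50 ms first-chunk number assumes sentence-level streaming: the LLM reply is flushed to TTS one sentence at a time, so synthesis starts after the first sentence rather than the full response. A minimal sketch against vLLM's OpenAI-compatible endpoint; the `synthesize()` helper is a hypothetical stand-in for a Kokoro client call:

```python
# Stream tokens from vLLM and flush complete sentences to TTS so the
# first audio chunk plays after the first sentence, not the full reply.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def synthesize(text: str) -> None:
    """Hypothetical stand-in for a call to the Kokoro TTS service."""
    print(f"TTS <- {text!r}")

buf = ""
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hi in two sentences."}],
    stream=True,
)
for chunk in stream:
    buf += chunk.choices[0].delta.content or ""
    while (m := re.search(r"[.!?]\s", buf)):  # flush at sentence ends
        synthesize(buf[: m.end()].strip())
        buf = buf[m.end():]
if buf.strip():
    synthesize(buf.strip())
```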
## Concurrency & when to upgrade
The bottleneck is VRAM. ~6 concurrent voice agents fit; above that:
- Upgrade to an RTX 5080: same 16 GB of VRAM but ~40% faster, which helps latency, not concurrency
- Upgrade to an RTX 5090: 32 GB lets you run ~16 concurrent voice agents on one card
- Scale horizontally: two 5060 Tis behind sticky session routing, as sketched below
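Sticky routing only has to keep each call on the same card for its lifetime, so a hash of the session ID is enough. A minimal sketch with illustrative hostnames:

```python
# Pin each voice session to one GPU host for its lifetime by hashing
# the session ID. Host URLs are placeholders.
import hashlib

HOSTS = ["http://gpu-0:8000", "http://gpu-1:8000"]

def host_for(session_id: str) -> str:
    digest = hashlib.sha256(session_id.encode()).digest()
    return HOSTS[digest[0] % len(HOSTS)]

assert host_for("call-123") == host_for("call-123")  # stable across requests
```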
## Verdict
The 5060 Ti is the cheapest dedicated GPU we host that runs a complete voice agent stack — Whisper + 8B LLM + TTS — on one card. Tight, but works for small concurrency. For real production voice deployments at scale, the RTX 5090 32 GB is the right home: same setup with 2× concurrent agents and significant headroom.
## Bottom line
For a single-card voice agent at £119/mo, the 5060 Ti is the right starting point. Stack Whisper + Llama 3.1 8B FP8 + Kokoro and you have a complete voice stack. Above ~6 concurrent calls, upgrade. For a deeper Whisper-specific deployment guide see Whisper hosting.