
Building a Voice Agent Pipeline on the RTX 5060 Ti 16 GB

Whisper + Llama 3 + Kokoro TTS as a complete voice agent stack on a single RTX 5060 Ti 16 GB. Latency budget, VRAM math, and the tools that wire it together.

A voice agent on a single dedicated GPU sounds ambitious for a 16 GB card. It works — barely — by picking the right small models. This page is the engineering playbook.

TL;DR

On a 5060 Ti 16 GB: Whisper Large-v3 (faster-whisper) + Llama 3.1 8B FP8 + Kokoro TTS = ~14 GB peak VRAM. ~600 ms end-to-end latency. Comfortable for ~6 concurrent voice agents. Above that, upgrade to a 5090.

Anatomy of a voice agent

Three models in series + a turn detector + an orchestrator (a minimal per-turn sketch follows the list):

  1. STT (speech-to-text) — Whisper Large-v3 or faster-whisper
  2. LLM — Llama 3.1 8B / Mistral 7B FP8
  3. TTS (text-to-speech) — Kokoro / XTTS / Bark
  4. Voice Activity Detection — Silero VAD (CPU)
  5. Orchestrator — Pipecat, LiveKit Agents, or custom
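
Conceptually, each user turn flows through those stages in series. A minimal per-turn sketch (illustrative only; the stt, llm, and tts objects stand in for whichever concrete services you run):

def handle_turn(audio_frames, stt, llm, tts, history):
    """One conversational turn, run once the VAD decides the user has stopped speaking."""
    transcript = stt.transcribe(audio_frames)                  # 1. STT: Whisper
    history.append({"role": "user", "content": transcript})

    reply = llm.chat(history)                                  # 2. LLM: Llama 3.1 8B
    history.append({"role": "assistant", "content": reply})

    return tts.synthesize(reply)                               # 3. TTS: Kokoro, audio to play back

In practice the orchestrator streams between stages rather than waiting for each one to finish, so the first TTS audio can start while the LLM is still generating.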

VRAM budget on 16 GB

Component | VRAM | Notes
Whisper Large-v3 (faster-whisper, int8) | ~3 GB | CTranslate2 backend, 4x faster than reference
Llama 3.1 8B FP8 (vLLM) | ~8 GB | Plus KV cache budget
KV cache, ~6 concurrent users at 4K context | ~3 GB | FP8 KV cache
Kokoro TTS (82M params) | ~1 GB | Tiny, fast
Silero VAD | CPU only | 0 GB GPU
Peak VRAM | ~15 GB | Just fits; no headroom

Tight on a 5060 Ti. If you want XTTS instead of Kokoro (~4 GB instead of ~1 GB), you have to drop the LLM to INT4 or move to a card with more VRAM, such as the 5090.
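
For reference, the KV-cache row can be sanity-checked from Llama 3.1 8B's published architecture (32 transformer layers, 8 KV heads under GQA, head dimension 128, 1 byte per value at FP8). The raw footprint for 6 users at 4K context works out to roughly half the ~3 GB budget; the rest is headroom for vLLM's preallocated cache blocks and the occasional longer turn:

# Rough KV-cache arithmetic for Llama 3.1 8B with an FP8 KV cache
layers, kv_heads, head_dim = 32, 8, 128     # Llama 3.1 8B architecture
bytes_per_value = 1                         # FP8 (fp8_e4m3)
users, context_tokens = 6, 4096

# K and V are stored per layer, per KV head, per head-dim element, per token
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value      # 64 KiB
total_gib = users * context_tokens * bytes_per_token / 2**30

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{total_gib:.1f} GiB for {users} users at {context_tokens} tokens")   # ~1.5 GiB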

Setup walkthrough

Three processes on the same host:

# Process 1: vLLM serving Llama 3.1 8B FP8
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 12 --max-model-len 8192 \
  --gpu-memory-utilization 0.55 --port 8000

# Process 2: faster-whisper as REST service
python -m faster_whisper_server --model large-v3 --port 8001 \
  --device cuda --compute-type int8

# Process 3: Kokoro TTS REST service
python -m kokoro_tts_server --port 8002 --device cuda

The --gpu-memory-utilization 0.55 on vLLM is critical — it leaves room for Whisper and Kokoro. Without that bound, vLLM grabs all the VRAM and the other models OOM.
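
Once all three services are up, it is worth confirming they actually fit before pointing traffic at the box. A minimal check, assuming the nvidia-ml-py package (imported as pynvml) is installed; exact per-process attribution depends on your driver:

# Snapshot total and per-process VRAM on GPU 0 after the three services have started
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Used: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")

# vLLM, faster-whisper, and Kokoro should each show up here
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used_gib = (proc.usedGpuMemory or 0) / 2**30
    print(f"pid {proc.pid}: {used_gib:.1f} GiB")

pynvml.nvmlShutdown()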

Wire them together with Pipecat. The snippet below is a simplified sketch: exact service class names and constructor arguments vary by Pipecat version, and a real agent also needs an audio transport (which is what the DailyTransport import is for) plus Pipecat's pipeline runner:

import pipecat
from pipecat.transports.daily import DailyTransport
from pipecat.services.openai import OpenAILLMService
from pipecat.services.whisper import WhisperSTTService
from pipecat.services.kokoro import KokoroTTSService

stt = WhisperSTTService(api_url="http://localhost:8001")
llm = OpenAILLMService(api_url="http://localhost:8000/v1",
                       model="meta-llama/Meta-Llama-3.1-8B-Instruct")
tts = KokoroTTSService(api_url="http://localhost:8002")

agent = pipecat.Pipeline([stt, llm, tts])
agent.run()

Latency budget

Target end-to-end (user stops speaking → first audio): <800 ms. Budget breakdown:

Stage | Latency on 5060 Ti | Notes
VAD endpoint detection | ~150 ms | CPU-bound; tunable
Whisper STT | ~250 ms | For ~3-second utterances
LLM TTFT | ~180 ms | Llama 3.1 8B FP8 with prefix caching
First TTS chunk | ~50 ms | Kokoro is fast
End-to-end | ~630 ms | Hits the sub-800 ms target
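
The LLM TTFT row is easy to verify against your own deployment. A minimal sketch using the openai Python client pointed at the vLLM server from the setup section (port 8000; the prompt is arbitrary):

import time
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=64,
    stream=True,
)

ttft_ms = None
for chunk in stream:
    # The first chunk carrying content marks time-to-first-token
    if ttft_ms is None and chunk.choices and chunk.choices[0].delta.content:
        ttft_ms = (time.perf_counter() - start) * 1000

print(f"TTFT: {ttft_ms:.0f} ms")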

Concurrency & when to upgrade

The bottleneck is VRAM. ~6 concurrent voice agents fit; above that:

  • Upgrade to RTX 5080 — same VRAM but ~40% faster (helps latency, not concurrency)
  • Upgrade to RTX 5090 — 32 GB lets you run ~16 concurrent voice agents on one card
  • Scale horizontally — two 5060 Tis with sticky session routing (see the sketch below)
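
For the horizontal option, the routing can be as simple as hashing the call or room ID to a node, so every turn of a conversation lands on the card that already holds its KV cache and session state. A minimal sketch; the backend URLs and the call-ID field are placeholders:

import hashlib

# Hypothetical backends: one full Whisper + vLLM + Kokoro stack per 5060 Ti
BACKENDS = ["http://gpu-node-a:8000", "http://gpu-node-b:8000"]

def backend_for_call(call_id: str) -> str:
    """Sticky routing: the same call ID always maps to the same node."""
    digest = hashlib.sha256(call_id.encode()).digest()
    return BACKENDS[digest[0] % len(BACKENDS)]

# Every turn of call "room-42" hits the same node for its whole lifetime
print(backend_for_call("room-42"))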

Verdict

The 5060 Ti is the cheapest dedicated GPU we host that runs a complete voice agent stack — Whisper + 8B LLM + TTS — on one card. Tight, but works for small concurrency. For real production voice deployments at scale, the RTX 5090 32 GB is the right home: same setup with 2× concurrent agents and significant headroom.

Bottom line

For a single-card voice agent at £119/mo, the 5060 Ti is the right starting point. Stack Whisper + Llama 3.1 8B FP8 + Kokoro and you have a complete voice stack. Above ~6 concurrent calls, upgrade. For a deeper Whisper-specific deployment guide see Whisper hosting.
