Build a full voice pipeline (speech in, speech out) on a single RTX 5060 Ti 16GB via our hosting, with total round-trip latency under 2 seconds.
Components
- Faster-Whisper (Whisper large-v3-turbo INT8) – ASR
- vLLM serving Llama 3.1 8B Instruct FP8 – reasoning / reply
- Coqui XTTS v2 – TTS (or Bark for expressive output)
- FastAPI front – receive audio, return audio
Install
uv pip install faster-whisper TTS fastapi uvicorn httpx pydub
vLLM runs in its own service (port 8000) – see vLLM setup.
Orchestration (FastAPI)
from fastapi import FastAPI, UploadFile, Response
from faster_whisper import WhisperModel
from TTS.api import TTS
import httpx, io

asr = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
# Raise the timeout: generation can exceed httpx's 5 s default.
llm = httpx.AsyncClient(base_url="http://localhost:8000", timeout=60.0)
app = FastAPI()

@app.post("/speak")
async def speak(audio: UploadFile):
    # 1. ASR – faster-whisper accepts a file-like object directly
    segments, _ = asr.transcribe(io.BytesIO(await audio.read()))
    text = " ".join(s.text for s in segments)
    # 2. LLM
    resp = await llm.post("/v1/chat/completions", json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "Brief, friendly voice assistant."},
            {"role": "user", "content": text},
        ],
        "max_tokens": 120,
    })
    resp.raise_for_status()
    reply = resp.json()["choices"][0]["message"]["content"]
    # 3. TTS – XTTS clones the voice in ref.wav (supply your own reference clip)
    wav_bytes = io.BytesIO()
    tts.tts_to_file(text=reply, file_path=wav_bytes, speaker_wav="ref.wav", language="en")
    return Response(wav_bytes.getvalue(), media_type="audio/wav")
Streaming
For <1s perceived latency:
- Stream Whisper (segment-by-segment as audio arrives)
- Stream vLLM response tokens
- Chunk TTS by sentence – start playing as soon as first sentence synthesises
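The sentence-chunking step can be sketched as below. The generator consumes streamed LLM tokens and yields each complete sentence as soon as it closes, so TTS can start on the first sentence while the rest is still generating. The regex splitter is a deliberately naive assumption; a production system would use a proper sentence segmenter.

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences as LLM tokens stream in, so TTS can
    start on the first sentence before the reply is finished."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Naive splitter: break on ., ! or ? followed by whitespace.
        while (m := re.search(r"[.!?]\s", buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()

# Tokens as they might arrive from a streamed completion:
tokens = ["Hel", "lo there. ", "How can ", "I help? ", "Ask away"]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', 'How can I help?', 'Ask away']
```

Each yielded sentence would be handed straight to the TTS model, overlapping synthesis with generation.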
Full duplex setups (interruptible voice) need client-side VAD to detect when the user starts speaking, plus a cancellation channel to stop the LLM and TTS mid-stream.
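A minimal sketch of the cancellation side, assuming the reply runs as an asyncio task that a VAD event can interrupt; `stream_reply` is a stand-in for the real streamed LLM + TTS call, and the VAD itself (client-side) is not shown:

```python
import asyncio

async def stream_reply():
    """Stand-in for a streamed LLM + TTS reply; cancellable at any await."""
    spoken = []
    try:
        for chunk in ["Sure, ", "the answer ", "is ..."]:
            await asyncio.sleep(0.05)   # simulate per-chunk latency
            spoken.append(chunk)
        return "".join(spoken)
    except asyncio.CancelledError:
        # Real code would flush audio buffers / close the TTS stream here.
        raise

async def main():
    reply = asyncio.create_task(stream_reply())
    await asyncio.sleep(0.08)           # VAD fires: user started talking
    reply.cancel()                      # stop the reply mid-stream
    try:
        await reply
    except asyncio.CancelledError:
        print("reply interrupted")

asyncio.run(main())
# prints "reply interrupted"
```

The same pattern works with an httpx streaming response: cancelling the task closes the connection, which signals vLLM to abort generation.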
Voice Pipeline on Blackwell 16GB
ASR + LLM + TTS, under 2 seconds. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: voice assistant, Whisper, Coqui TTS, Whisper API.