You will build a voice agent that listens to a spoken question, transcribes it with Whisper, generates an answer with a self-hosted LLM, and speaks the response aloud using Coqui TTS. The complete pipeline runs on a single GPU: no cloud speech APIs, no third-party transcription services, no data leaving your server. The end result is a telephone-style AI assistant that completes a voice-to-voice round trip in roughly 2-4 seconds end-to-end on dedicated GPU infrastructure.
Pipeline Architecture
| Stage | Tool | Input | Output | VRAM |
|---|---|---|---|---|
| 1. Speech-to-Text | Whisper Large v3 | Audio (WAV/MP3) | Transcribed text | ~3GB |
| 2. Reasoning | LLaMA 3.1 8B (Q4) | Transcribed text | Response text | ~6GB |
| 3. Text-to-Speech | Coqui XTTS v2 | Response text | Audio (WAV) | ~2GB |
Total VRAM: approximately 11GB, which fits on a single 24GB GPU with headroom for KV cache and batching. Load all three models at startup and keep them resident; swapping models in and out of VRAM per request adds seconds of overhead.
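The budget in the table can be sanity-checked in a few lines (per-model figures are the approximations from the table, not measured values):

```python
# Approximate VRAM per stage, in GB (figures from the table above)
vram_gb = {"whisper_large_v3": 3, "llama_3_1_8b_q4": 6, "xtts_v2": 2}

total = sum(vram_gb.values())   # 11GB resident across all three models
headroom = 24 - total           # 13GB spare on a 24GB card for KV cache/batching
print(total, headroom)
```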
Environment and Model Setup
```bash
# Install dependencies
pip install faster-whisper vllm TTS fastapi python-multipart uvicorn soundfile

# Start vLLM for the reasoning LLM as a background service
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --quantization gptq \
    --gpu-memory-utilization 0.3 \
    --port 8000 &
```
The Whisper model and Coqui TTS load within the application process, while vLLM runs as a separate service. `--gpu-memory-utilization 0.3` caps vLLM at roughly 30% of the card (about 7GB on a 24GB GPU), leaving VRAM for the other two models. Note that `--quantization gptq` expects a GPTQ-quantized checkpoint; the base `meta-llama/Llama-3.1-8B-Instruct` weights are FP16, so point `--model` at a GPTQ variant of the model, or drop the flag and accept the larger FP16 footprint.
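vLLM takes a while to load the model, so it is worth confirming the server is up before accepting traffic. The OpenAI-compatible server exposes a `/v1/models` endpoint; a small probe (the helper name is ours, not part of vLLM) can gate application startup:

```python
import requests

def server_ready(base_url: str = "http://localhost:8000") -> bool:
    """Return True once the vLLM OpenAI-compatible server responds."""
    try:
        r = requests.get(f"{base_url}/v1/models", timeout=2)
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

# Poll until the model has finished loading:
# while not server_ready():
#     time.sleep(5)
```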
Stage 1: Speech-to-Text with Whisper
```python
from faster_whisper import WhisperModel

# Load Whisper once at startup (uses ~3GB VRAM)
whisper_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_audio(audio_path: str) -> str:
    segments, info = whisper_model.transcribe(audio_path, beam_size=5)
    # Segment texts carry leading whitespace; strip each before joining
    text = " ".join(seg.text.strip() for seg in segments)
    return text.strip()

# Example: transcribe a customer query
transcript = transcribe_audio("/tmp/customer_query.wav")
# Output: "What are your opening hours on bank holidays?"
```
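Whisper transcripts can still arrive with uneven spacing around segment boundaries, which wastes LLM tokens and can confuse short prompts. A small normaliser (our own helper, not part of faster-whisper) keeps the text tidy before it reaches the LLM:

```python
import re

def normalise_transcript(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the result."""
    return re.sub(r"\s+", " ", text).strip()

normalise_transcript("  What are  your opening hours\non bank holidays? ")
# → "What are your opening hours on bank holidays?"
```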
Stage 2: LLM Reasoning
```python
import requests

def generate_response(transcript: str, context: str = "") -> str:
    response = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a helpful voice assistant. "
                 "Keep responses concise (under 3 sentences) as they will be spoken aloud."},
                {"role": "user", "content": transcript}
            ],
            "max_tokens": 150,
            "temperature": 0.3
        },
        timeout=30,
    )
    response.raise_for_status()  # Fail loudly if vLLM is down or overloaded
    return response.json()["choices"][0]["message"]["content"]
```
The system prompt instructs the model to keep responses short, since long text produces awkwardly long audio. For domain-specific answers, inject retrieved context into the prompt (for example from a ChromaDB collection); the currently unused `context` parameter in `generate_response` is the hook for this.
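As a sketch of that RAG hook: retrieved chunks can be folded into the system prompt before the chat-completions call. The helper below is our own illustration, not part of the tutorial's code; with ChromaDB, `context_chunks` would typically come from `collection.query(query_texts=[transcript], n_results=3)["documents"][0]`.

```python
def build_messages(transcript: str, context_chunks: list[str]) -> list[dict]:
    """Fold retrieved document chunks into the voice assistant's system prompt."""
    system = ("You are a helpful voice assistant. Keep responses concise "
              "(under 3 sentences) as they will be spoken aloud.")
    if context_chunks:
        system += "\nAnswer using this context:\n" + "\n".join(context_chunks)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": transcript},
    ]

msgs = build_messages("When are you open?", ["Bank holiday hours: 10 AM to 4 PM."])
```

The resulting `msgs` list drops straight into the `"messages"` field of the chat-completions request.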
Stage 3: Text-to-Speech with Coqui
```python
from TTS.api import TTS

# Load Coqui XTTS v2 once at startup (~2GB VRAM)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def synthesise_speech(text: str, output_path: str,
                      speaker_wav: str = "reference_voice.wav") -> str:
    # XTTS v2 is a voice-cloning model: it needs a short reference clip
    # (a few seconds of clean speech) to define the output voice. The
    # default path here is a placeholder; point it at your own recording.
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=speaker_wav,
        language="en"
    )
    return output_path

# Generate spoken response in the cloned reference voice
audio_file = synthesise_speech(
    "Our bank holiday hours are 10 AM to 4 PM.",
    "/tmp/response.wav"
)
```
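XTTS synthesis time grows with input length, so chunking the response by sentence lets playback start while later sentences are still being synthesised. A minimal splitter is sketched below; the helper is ours, and production code would use a proper sentence tokenizer rather than a regex:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence split on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

split_sentences("We open at 10 AM. We close at 4 PM.")
# → ["We open at 10 AM.", "We close at 4 PM."]
```

Each sentence can then be passed to `synthesise_speech` individually and queued for playback as it completes.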
Combined Voice API
```python
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask
import tempfile, os

app = FastAPI()

@app.post("/voice-query")
async def voice_query(audio: UploadFile):
    # Save uploaded audio to a temp file
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(await audio.read())
        input_path = f.name

    # Pipeline: STT -> LLM -> TTS
    transcript = transcribe_audio(input_path)
    response_text = generate_response(transcript)
    output_path = input_path.replace(".wav", "_response.wav")
    synthesise_speech(response_text, output_path)

    os.unlink(input_path)
    # Delete the generated file once the response has been sent
    return FileResponse(
        output_path,
        media_type="audio/wav",
        background=BackgroundTask(os.unlink, output_path),
    )
```
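A client exercises the endpoint by POSTing a WAV file and saving the returned audio. The URL, port, and file names below are illustrative; use whatever host and port you pass to uvicorn:

```python
import requests

def query_voice_agent(url: str, wav_path: str, out_path: str = "answer.wav") -> str:
    """POST a recorded question to /voice-query and save the spoken reply."""
    with open(wav_path, "rb") as f:
        r = requests.post(url, files={"audio": ("query.wav", f, "audio/wav")},
                          timeout=60)
    r.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(r.content)
    return out_path

# query_voice_agent("http://localhost:8001/voice-query", "question.wav")
```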
The complete voice-to-voice round trip typically completes in 2-4 seconds on an RTX 6000 Pro GPU. Teams deploying customer-facing assistants can integrate this with telephony systems via WebSocket for real-time streaming.
Latency Optimisation
For production voice agents, reduce latency by:

- using the CTranslate2-based faster-whisper backend (as in Stage 1) for a 2-4x speed gain over reference Whisper;
- streaming TTS output chunk-by-chunk rather than waiting for full synthesis;
- running the LLM with smaller quantised models when sub-second reasoning is needed;
- pre-loading all models at startup so the first request is not penalised.

Monitor end-to-end latency per stage to identify bottlenecks.
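Per-stage latency monitoring can be as simple as a timing context manager wrapped around each pipeline call (a sketch, not a production metrics setup; in practice you would export these numbers to your monitoring stack):

```python
import time
from contextlib import contextmanager

stage_timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock seconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage] = time.perf_counter() - start

# Usage inside the /voice-query handler:
with timed("stt"):
    time.sleep(0.01)  # stands in for transcribe_audio(...)
print(stage_timings)
```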