
Voice Agent Pipeline with Whisper, LLM, and Coqui TTS

Build a voice-to-voice AI agent combining Whisper for speech recognition, an LLM for reasoning, and Coqui TTS for speech synthesis — all self-hosted on a single GPU server.

You will build a voice agent that listens to a spoken question, transcribes it with Whisper, generates an answer with a self-hosted LLM, and speaks the response aloud using Coqui TTS. The complete pipeline runs on a single GPU — no cloud speech APIs, no third-party transcription services, no data leaving your server. The end result: a telephone-style AI assistant that processes voice in roughly 2-4 seconds end-to-end on dedicated GPU infrastructure.

Pipeline Architecture

Stage | Tool | Input | Output | VRAM
1. Speech-to-Text | Whisper Large v3 | Audio (WAV/MP3) | Transcribed text | ~3GB
2. Reasoning | LLaMA 3.1 8B (Q4) | Transcribed text | Response text | ~6GB
3. Text-to-Speech | Coqui XTTS v2 | Response text | Audio (WAV) | ~2GB

Total VRAM: approximately 11GB. Fits on a single 24GB GPU with room for batching. Load all three models simultaneously to avoid swap overhead.
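The budget above can be sanity-checked with a few lines of arithmetic. The per-model figures are the table's estimates; the 2GB headroom for activations and batching is an assumption, not a measured value:

```python
# Per-stage VRAM estimates from the table above (GB)
vram_gb = {"whisper_large_v3": 3, "llama_3_1_8b_q4": 6, "xtts_v2": 2}

def fits_on_gpu(budget_gb: float, headroom_gb: float = 2.0) -> bool:
    """Check that all three models plus activation headroom fit in VRAM."""
    return sum(vram_gb.values()) + headroom_gb <= budget_gb

print(fits_on_gpu(24))  # True: 11GB of models + 2GB headroom on a 24GB card
print(fits_on_gpu(12))  # False: too tight for a 12GB card
```

If you are on a smaller card, swap the LLM for a tighter quantisation or a smaller model before trimming Whisper, since speech recognition quality degrades noticeably with the smaller checkpoints.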

Environment and Model Setup

# Install dependencies
pip install faster-whisper vllm TTS fastapi python-multipart uvicorn soundfile

# Start vLLM for the reasoning LLM
# Note: --quantization gptq requires a GPTQ-quantised checkpoint; swap in a
# quantised variant of Llama-3.1-8B-Instruct, or drop the flag to run the
# unquantised model (~16GB VRAM instead of ~6GB)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization gptq --gpu-memory-utilization 0.3 \
  --port 8000 &

The Whisper model and Coqui TTS load within the application process, while vLLM runs as a separate service. Setting --gpu-memory-utilization 0.3 caps vLLM's share of VRAM, leaving room for the other two models.
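Because vLLM starts in the background and spends tens of seconds loading weights, it helps to block until it is ready before accepting traffic. A minimal readiness check, polling vLLM's OpenAI-compatible /v1/models route with only the standard library:

```python
import time
import urllib.error
import urllib.request

def wait_for_vllm(base_url: str = "http://localhost:8000", timeout: float = 120.0) -> bool:
    """Poll vLLM's /v1/models route until the server answers or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1.0)  # server still loading weights; retry
    return False
```

Call `wait_for_vllm()` once at application startup and fail fast if it returns False, rather than letting the first user request hit a cold or absent backend.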

Stage 1: Speech-to-Text with Whisper

from faster_whisper import WhisperModel

# Load Whisper (uses ~3GB VRAM)
whisper_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_audio(audio_path: str) -> str:
    segments, info = whisper_model.transcribe(audio_path, beam_size=5)
    text = " ".join([seg.text for seg in segments])
    return text.strip()

# Example: transcribe a customer query
transcript = transcribe_audio("/tmp/customer_query.wav")
# Output: "What are your opening hours on bank holidays?"
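faster-whisper's segments also carry `start` and `end` timestamps, which are useful for logging and debugging. A small formatting helper, shown here with stub segments (a namedtuple mimicking the segment fields) so it runs without a GPU:

```python
from collections import namedtuple

def format_transcript(segments) -> str:
    """Render Whisper segments as [mm:ss]-prefixed lines for logs."""
    lines = []
    for seg in segments:
        m, s = divmod(int(seg.start), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg.text.strip()}")
    return "\n".join(lines)

# Stub segments mimicking faster-whisper's start/end/text fields
Seg = namedtuple("Seg", "start end text")
demo = [Seg(0.0, 2.1, " What are your opening hours"),
        Seg(2.1, 3.8, " on bank holidays?")]
print(format_transcript(demo))
# [00:00] What are your opening hours
# [00:02] on bank holidays?
```

In production, feed the real `segments` iterator from `whisper_model.transcribe` straight into this helper.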

Stage 2: LLM Reasoning

import requests

def generate_response(transcript: str, context: str = "") -> str:
    system_prompt = ("You are a helpful voice assistant. "
                     "Keep responses concise (under 3 sentences) as they will be spoken aloud.")
    if context:
        system_prompt += f"\n\nUse this context when relevant:\n{context}"
    response = requests.post("http://localhost:8000/v1/chat/completions", json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript}
        ],
        "max_tokens": 150,
        "temperature": 0.3
    })
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

The system prompt instructs the model to keep responses short — long text produces awkwardly long audio. For domain-specific answers, add RAG context using ChromaDB retrieval.
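In a real deployment the retrieval step would query ChromaDB as suggested above; the sketch below uses a keyword-overlap stand-in (pure Python, no vector store) purely to show where retrieval slots into the pipeline. The documents and ranking logic are illustrative:

```python
import re

FAQ_DOCS = [
    "Bank holiday hours: 10 AM to 4 PM.",
    "Weekday hours: 9 AM to 6 PM.",
    "Returns are accepted within 30 days with a receipt.",
]

def _tokens(s: str) -> set:
    return set(re.findall(r"[a-z]+", s.lower()))

def retrieve_context(query: str, docs=FAQ_DOCS, k: int = 1) -> str:
    """Rank docs by keyword overlap with the query (stand-in for ChromaDB)."""
    q = _tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & _tokens(d)), reverse=True)
    return "\n".join(ranked[:k])

context = retrieve_context("What are your opening hours on bank holidays?")
```

Pass the retrieved string as the `context` argument of `generate_response`, and swap the stand-in for a real ChromaDB `collection.query` call once embeddings are in place.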

Stage 3: Text-to-Speech with Coqui

from TTS.api import TTS

# Load Coqui XTTS v2 (~2GB VRAM)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def synthesise_speech(text: str, output_path: str, speaker_wav: str | None = None):
    # XTTS v2 needs a reference voice: pass speaker_wav, a short clip
    # of the target voice, to clone it
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=speaker_wav,  # Clone voice from reference audio
        language="en"
    )
    return output_path

# Generate spoken response (the reference clip path is illustrative)
audio_file = synthesise_speech(
    "Our bank holiday hours are 10 AM to 4 PM.",
    "/tmp/response.wav",
    speaker_wav="reference_voice.wav"  # ~6s sample of the desired voice
)
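It is worth sanity-checking the synthesised clip's length before returning it, since an over-long reply makes for a poor voice-call experience. A helper using only the stdlib wave module (the 15-second threshold is an arbitrary example, not a recommendation):

```python
import wave

def wav_duration(path: str) -> float:
    """Duration of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Example policy check after synthesis:
# if wav_duration("/tmp/response.wav") > 15:
#     ...  # log a warning or re-prompt the LLM for a shorter answer
```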

Combined Voice API

from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask
import tempfile, os

app = FastAPI()

@app.post("/voice-query")
async def voice_query(audio: UploadFile):
    # Save uploaded audio
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(await audio.read())
        input_path = f.name

    # Pipeline: STT -> LLM -> TTS
    transcript = transcribe_audio(input_path)
    response_text = generate_response(transcript)
    output_path = input_path.replace(".wav", "_response.wav")
    synthesise_speech(response_text, output_path)

    os.unlink(input_path)
    # Delete the response file once it has been sent
    return FileResponse(output_path, media_type="audio/wav",
                        background=BackgroundTask(os.unlink, output_path))

The complete voice-to-voice round trip typically completes in 2-4 seconds on an RTX 6000 Pro GPU. Teams deploying customer-facing assistants can integrate this with telephony systems via WebSocket for real-time streaming.

Latency Optimisation

For production voice agents, reduce latency by:

- Using the faster-whisper implementation (CTranslate2 backend) for a 2-4x speed gain over reference Whisper
- Streaming TTS output chunk-by-chunk rather than waiting for full synthesis
- Running the LLM with smaller quantised models when sub-second reasoning is needed
- Pre-loading all models at startup so the first request is not penalised

Monitor end-to-end latency per stage to identify bottlenecks. See more pipeline tutorials, explore private hosting for data-sensitive voice applications, and check industry use cases for voice agent deployment scenarios.
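Per-stage monitoring can be as simple as a timing context manager wrapped around each pipeline call. A minimal sketch (the stage names and the module-level dict are illustrative; in production you would emit these to your metrics system instead):

```python
import time
from contextlib import contextmanager

stage_latency: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock seconds for one pipeline stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        stage_latency[stage] = time.perf_counter() - t0

# Usage inside the /voice-query handler:
# with timed("stt"): transcript = transcribe_audio(input_path)
# with timed("llm"): response_text = generate_response(transcript)
# with timed("tts"): synthesise_speech(response_text, output_path)
```

Logging the three numbers per request quickly shows whether STT, reasoning, or synthesis dominates the 2-4 second round trip.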

Voice AI GPU Servers

Dedicated GPU servers for real-time voice pipelines. Run Whisper, LLMs, and TTS on a single isolated server. UK-hosted.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
