You will build a voice agent that listens to a spoken question, transcribes it with Whisper, generates an answer with a self-hosted LLM, and speaks the response aloud using Coqui TTS. The complete pipeline runs on a single GPU: no cloud speech APIs, no third-party transcription services, no data leaving your server. The end result is a telephone-style AI assistant that completes a voice-to-voice round trip in roughly 2-4 seconds end-to-end on dedicated GPU infrastructure.
Pipeline Architecture
| Stage | Tool | Input | Output | VRAM |
|---|---|---|---|---|
| 1. Speech-to-Text | Whisper Large v3 | Audio (WAV/MP3) | Transcribed text | ~3GB |
| 2. Reasoning | LLaMA 3.1 8B (Q4) | Transcribed text | Response text | ~6GB |
| 3. Text-to-Speech | Coqui XTTS v2 | Response text | Audio (WAV) | ~2GB |
Total VRAM: approximately 11GB, which fits on a single 24GB GPU with headroom for KV cache and batching. Load all three models at startup and keep them resident; swapping models in and out of VRAM per request adds seconds of overhead.
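The budget in the table can be sanity-checked in a few lines (per-model figures are the approximations from the table, not measured values):

```python
# Approximate VRAM per stage, in GB (figures from the table above)
vram_gb = {"whisper_large_v3": 3, "llama_3_1_8b_q4": 6, "xtts_v2": 2}

total = sum(vram_gb.values())   # 11GB resident across all three models
headroom = 24 - total           # 13GB spare on a 24GB card for KV cache/batching
print(total, headroom)
```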
Environment and Model Setup
```bash
# Install dependencies
pip install faster-whisper vllm TTS fastapi python-multipart uvicorn soundfile

# Start vLLM for the reasoning LLM as a background service
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --quantization gptq \
    --gpu-memory-utilization 0.3 \
    --port 8000 &
```
The Whisper model and Coqui TTS load within the application process, while vLLM runs as a separate service. `--gpu-memory-utilization 0.3` caps vLLM at roughly 30% of the card (about 7GB on a 24GB GPU), leaving VRAM for the other two models. Note that `--quantization gptq` expects a GPTQ-quantized checkpoint; the base `meta-llama/Llama-3.1-8B-Instruct` weights are FP16, so point `--model` at a GPTQ variant of the model, or drop the flag and accept the larger FP16 footprint.
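vLLM takes a while to load the model, so it is worth confirming the server is up before accepting traffic. The OpenAI-compatible server exposes a `/v1/models` endpoint; a small probe (the helper name is ours, not part of vLLM) can gate application startup:

```python
import requests

def server_ready(base_url: str = "http://localhost:8000") -> bool:
    """Return True once the vLLM OpenAI-compatible server responds."""
    try:
        r = requests.get(f"{base_url}/v1/models", timeout=2)
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

# Poll until the model has finished loading:
# while not server_ready():
#     time.sleep(5)
```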
Stage 1: Speech-to-Text with Whisper
```python
from faster_whisper import WhisperModel

# Load Whisper once at startup (uses ~3GB VRAM)
whisper_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_audio(audio_path: str) -> str:
    segments, info = whisper_model.transcribe(audio_path, beam_size=5)
    # Segment texts carry leading whitespace; strip each before joining
    text = " ".join(seg.text.strip() for seg in segments)
    return text.strip()

# Example: transcribe a customer query
transcript = transcribe_audio("/tmp/customer_query.wav")
# Output: "What are your opening hours on bank holidays?"
```
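Whisper transcripts can still arrive with uneven spacing around segment boundaries, which wastes LLM tokens and can confuse short prompts. A small normaliser (our own helper, not part of faster-whisper) keeps the text tidy before it reaches the LLM:

```python
import re

def normalise_transcript(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the result."""
    return re.sub(r"\s+", " ", text).strip()

normalise_transcript("  What are  your opening hours\non bank holidays? ")
# → "What are your opening hours on bank holidays?"
```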
Stage 2: LLM Reasoning
```python
import requests

def generate_response(transcript: str, context: str = "") -> str:
    response = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "system", "content": "You are a helpful voice assistant. "
                 "Keep responses concise (under 3 sentences) as they will be spoken aloud."},
                {"role": "user", "content": transcript}
            ],
            "max_tokens": 150,
            "temperature": 0.3
        },
        timeout=30,
    )
    response.raise_for_status()  # Fail loudly if vLLM is down or overloaded
    return response.json()["choices"][0]["message"]["content"]
```
The system prompt instructs the model to keep responses short, since long text produces awkwardly long audio. For domain-specific answers, inject retrieved context into the prompt (for example from a ChromaDB collection); the currently unused `context` parameter in `generate_response` is the hook for this.
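As a sketch of that RAG hook: retrieved chunks can be folded into the system prompt before the chat-completions call. The helper below is our own illustration, not part of the tutorial's code; with ChromaDB, `context_chunks` would typically come from `collection.query(query_texts=[transcript], n_results=3)["documents"][0]`.

```python
def build_messages(transcript: str, context_chunks: list[str]) -> list[dict]:
    """Fold retrieved document chunks into the voice assistant's system prompt."""
    system = ("You are a helpful voice assistant. Keep responses concise "
              "(under 3 sentences) as they will be spoken aloud.")
    if context_chunks:
        system += "\nAnswer using this context:\n" + "\n".join(context_chunks)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": transcript},
    ]

msgs = build_messages("When are you open?", ["Bank holiday hours: 10 AM to 4 PM."])
```

The resulting `msgs` list drops straight into the `"messages"` field of the chat-completions request.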
Stage 3: Text-to-Speech with Coqui
```python
from TTS.api import TTS

# Load Coqui XTTS v2 once at startup (~2GB VRAM)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def synthesise_speech(text: str, output_path: str,
                      speaker_wav: str = "reference_voice.wav") -> str:
    # XTTS v2 is a voice-cloning model: it needs a short reference clip
    # (a few seconds of clean speech) to define the output voice. The
    # default path here is a placeholder; point it at your own recording.
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=speaker_wav,
        language="en"
    )
    return output_path

# Generate spoken response in the cloned reference voice
audio_file = synthesise_speech(
    "Our bank holiday hours are 10 AM to 4 PM.",
    "/tmp/response.wav"
)
```
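XTTS synthesis time grows with input length, so chunking the response by sentence lets playback start while later sentences are still being synthesised. A minimal splitter is sketched below; the helper is ours, and production code would use a proper sentence tokenizer rather than a regex:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence split on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

split_sentences("We open at 10 AM. We close at 4 PM.")
# → ["We open at 10 AM.", "We close at 4 PM."]
```

Each sentence can then be passed to `synthesise_speech` individually and queued for playback as it completes.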
Combined Voice API
```python
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask
import tempfile, os

app = FastAPI()

@app.post("/voice-query")
async def voice_query(audio: UploadFile):
    # Save uploaded audio to a temp file
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(await audio.read())
        input_path = f.name

    # Pipeline: STT -> LLM -> TTS
    transcript = transcribe_audio(input_path)
    response_text = generate_response(transcript)
    output_path = input_path.replace(".wav", "_response.wav")
    synthesise_speech(response_text, output_path)

    os.unlink(input_path)
    # Delete the generated file once the response has been sent
    return FileResponse(
        output_path,
        media_type="audio/wav",
        background=BackgroundTask(os.unlink, output_path),
    )
```
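A client exercises the endpoint by POSTing a WAV file and saving the returned audio. The URL, port, and file names below are illustrative; use whatever host and port you pass to uvicorn:

```python
import requests

def query_voice_agent(url: str, wav_path: str, out_path: str = "answer.wav") -> str:
    """POST a recorded question to /voice-query and save the spoken reply."""
    with open(wav_path, "rb") as f:
        r = requests.post(url, files={"audio": ("query.wav", f, "audio/wav")},
                          timeout=60)
    r.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(r.content)
    return out_path

# query_voice_agent("http://localhost:8001/voice-query", "question.wav")
```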
The complete voice-to-voice round trip typically completes in 2-4 seconds on an RTX 6000 Pro GPU. Teams deploying customer-facing assistants can integrate this with telephony systems via WebSocket for real-time streaming.
Latency Optimisation
For production voice agents, reduce latency by:

- using the CTranslate2-based faster-whisper backend (as in Stage 1) for a 2-4x speed gain over reference Whisper;
- streaming TTS output chunk-by-chunk rather than waiting for full synthesis;
- running the LLM with smaller quantised models when sub-second reasoning is needed;
- pre-loading all models at startup so the first request is not penalised.

Monitor end-to-end latency per stage to identify bottlenecks.
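Per-stage latency monitoring can be as simple as a timing context manager wrapped around each pipeline call (a sketch, not a production metrics setup; in practice you would export these numbers to your monitoring stack):

```python
import time
from contextlib import contextmanager

stage_timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock seconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage] = time.perf_counter() - start

# Usage inside the /voice-query handler:
with timed("stt"):
    time.sleep(0.01)  # stands in for transcribe_audio(...)
print(stage_timings)
```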