Your Speech-to-Speech Pipeline Takes Too Long
You have built a voice pipeline — audio in via Whisper, processed through an LLM, and spoken back via TTS — but the round-trip latency is 8-15 seconds. Users experience an uncomfortable pause between finishing their sentence and hearing a response. Conversational voice interfaces need sub-2-second latency to feel natural. On a dedicated GPU server, this target is achievable with the right architecture.
Where the Latency Hides
Profile each stage to identify your biggest bottleneck:
import time
# Stage 1: Audio capture and preprocessing
t0 = time.time()
audio = capture_audio() # Typically 0.1-0.5s for VAD endpoint detection
t1 = time.time()
# Stage 2: Whisper transcription
transcript = whisper_model.transcribe(audio) # 0.5-3s depending on model/length
t2 = time.time()
# Stage 3: LLM response generation
response = llm.generate(transcript) # 0.3-2s for first tokens
t3 = time.time()
# Stage 4: TTS synthesis
audio_out = tts.synthesize(response) # 0.5-2s for full utterance
t4 = time.time()
# Stage 5: Audio playback start
play(audio_out) # Buffering delay: 0.1-0.3s
t5 = time.time()
print(f"Capture: {t1-t0:.3f}s")
print(f"Transcribe: {t2-t1:.3f}s")
print(f"LLM: {t3-t2:.3f}s")
print(f"TTS: {t4-t3:.3f}s")
print(f"Playback: {t5-t4:.3f}s")
print(f"Total: {t5-t0:.3f}s")
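Rather than sprinkling `time.time()` calls through the pipeline, the same measurement can be wrapped in a small reusable context manager. This is a sketch; `StageTimer` is a hypothetical helper, and the `time.sleep` stands in for a real pipeline stage:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Collects per-stage wall-clock timings for the pipeline."""
    def __init__(self):
        self.timings = {}

    @contextmanager
    def stage(self, name):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = time.perf_counter() - t0

    def report(self):
        for name, dt in self.timings.items():
            print(f"{name}: {dt:.3f}s")
        print(f"Total: {sum(self.timings.values()):.3f}s")

timer = StageTimer()
with timer.stage("Transcribe"):
    time.sleep(0.01)  # stand-in for whisper_model.transcribe(audio)
timer.report()
```

`time.perf_counter()` is preferred over `time.time()` here because it is monotonic and has higher resolution, which matters when individual stages drop below 100 ms.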
Stream Whisper Transcription
Do not wait for the full utterance before starting transcription. Process audio in chunks as it arrives:
from faster_whisper import WhisperModel
import numpy as np
model = WhisperModel("medium", device="cuda", compute_type="int8")
class StreamingTranscriber:
    def __init__(self, model, chunk_duration=1.0, sr=16000):
        self.model = model
        self.buffer = np.array([], dtype=np.float32)
        self.chunk_size = int(chunk_duration * sr)
        self.sr = sr

    def feed_audio(self, chunk):
        self.buffer = np.concatenate([self.buffer, chunk])
        if len(self.buffer) >= self.chunk_size:
            # Transcribe the accumulated audio
            segments, _ = self.model.transcribe(
                self.buffer, language="en", vad_filter=True)
            text = " ".join([s.text for s in segments])
            self.buffer = np.array([], dtype=np.float32)
            return text
        return None  # Not enough audio yet
# Use 'medium' instead of 'large-v3' for streaming — 2x faster
# with acceptable accuracy for conversational use
Overlap Pipeline Stages
The critical optimisation: start downstream stages before upstream stages finish. Stream LLM tokens directly into TTS:
import asyncio
async def streaming_voice_pipeline(audio_input):
    # Start transcription immediately
    transcript = await transcribe_streaming(audio_input)

    # Stream LLM response tokens as they arrive
    tts_buffer = []
    async for token in llm.stream_generate(transcript):
        tts_buffer.append(token)
        text_so_far = "".join(tts_buffer)
        # When we have a complete sentence, synthesise and play it
        if text_so_far.rstrip().endswith(('.', '!', '?', ':')):
            audio_chunk = await tts.synthesize_async(text_so_far)
            await play_audio_async(audio_chunk)
            tts_buffer = []

    # Handle any remaining text
    if tts_buffer:
        remaining = "".join(tts_buffer)
        audio_chunk = await tts.synthesize_async(remaining)
        await play_audio_async(audio_chunk)
# Latency comparison:
# Sequential: Whisper(2s) + LLM(1.5s) + TTS(1s) = 4.5s to first audio
# Streamed: Whisper(2s) + LLM_first_sentence(0.3s) + TTS_sentence(0.2s) = 2.5s
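The sentence-boundary check in the pipeline above can be factored into a small, testable generator that turns a token stream into TTS-sized chunks. This is a sketch; the boundary characters mirror the ones used in the pipeline, and `sentence_chunks` is a hypothetical helper name:

```python
SENTENCE_ENDINGS = ('.', '!', '?', ':')

def sentence_chunks(tokens):
    """Group a stream of LLM tokens into complete sentences for TTS."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if "".join(buffer).rstrip().endswith(SENTENCE_ENDINGS):
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush any trailing partial sentence
        yield "".join(buffer)

tokens = ["Hello", " there", ".", " How", " can", " I", " help", "?"]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', ' How can I help?']
```

Keeping this logic out of the async pipeline means you can tune the boundary set (for example, adding ';' or newline) without touching the TTS plumbing.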
Co-Locate Models on GPU
Keep all three models loaded on the same GPU to eliminate data transfer overhead:
# Memory budget on a 24 GB GPU (e.g. RTX 4090):
# faster-whisper medium INT8: ~1.5 GB
# LLM 7B quantised INT4: ~4 GB
# XTTS v2: ~2 GB
# Total: ~7.5 GB (plenty of headroom)
# Load all models at startup — never unload between requests
whisper_model = WhisperModel("medium", device="cuda", compute_type="int8")
llm = load_llm("model.gguf", n_gpu_layers=-1)
tts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# Use CUDA streams for concurrent execution where possible
import torch
whisper_stream = torch.cuda.Stream()
tts_stream = torch.cuda.Stream()
Optimise Voice Activity Detection
Aggressive endpoint detection shaves hundreds of milliseconds off perceived latency:
# Silero VAD for fast endpoint detection
import torch
vad_model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, _, _, _) = utils
# Tune for low-latency endpoint detection
def detect_speech_end(audio_chunk, sr=16000):
    speech_timestamps = get_speech_timestamps(
        audio_chunk,
        vad_model,
        threshold=0.5,
        min_silence_duration_ms=300,  # Shorter = faster endpoint
        speech_pad_ms=100
    )
    if speech_timestamps and len(audio_chunk)/sr - speech_timestamps[-1]['end']/sr > 0.3:
        return True  # Speaker has stopped
    return False
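The trailing-silence decision can be separated from the VAD call itself, which makes it easy to unit-test without loading the model. This sketch assumes the sample-indexed timestamp dicts that Silero returns by default; `has_speech_ended` is a hypothetical helper name:

```python
def has_speech_ended(speech_timestamps, total_samples, sr=16000, silence_s=0.3):
    """Return True once at least `silence_s` of silence follows the last speech."""
    if not speech_timestamps:
        return False
    trailing_silence = (total_samples - speech_timestamps[-1]['end']) / sr
    return trailing_silence > silence_s

# Speech ends at sample 40000; the buffer holds 48000 samples → 0.5 s of silence
print(has_speech_ended([{'start': 8000, 'end': 40000}], 48000))
# → True
```

Separating the decision also lets you experiment with adaptive thresholds (for example, shortening `silence_s` after a question mark) without re-running VAD.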
For natural-feeling voice interactions on your GPU server, the combination of faster-whisper, streamed LLM inference, and sentence-level TTS synthesis achieves round-trip latency under 2 seconds. The Whisper hosting and Coqui TTS hosting pages cover individual component setup. Check the tutorials section for architecture patterns, benchmarks for latency numbers, and our vLLM production guide for the LLM serving layer.
Low-Latency Voice AI
Build real-time voice pipelines on GigaGPU. Co-locate Whisper, LLM, and TTS on a single high-performance GPU.
Browse GPU Servers