You will build a pipeline that takes spoken audio in one language, transcribes it with Whisper, translates it with an LLM, and synthesises the translated text as natural speech. The end result: upload a 5-minute German customer call and receive an English audio translation within 90 seconds. No cloud translation APIs, no per-minute charges, no audio data leaving your server. Here is the complete multilingual pipeline on dedicated GPU infrastructure.
## Pipeline Architecture
| Stage | Tool | Input | Output | VRAM |
|---|---|---|---|---|
| 1. Transcription | Whisper Large v3 | Source language audio | Source language text | ~3GB |
| 2. Translation | LLaMA 3.1 8B | Source text | Target language text | ~6GB |
| 3. Synthesis | Coqui XTTS v2 | Translated text | Target language audio | ~2GB |
## Stage 1: Multilingual Transcription
```python
from faster_whisper import WhisperModel

whisper = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe(audio_path: str) -> dict:
    # beam_size=5 trades a little speed for noticeably better accuracy
    segments, info = whisper.transcribe(audio_path, beam_size=5)
    text_segments = [
        {"start": seg.start, "end": seg.end, "text": seg.text}
        for seg in segments
    ]
    return {
        "language": info.language,
        "text": " ".join(s["text"] for s in text_segments),
        "segments": text_segments,
    }
```
Whisper Large v3 automatically detects the source language and transcribes with timestamps. The segment-level output preserves timing for subtitle generation.
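Those timestamps can feed a subtitle file directly. A minimal sketch of SRT generation from the `segments` list returned by `transcribe` above (the `segments_to_srt` helper is illustrative, not part of faster-whisper):

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render timestamped segments as SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

Run the translated segment texts through the same helper to produce translated subtitles alongside the audio.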
## Stage 2: LLM Translation
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def translate_text(text: str, source_lang: str, target_lang: str = "English") -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{
            "role": "system",
            "content": f"Translate the following {source_lang} text to {target_lang}. "
                       f"Preserve the original meaning, tone, and technical terminology. "
                       f"Return only the translation, no commentary."
        }, {"role": "user", "content": text}],
        max_tokens=2000,
        temperature=0.2,  # low temperature keeps translations literal and stable
    )
    return response.choices[0].message.content
```
LLMs produce higher-quality translations than traditional MT systems for conversational and domain-specific content because they understand context. The vLLM server handles batched translation of multiple segments efficiently.
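One way to feed the server efficient batches is to group consecutive Whisper segments into chunks under a character budget, so each request carries a coherent span of text without approaching the context limit. This grouping helper is a hypothetical sketch; the 1,500-character budget is an assumption, not a vLLM constraint:

```python
def batch_segments(segments: list[dict], max_chars: int = 1500) -> list[str]:
    """Group consecutive segment texts into chunks of at most max_chars,
    cutting only at segment boundaries so sentences stay intact."""
    batches, current = [], ""
    for seg in segments:
        text = seg["text"].strip()
        if current and len(current) + len(text) + 1 > max_chars:
            batches.append(current)
            current = text
        else:
            current = f"{current} {text}".strip() if current else text
    if current:
        batches.append(current)
    return batches
```

Each returned chunk can then be passed to `translate_text`; because Whisper segments follow natural pauses, the chunk boundaries tend to fall between sentences.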
## Stage 3: Speech Synthesis
```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def synthesise(text: str, output_path: str, target_lang: str = "en",
               speaker_wav: str = "speaker_reference.wav"):
    # XTTS v2 is a multi-speaker model: it requires a short reference clip
    # (speaker_wav) to define the output voice. Replace the placeholder
    # path with a 6-10 second sample -- the original speaker's voice for
    # cloning, or any clean recording for a neutral voice.
    tts.tts_to_file(text=text, file_path=output_path,
                    speaker_wav=speaker_wav, language=target_lang)
    return output_path
```
Coqui XTTS supports multiple output languages with natural prosody. For voice cloning, provide a reference audio sample from the target speaker.
## Combined Translation Endpoint
```python
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
import tempfile

app = FastAPI()

@app.post("/translate-audio")
async def translate_audio(audio: UploadFile, target_lang: str = "English"):
    # Persist the upload to disk so Whisper can read it from a path
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(await audio.read())
        input_path = f.name

    # Pipeline: transcribe -> translate -> synthesise
    transcript = transcribe(input_path)
    translated = translate_text(transcript["text"], transcript["language"], target_lang)

    lang_code = {"English": "en", "French": "fr", "Spanish": "es", "German": "de"}
    output_path = input_path.replace(".wav", f"_{target_lang}.wav")
    synthesise(translated, output_path, lang_code.get(target_lang, "en"))
    return FileResponse(output_path, media_type="audio/wav")
```
## Production Considerations

For production deployments:

- Process long audio files in segments (5-minute chunks) to stay within model context limits.
- Implement a queue for batch processing of multiple files.
- Add language detection validation to catch Whisper misidentification.
- Cache translations of repeated phrases.
- Monitor translation quality with periodic human review.

For domain-specific terminology (medical, legal, technical), add glossary terms to the translation prompt. Deploy on private infrastructure for confidential audio. See model options for multilingual specialists, chatbot hosting for real-time voice interfaces, more tutorials, and industry use cases for translation deployments.
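The chunking recommendation can be sketched by grouping Whisper's timestamped segments into windows of at most five minutes, cutting only at segment boundaries so no sentence is split. A hypothetical helper (the 300-second window is this tutorial's suggestion, not a hard model limit):

```python
def chunk_segments(segments: list[dict], max_seconds: float = 300.0) -> list[list[dict]]:
    """Split timestamped segments into chunks whose span stays within
    max_seconds, breaking only between segments."""
    chunks, current, chunk_start = [], [], None
    for seg in segments:
        if chunk_start is None:
            chunk_start = seg["start"]
        # Start a new chunk once this segment would push past the window
        if seg["end"] - chunk_start > max_seconds and current:
            chunks.append(current)
            current, chunk_start = [], seg["start"]
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then transcribed text that can be translated and synthesised independently, which also makes the queue-based batch processing straightforward to parallelise.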
## Translation AI GPU Servers
Dedicated GPU servers for multilingual speech pipelines. Run Whisper, LLMs, and TTS on isolated UK infrastructure.