
Speech Translation Pipeline with Whisper, LLM, and TTS

Build a speech-to-speech translation pipeline using Whisper for transcription, an LLM for translation, and TTS for output synthesis across multiple languages on a GPU server.

You will build a pipeline that takes spoken audio in one language, transcribes it with Whisper, translates it with an LLM, and synthesises the translated text as natural speech. The end result: upload a 5-minute German customer call and receive an English audio translation within 90 seconds. No cloud translation APIs, no per-minute charges, no audio data leaving your server. This tutorial walks through the complete multilingual pipeline on dedicated GPU infrastructure.

Pipeline Architecture

| Stage | Tool | Input | Output | VRAM |
|---|---|---|---|---|
| 1. Transcription | Whisper Large v3 | Source-language audio | Source-language text | ~3GB |
| 2. Translation | LLaMA 3.1 8B | Source text | Target-language text | ~6GB |
| 3. Synthesis | Coqui XTTS v2 | Translated text | Target-language audio | ~2GB |

Stage 1: Multilingual Transcription

from faster_whisper import WhisperModel

whisper = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe(audio_path: str) -> dict:
    segments, info = whisper.transcribe(audio_path, beam_size=5)
    text_segments = []
    for seg in segments:
        text_segments.append({
            "start": seg.start, "end": seg.end, "text": seg.text
        })
    return {
        "language": info.language,
        "text": " ".join([s["text"] for s in text_segments]),
        "segments": text_segments
    }

Whisper Large v3 automatically detects the source language and transcribes with timestamps. The segment-level output preserves timing for subtitle generation.
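Because timing survives transcription, subtitles come almost for free. A minimal sketch (the helper names format_timestamp and to_srt are illustrative, not part of faster-whisper) that renders the segment list from transcribe() as an SRT file:

def format_timestamp(seconds: float) -> str:
    # SRT timestamps use HH:MM:SS,mmm
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[dict]) -> str:
    # Render the transcribe() segment list as numbered SRT blocks
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_timestamp(seg['start'])} --> "
            f"{format_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

Write the returned string to a .srt file alongside the audio for players and editors to pick up.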

Stage 2: LLM Translation

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def translate_text(text: str, source_lang: str, target_lang: str = "English") -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{
            "role": "system",
            "content": f"Translate the following {source_lang} text to {target_lang}. "
                       f"Preserve the original meaning, tone, and technical terminology. "
                       f"Return only the translation, no commentary."
        }, {"role": "user", "content": text}],
        max_tokens=2000, temperature=0.2
    )
    return response.choices[0].message.content

LLMs often produce higher-quality translations than traditional MT systems for conversational and domain-specific content because they condition on the full passage rather than translating sentence by sentence. The vLLM server handles batched translation of multiple segments efficiently, as the sketch below shows.
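A minimal sketch of segment-level translation, assuming the vLLM server was started with something like vllm serve meta-llama/Llama-3.1-8B-Instruct on port 8000; the helper name translate_segments and the worker count are our choices. Firing requests concurrently lets vLLM's continuous batching merge them on the GPU:

from concurrent.futures import ThreadPoolExecutor

def translate_segments(segments: list[dict], source_lang: str,
                       target_lang: str = "English") -> list[dict]:
    # Submit segment translations in parallel; vLLM batches the
    # concurrent requests into shared forward passes where possible.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [
            pool.submit(translate_text, seg["text"], source_lang, target_lang)
            for seg in segments
        ]
        # Futures are in submit order, so results stay aligned with segments
        return [
            {**seg, "translation": f.result()}
            for seg, f in zip(segments, futures)
        ]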

Stage 3: Speech Synthesis

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def synthesise(text: str, output_path: str, target_lang: str = "en",
               speaker_wav: str = "reference_speaker.wav"):
    # XTTS v2 is multi-speaker: pass a reference clip via speaker_wav
    # (the default path here is a placeholder; supply your own sample)
    tts.tts_to_file(text=text, file_path=output_path,
                    language=target_lang, speaker_wav=speaker_wav)
    return output_path

Coqui XTTS v2 supports multiple output languages with natural prosody. Because it is a multi-speaker model, every call needs a reference sample (via speaker_wav) or a built-in speaker name; for voice cloning, provide a clip from the target speaker.
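A usage sketch of voice cloning; the file names are placeholders:

# Clone a specific voice: the reference clip determines the output voice.
# caller_sample.wav is a hypothetical 6-30 second clean recording.
synthesise(
    "Hello, thanks for calling. How can I help you today?",
    "cloned_output.wav",
    target_lang="en",
    speaker_wav="caller_sample.wav",
)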

Combined Translation Endpoint

from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
import tempfile

app = FastAPI()

@app.post("/translate-audio")
async def translate_audio(audio: UploadFile, target_lang: str = "English"):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(await audio.read())
        input_path = f.name

    # Pipeline: transcribe -> translate -> synthesise
    transcript = transcribe(input_path)
    translated = translate_text(transcript["text"], transcript["language"], target_lang)
    lang_code = {"English": "en", "French": "fr", "Spanish": "es", "German": "de"}
    output_path = input_path.replace(".wav", f"_{target_lang}.wav")
    synthesise(translated, output_path, lang_code.get(target_lang, "en"))

    return FileResponse(output_path, media_type="audio/wav")
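
A quick client-side check, assuming the app runs under uvicorn on port 8080 (vLLM already occupies 8000); file names are placeholders:

import requests

# Post a source-language recording and save the translated audio.
with open("german_call.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/translate-audio",
        files={"audio": ("german_call.wav", f, "audio/wav")},
        params={"target_lang": "English"},  # query parameter on the endpoint
        timeout=600,  # the full pipeline can take a while on long files
    )
resp.raise_for_status()
with open("translated_en.wav", "wb") as out:
    out.write(resp.content)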

Production Considerations

For production deployments:

- Process long audio files in segments (5-minute chunks) to stay within model context limits; a chunking sketch follows this list.
- Implement a queue for batch processing of multiple files.
- Add language-detection validation to catch Whisper misidentification.
- Cache translations of repeated phrases.
- Monitor translation quality with periodic human review.

For domain-specific terminology (medical, legal, technical), add glossary terms to the translation prompt. Deploy on private infrastructure for confidential audio. See model options for multilingual specialists, chatbot hosting for real-time voice interfaces, more tutorials, and industry use cases for translation deployments.
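A minimal sketch of the chunking step, using pydub (our choice; ffmpeg-based splitting works equally well). The helper name split_audio is illustrative:

from pydub import AudioSegment

CHUNK_MS = 5 * 60 * 1000  # 5-minute chunks

def split_audio(path: str) -> list[str]:
    # Slice the recording into 5-minute chunks; transcribe and
    # translate each chunk independently, then concatenate in order.
    audio = AudioSegment.from_file(path)
    chunk_paths = []
    for i in range(0, len(audio), CHUNK_MS):
        chunk_path = f"{path}.part{i // CHUNK_MS}.wav"
        audio[i:i + CHUNK_MS].export(chunk_path, format="wav")
        chunk_paths.append(chunk_path)
    return chunk_paths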
