What You’ll Build
In 30 minutes, you will have a production translation API that accepts text in any of 100+ languages and returns accurate translations with support for domain-specific terminology, glossary enforcement, and batch processing. Running open-source translation models on a dedicated GPU server, your API translates 50,000 words per minute at zero per-character cost — making it viable for document-scale translation that would cost thousands through cloud APIs.
Cloud translation services charge $10-$20 per million characters and send every word to third-party servers. For legal documents, medical records, or proprietary content, that creates compliance risks. Self-hosted translation on open-source models keeps all text on your infrastructure while delivering quality that rivals commercial services for supported language pairs.
Architecture Overview
The API offers two translation backends: a dedicated translation model (NLLB-200 or Helsinki-NLP OPUS models) for high-throughput pairs, and an LLM through vLLM for nuanced translation with context awareness. The dedicated model handles straightforward translation at maximum speed, while the LLM path handles documents requiring tone matching, domain expertise, or complex formatting preservation.
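A minimal routing policy for the two backends can be sketched as a pure function. The request fields and backend names below are illustrative assumptions, not part of any API defined in this guide:

```python
def choose_backend(request: dict) -> str:
    """Pick a translation backend for a request.

    Illustrative heuristic: route to the LLM path when the request
    asks for tone matching, a domain style, or markup-heavy input;
    otherwise use the fast dedicated NLLB model.
    """
    needs_llm = (
        request.get("tone")                 # e.g. "formal", "casual"
        or request.get("domain")            # e.g. "legal", "medical"
        or request.get("format") == "html"  # markup to preserve
    )
    return "llm" if needs_llm else "nllb"
```

In production you would likely also factor in queue depth and per-backend latency, but the core decision is this kind of feature check on the request.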
A language detection layer automatically identifies source language when not specified. The API accepts plain text, HTML (preserving markup), and JSON payloads (translating string values while preserving keys). Glossary support lets you enforce specific translations for brand names, technical terms, and domain-specific vocabulary.
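The JSON handling described above, translating string values while preserving keys, comes down to a recursive walk. A minimal sketch, with the actual translation call injected as a function so the traversal logic stands alone:

```python
from typing import Any, Callable

def translate_json(payload: Any, translate_fn: Callable[[str], str]) -> Any:
    """Recursively translate string values in a JSON-like structure,
    leaving keys and non-string values untouched."""
    if isinstance(payload, dict):
        return {k: translate_json(v, translate_fn) for k, v in payload.items()}
    if isinstance(payload, list):
        return [translate_json(v, translate_fn) for v in payload]
    if isinstance(payload, str):
        return translate_fn(payload)
    return payload  # numbers, booleans, None pass through unchanged
```

For real payloads you would pass the model-backed translate function (or a batched variant that collects all strings first, translates them in one call, and writes the results back).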
GPU Requirements
| Use Case | Recommended GPU | VRAM | Throughput |
|---|---|---|---|
| NLLB-200 (3.3B) | RTX 5090 | 24 GB | ~50k words/min |
| NLLB + LLM fallback | RTX 6000 Pro | 40 GB | ~30k words/min |
| LLM-primary (70B) | RTX 6000 Pro 96 GB | 80 GB | ~10k words/min |
NLLB-200 at 3.3B parameters uses roughly 7 GB of VRAM in FP16 and covers 200 languages with good quality. For premium quality on major language pairs, LLM-based translation with a 70B model produces more natural output. See our self-hosted LLM guide for model selection by language pair.
Step-by-Step Build
Deploy the translation model on your GPU server and build the API with language detection, glossary support, and batch endpoints.
```python
from fastapi import FastAPI, HTTPException
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

app = FastAPI()

model_name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")

# NLLB uses language codes with an explicit script suffix
LANG_CODES = {"en": "eng_Latn", "fr": "fra_Latn", "de": "deu_Latn",
              "es": "spa_Latn", "ja": "jpn_Jpan", "zh": "zho_Hans"}

@app.post("/v1/translate")
async def translate(text: str, source_lang: str = "en",
                    target_lang: str = "fr",
                    glossary: dict | None = None):
    if source_lang not in LANG_CODES or target_lang not in LANG_CODES:
        raise HTTPException(status_code=400,
                            detail="Unsupported language code")
    tokenizer.src_lang = LANG_CODES[source_lang]
    tgt_code = LANG_CODES[target_lang]
    inputs = tokenizer(text, return_tensors="pt", padding=True,
                       truncation=True, max_length=1024).to("cuda")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            # Force the decoder to start generating in the target language
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
            max_new_tokens=1024,
        )
    translated = tokenizer.decode(output[0], skip_special_tokens=True)
    # Glossary pass: fix terms (brand names, acronyms) that the model
    # left untranslated or rendered incorrectly
    if glossary:
        for source_term, target_term in glossary.items():
            translated = translated.replace(source_term, target_term)
    return {"translated_text": translated,
            "source_lang": source_lang,
            "target_lang": target_lang}
```
Add a batch endpoint that accepts arrays of text segments for document-scale translation. The OpenAI-compatible wrapper lets you use the LLM path with chat completion endpoints for context-aware translation. See production setup for request batching and GPU utilisation optimisation.
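The batching logic for that endpoint can be sketched independently of the web layer. The character and segment limits below are illustrative assumptions; tune them to your GPU's VRAM budget:

```python
def make_batches(segments: list[str], max_chars: int = 4000,
                 max_segments: int = 32) -> list[list[str]]:
    """Group text segments into batches for a single generate() call.

    Caps both segment count and total characters per batch so padded
    batches stay within a predictable VRAM budget.
    """
    batches, current, size = [], [], 0
    for seg in segments:
        # Flush the current batch before it would exceed either limit
        if current and (size + len(seg) > max_chars
                        or len(current) >= max_segments):
            batches.append(current)
            current, size = [], 0
        current.append(seg)
        size += len(seg)
    if current:
        batches.append(current)
    return batches
```

Each batch then goes through the tokenizer with `padding=True` and a single `model.generate` call, which is where the throughput gains over per-segment requests come from.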
Quality and Customisation
Measure translation quality with BLEU scores against reference translations for your domain. For technical, legal, or medical content, build domain glossaries that enforce correct terminology. The glossary system handles brand names, product terms, and acronyms that generic translation models mishandle.
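A stricter glossary pass than plain string replacement uses whole-word, case-insensitive matching, applying longer terms first so multi-word entries take precedence over their substrings. A minimal sketch:

```python
import re

def apply_glossary(text: str, glossary: dict[str, str]) -> str:
    """Enforce glossary terms with whole-word, case-insensitive matching.

    Longest source terms are applied first so that multi-word entries
    win over single-word entries they contain.
    """
    for source_term in sorted(glossary, key=len, reverse=True):
        pattern = r"\b" + re.escape(source_term) + r"\b"
        text = re.sub(pattern, glossary[source_term], text,
                      flags=re.IGNORECASE)
    return text
```

Word boundaries prevent partial-word hits, and `re.escape` keeps terms containing punctuation (product names, acronyms with dots) from being read as regex syntax.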
For documents requiring human-quality output, use the LLM translation path with context-aware prompting that maintains tone, formality level, and document structure. Post-editing workflows flag low-confidence segments for human review while auto-approving high-confidence translations.
Deploy Your Translation API
A self-hosted translation API eliminates per-character billing and keeps sensitive documents on your infrastructure. Serve internal localisation teams, power multilingual products, or translate documents at scale. Launch on GigaGPU dedicated GPU hosting and translate without limits. Browse more API use cases and tutorials in our library.