Self-hosted machine translation on an RTX 5060 Ti 16GB server at our hosting replaces the per-character API fees of commercial providers with a flat monthly cost.
## Translation Models
| Model | Strength | VRAM |
|---|---|---|
| Qwen 2.5 14B AWQ | SOTA among open multilingual models | 9 GB |
| Llama 3.1 8B FP8 | Strong European languages | 8 GB |
| Cohere Aya 23 8B | 101 languages, fluent | 8 GB |
| NLLB-200-3.3B (specialised MT) | 200 languages, fast | 7 GB |
| Mistral Nemo 12B FP8 | Formal-register EU languages | 12.5 GB |
## Throughput
- Qwen 2.5 14B AWQ: ~70 t/s decode – translates a 500-word article in ~10 s
- Llama 3.1 8B FP8: ~112 t/s – same article in ~6 s
- NLLB-200-3.3B: ~350 t/s single-stream – fastest pure MT
- Batched high-throughput (book-scale translation): 700+ aggregate t/s on Llama 3 8B
For a 100k-word book at batch 32: ~2 hours end-to-end. A commercial API would charge £40-100 for the same volume, depending on provider.
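To hit the aggregate batched throughput above, the client has to keep many requests in flight at once so vLLM's continuous batching stays full. A minimal book-translation sketch, assuming the vLLM server from the Deployment section is running locally; the endpoint path is vLLM's standard OpenAI-compatible route, but the chunk size, worker count, and prompt wording are illustrative choices:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's OpenAI-compatible endpoint
MODEL = "Qwen/Qwen2.5-14B-Instruct-AWQ"

def chunk_text(text: str, max_words: int = 400) -> list[str]:
    """Split the book into roughly article-sized chunks for independent requests."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def translate_chunk(chunk: str, target_lang: str) -> str:
    """Translate one chunk; vLLM batches concurrent requests on the GPU automatically."""
    payload = json.dumps({
        "model": MODEL,
        "messages": [
            {"role": "system",
             "content": f"Translate the user's text to {target_lang}. Output only the translation."},
            {"role": "user", "content": chunk},
        ],
        "temperature": 0.2,
    }).encode()
    req = urllib.request.Request(API_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def translate_book(text: str, target_lang: str, workers: int = 32) -> str:
    """Keep ~32 requests in flight, matching the batch-32 figure above."""
    chunks = chunk_text(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return "\n\n".join(pool.map(lambda c: translate_chunk(c, target_lang), chunks))
```

Threads are fine here despite the GIL: each worker spends almost all of its time blocked on network I/O while the GPU does the work.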
## Quality
- Major EU pairs (EN-FR, EN-DE, EN-ES): Qwen 14B or Aya 23 match DeepL for most content
- CJK (EN-ZH, EN-JA, EN-KO): Qwen 14B clearly leads
- Low-resource languages: NLLB specialised is safer
- Literary / creative: larger models pull ahead, but humans still win
## Deployment
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```
Wrap the server with a thin service that prompts the model with "Translate to [lang]: [text]", or use a more structured translation prompt.
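A minimal sketch of that prompt framing as an OpenAI-style chat payload. The system/user split, instruction wording, and low temperature are illustrative choices, not requirements of vLLM:

```python
def build_translation_request(text: str, target_lang: str,
                              model: str = "Qwen/Qwen2.5-14B-Instruct-AWQ") -> dict:
    """Build a chat-completions payload that frames the task as pure translation."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": (f"You are a professional translator. Translate the user's "
                         f"text to {target_lang}. Output only the translation, "
                         f"with no commentary.")},
            {"role": "user", "content": text},
        ],
        "temperature": 0.2,  # low temperature keeps output faithful to the source
    }
```

POST the resulting dict as JSON to the server's `/v1/chat/completions` route; putting the instruction in the system message and only the source text in the user message keeps stray instructions inside the document from being followed.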
## Self-Hosted Translation on Blackwell 16GB
Replace DeepL/API costs with flat hosting. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Qwen 2.5 guide, Aya 23, NLLB, Qwen 14B benchmark.