LLM tokenizers handle different languages with very different efficiency. English is typically the most efficient because it dominates training data; French, German, and Spanish are close behind; Asian languages (Japanese, Korean, Chinese) often need 2-3× more tokens per character. This directly affects API cost and context-window economics.
Tokens per character: English ~0.3, European languages ~0.4, Asian languages ~0.7-1.0. Per-token cost is constant, so per-character cost varies by language. For multilingual production, budget 2-3× the tokens for Asian content versus English. The Qwen tokenizer is more efficient on Chinese and Japanese than Llama's; pick by language mix.
Efficiency
Approximate tokens per character with the Llama 3 tokenizer:
- English: 0.25-0.30 tokens/char
- French / Spanish / German: 0.30-0.40
- Russian (Cyrillic): 0.50-0.60
- Arabic: 0.50-0.70
- Japanese: 0.70-0.90
- Chinese (simplified): 0.60-0.80
- Korean: 0.80-1.00
Implications: a 32K-token context fits roughly 120K English characters but only ~40K Japanese characters. Budget accordingly.
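The context-capacity math above can be sketched as a small planner. The tokens-per-character ratios are the approximations from this document, not measured values, and the language keys are illustrative:

```python
# Rough context-capacity planner based on the approximate
# tokens-per-character ratios listed above (Llama 3 tokenizer).
TOKENS_PER_CHAR = {
    "english": 0.27,
    "french": 0.35,
    "russian": 0.55,
    "chinese": 0.70,
    "japanese": 0.80,
    "korean": 0.90,
}

def chars_that_fit(context_tokens: int, language: str) -> int:
    """Estimate how many characters of a language fit in a context window."""
    return int(context_tokens / TOKENS_PER_CHAR[language])

if __name__ == "__main__":
    for lang in ("english", "japanese"):
        print(f"{lang}: ~{chars_that_fit(32_000, lang):,} chars in 32K tokens")
```

For real capacity planning, measure with the actual tokenizer on a sample of your own content rather than relying on these averages.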
Families
- Llama 3 tokenizer: English-leaning; reasonable for European; less efficient on Asian
- Qwen 2.5 tokenizer: native multilingual including strong Chinese / Japanese efficiency
- Mistral tokenizer: European-leaning
- BPE vs SentencePiece: Qwen / Llama use BPE; BGE-m3 uses SentencePiece for multilingual
Verdict
For multilingual production AI, tokenizer choice directly affects cost and effective context capacity. Prefer the Qwen 2.5 family for Asian-heavy workloads and Llama / Mistral for English-heavy ones. Budget 2-3× tokens for non-English content and size context-window plans accordingly.
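Since per-token pricing is constant, the 2-3× token multiplier translates directly into a per-character cost multiplier. A minimal sketch, assuming a hypothetical per-token price and the tokens-per-character estimates from this document:

```python
# Hypothetical cost comparison: identical per-token price, different
# tokens-per-character ratios (estimates from this document).
PRICE_PER_1K_TOKENS = 0.002  # assumed example price in USD, not a real quote

def cost_per_million_chars(tokens_per_char: float) -> float:
    """Per-character cost scales linearly with tokens per character."""
    tokens = 1_000_000 * tokens_per_char
    return tokens / 1000 * PRICE_PER_1K_TOKENS

english = cost_per_million_chars(0.27)   # ~English on Llama 3
japanese = cost_per_million_chars(0.80)  # ~Japanese on Llama 3

print(f"English:  ${english:.2f} per 1M chars")
print(f"Japanese: ${japanese:.2f} per 1M chars")
print(f"Ratio: {japanese / english:.1f}x")
```

The same characters cost roughly 3× more in Japanese than in English here, which is the budgeting multiplier the verdict recommends.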
Bottom line
Pick your tokenizer by language mix. See Qwen multilingual.