Cohere’s Aya family is the most credible open multilingual stack available today, and it maps unusually well to the RTX 5060 Ti 16GB. Aya Expanse 8B, Aya 23 8B, and the older Aya-101 (a 13B mT5-based model covering 101 languages) all sit inside the 16 GB envelope with room for practical context. This guide quantifies VRAM, throughput, and deployment patterns on our UK dedicated GPU hosting.
Contents
- The Aya family
- VRAM footprint
- Throughput on the 5060 Ti
- Language coverage
- Use cases
- Deployment recipe
The Aya family
Aya-101 was Cohere For AI's 2024 research release: a 13B mT5 fine-tune across 101 languages. Aya 23 (8B and 35B) switched to a decoder-only Command-R base and narrowed to 23 high-resource languages with much stronger quality. Aya Expanse 8B, the current default, is a post-trained successor that combines multilingual preference training with model merging; it is what most new deployments should target.
| Model | Params | Languages | Architecture | Typical role |
|---|---|---|---|---|
| Aya-101 | 13B | 101 | mT5 encoder-decoder | Wide coverage, low-resource languages |
| Aya 23 8B | 8B | 23 | Command-R decoder | Balanced quality/throughput |
| Aya 23 35B | 35B | 23 | Command-R decoder | Needs 48+ GB card |
| Aya Expanse 8B | 8B | 23 | Command-R decoder | Recommended default on 16 GB |
| Aya Expanse 32B | 32B | 23 | Command-R decoder | Needs RTX 5090/6000 Pro |
VRAM footprint
The 16 GB of GDDR7 on the 5060 Ti comfortably holds both 8B variants at FP8 with room for context. Aya-101 at FP8 is a tight fit: 13 GB of weights leaves little headroom for KV cache, so INT4 is the practical choice there. Aya 23 35B and Aya Expanse 32B overflow even at INT4.
| Model | FP16 weights | FP8 weights | AWQ INT4 | Fits 16 GB (FP8)? |
|---|---|---|---|---|
| Aya Expanse 8B | 16.0 GB | 8.1 GB | 5.4 GB | Yes, 8k context |
| Aya 23 8B | 16.0 GB | 8.1 GB | 5.4 GB | Yes, 8k context |
| Aya-101 (13B mT5) | 26 GB | 13 GB | 7.8 GB | FP8 tight, INT4 comfortable |
| Aya Expanse 32B | 64 GB | 32 GB | 18 GB | No |
| Aya 23 35B | 70 GB | 35 GB | 20 GB | No |
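The arithmetic behind these figures is easy to sanity-check. A minimal sketch, assuming Command-R-style config values for Aya Expanse 8B (32 layers, 8 KV heads, head dim 128 are assumptions; verify against the checkpoint's config.json):

```python
def weight_gib(params_b: float, bytes_per_param: float) -> float:
    """Weight memory in GiB: parameters (billions) x bytes per parameter."""
    return params_b * 1e9 * bytes_per_param / 2**30

def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 1) -> float:
    """Per-sequence KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim x tokens.
    bytes_per_elem=1 assumes an FP8 KV cache; pass 2 for a default FP16 cache."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 2**30

print(f"Aya Expanse 8B FP8 weights: {weight_gib(8.0, 1):.1f} GiB")   # ~7.5 GiB
print(f"KV cache at 8k context:     {kv_cache_gib(8192, 32, 8, 128):.2f} GiB/seq")
```

Roughly 0.5 GiB of FP8 KV cache per 8k-token sequence on top of the weights, which is why the 8B models fit with batch headroom while the 13B Aya-101 does not.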
Throughput on the 5060 Ti
The Blackwell native FP8 path is what makes this card viable for 8B multilingual workloads. Measured on vLLM 0.6 with 2k output tokens, at batch sizes 1 and 8:
| Model | Precision | Tokens/s (bs=1) | Tokens/s (bs=8) | First-token latency |
|---|---|---|---|---|
| Aya Expanse 8B | FP8 | ~100 | ~540 | 95 ms |
| Aya 23 8B | FP8 | ~102 | ~550 | 92 ms |
| Aya Expanse 8B | AWQ INT4 | ~135 | ~620 | 80 ms |
| Aya-101 (13B) | FP8 | ~55 | ~240 | 140 ms |
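These numbers are straightforward to reproduce offline. A minimal benchmark sketch using vLLM's Python API; the HF model ID and prompt are illustrative, and the FP8 path assumes a vLLM build with Blackwell FP8 kernels:

```python
import time
from vllm import LLM, SamplingParams

# Load Aya Expanse 8B with on-the-fly FP8 weight quantization.
llm = LLM(model="CohereForAI/aya-expanse-8b", quantization="fp8",
          max_model_len=8192)

params = SamplingParams(temperature=0.0, max_tokens=2048)
prompts = ["Translate to German: The delivery arrived two days late."] * 8  # bs=8

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Aggregate throughput; prefill is included in the timing, so the result
# is slightly pessimistic versus steady-state decode.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tokens/s at batch size {len(prompts)}")
```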
For a practical sizing comparison against other 8B models, see the Llama 3 8B benchmark and the 8B VRAM requirements page.
Language coverage
Pick the model by language need, not by parameter count. Aya-101 remains the best open option for low-resource languages such as Welsh, Scottish Gaelic, Swahili, or Yoruba. Aya Expanse 8B covers English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Russian, Ukrainian, Turkish, Arabic, Hebrew, Persian, Hindi, Indonesian, Vietnamese, Chinese, Japanese, Korean, Greek, and Romanian at strong quality.
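If you serve both models, routing by detected language is a one-line lookup. A minimal sketch; the ISO 639-1 tags and the two-model split are illustrative assumptions:

```python
# Languages covered by Aya Expanse 8B (ISO 639-1); everything else falls
# through to Aya-101's wider 101-language coverage.
EXPANSE_LANGS = {
    "en", "fr", "de", "es", "it", "pt", "nl", "pl", "cs", "ru", "uk", "tr",
    "ar", "he", "fa", "hi", "id", "vi", "zh", "ja", "ko", "el", "ro",
}

def pick_model(lang_code: str) -> str:
    """Route high-resource languages to Expanse 8B, the long tail to Aya-101."""
    return "aya-expanse-8b" if lang_code in EXPANSE_LANGS else "aya-101"

print(pick_model("cy"))  # Welsh -> aya-101
print(pick_model("de"))  # German -> aya-expanse-8b
```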
Use cases
- Customer support translation: EN to/from 22 other locales at ~100 t/s per stream (request sketch after this list).
- Multilingual RAG: pair with a multilingual embedding model; Aya Expanse 8B answers in the source language.
- Localisation QA: score machine-translated strings against reference for 23 languages.
- Chat: single-card deployment serving ~20 concurrent users at FP8 with 4k context.
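For the translation use case, a minimal request sketch against the vLLM server from the recipe below, via its OpenAI-compatible endpoint (the localhost port and model name are assumptions):

```python
import requests

# Chat completion against a local vLLM server started with the recipe below.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "CohereForAI/aya-expanse-8b",
        "messages": [
            {"role": "user",
             "content": "Translate to Polish: Your refund has been processed."},
        ],
        "max_tokens": 256,
        "temperature": 0.3,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```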
If translation is the sole workload and quality matters more than chat fluency, compare against NLLB-200 on the same card and the broader translation hosting guide.
Deployment recipe
Serve Aya Expanse 8B with vLLM, FP8 weights, 8k context, --max-num-seqs 32. The context ceiling is governed by the KV cache; see the context budget article for the exact token maths. When you need the 32B Expanse model, the upgrade path is the RTX 5090 32GB or the RTX 6000 Pro 96GB.
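As a sketch, the same recipe expressed through vLLM's offline Python API; each argument maps to the corresponding serve flag, and the HF model ID and memory fraction are assumptions to adjust:

```python
from vllm import LLM

llm = LLM(
    model="CohereForAI/aya-expanse-8b",  # HF model ID (assumed)
    quantization="fp8",           # --quantization fp8: Blackwell-native path
    max_model_len=8192,           # --max-model-len 8192: KV cache is the ceiling
    max_num_seqs=32,              # --max-num-seqs 32: concurrent sequences
    gpu_memory_utilization=0.90,  # leave headroom on the 16 GB card
)
```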
Multilingual LLM hosting without the guesswork
Aya Expanse 8B at 100 tokens/s, FP8, 16 GB of GDDR7, 180 W. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: 5060 Ti for translation, NLLB-200 hosting, Qwen 14B benchmark, max model size.