Yi-1.5-9B from 01.AI is a solid open mid-tier LLM with particularly strong bilingual (Chinese + English) performance. It fits comfortably on the RTX 5060 Ti 16GB on our UK dedicated GPU hosting, with headroom for decent context lengths.
Model Overview
Yi-1.5-9B-Chat is 01.AI’s updated instruction-tuned variant. 48 layers, 4 KV heads (GQA), 128 head dim, native context 4k with extensions to 16k and 32k available. Licensed under the Yi Series Model License – permissive for most commercial use with an attribution requirement.
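The architecture numbers above determine the KV-cache footprint, which is what decides how much context fits alongside the weights. A quick sketch, using only the figures stated above (48 layers, 4 KV heads via GQA, head dim 128):

```python
# KV cache bytes per token for Yi-1.5-9B, from the architecture
# numbers above: 48 layers, 4 KV heads (GQA), head dim 128.

def kv_bytes_per_token(layers=48, kv_heads=4, head_dim=128, dtype_bytes=2):
    # 2x for the separate K and V tensors stored per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

fp16_kv = kv_bytes_per_token(dtype_bytes=2)  # 98,304 B = 96 KiB/token
fp8_kv = kv_bytes_per_token(dtype_bytes=1)   # 49,152 B = 48 KiB/token

# A full 16k-token context in FP8 KV:
print(fp8_kv * 16384 / 2**30, "GiB")  # 0.75 GiB
```

GQA is doing the heavy lifting here: with only 4 KV heads, even a 16k context costs well under a gigabyte in FP8.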
VRAM and Fit
| Precision | Weights | Fits 16 GB with KV? |
|---|---|---|
| FP16 | 18 GB | Does not fit |
| FP8 E4M3 | 9.2 GB | Yes, 4 GB left for KV |
| FP8 + FP8 KV cache | 9.2 GB | Yes, 16k context comfortable |
| AWQ INT4 | 5.8 GB | Yes, 32k context workable |
| GGUF Q4_K_M | 5.2 GB | Yes |
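The weight column can be sanity-checked from the parameter count. A rough sketch, assuming ~8.8B parameters (the table's figures run slightly higher because they include embeddings and quantization overhead):

```python
# Rough weight-memory check for the table above, assuming ~8.8B params.
PARAMS = 8.8e9  # approximate parameter count (assumption)

def weights_gb(bits_per_param):
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    gb = weights_gb(bits)
    fits = "fits" if gb < 16 else "does not fit"
    print(f"{name}: {gb:.1f} GB weights -> {fits} in 16 GB")
```

FP16 lands at ~17.6 GB before any KV cache, which is why it's a non-starter on a 16 GB card, while FP8 halves that and leaves real headroom.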
Throughput
- FP8 + FP8 KV decode at batch 1: ~95 t/s
- AWQ INT4 decode at batch 1: ~115 t/s
- Aggregate at batch 16 (FP8): ~490 t/s
- Prefill (FP8): ~5,200 tokens/sec
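These numbers translate directly into request latency. A back-of-envelope sketch for a typical chat turn, using the FP8 figures above (prompt and output lengths are illustrative assumptions):

```python
# Latency estimate from the FP8 throughput numbers above.
prefill_tps = 5200  # prefill tokens/sec (FP8)
decode_tps = 95     # decode tokens/sec, batch 1 (FP8 + FP8 KV)

prompt_tokens, output_tokens = 2000, 300  # illustrative request

ttft = prompt_tokens / prefill_tps           # time to first token
total = ttft + output_tokens / decode_tps    # full response time
print(f"TTFT {ttft:.2f}s, total {total:.1f}s")  # TTFT 0.38s, total 3.5s
```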
Slightly slower than Llama 3.1 8B per token because of the larger layer count (48 vs 32), but comparable overall performance.
Deployment
```bash
python -m vllm.entrypoints.openai.api_server \
  --model 01-ai/Yi-1.5-9B-Chat \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90
```
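The server above exposes an OpenAI-compatible endpoint. A minimal client sketch using only the standard library (the URL assumes vLLM's default host and port; adjust to your deployment):

```python
# Minimal request against the OpenAI-compatible endpoint vLLM exposes.
# Host/port are vLLM defaults (an assumption -- match your setup).
import json
import urllib.request

payload = {
    "model": "01-ai/Yi-1.5-9B-Chat",
    "messages": [{"role": "user", "content": "用一句话介绍你自己。"}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:   # uncomment with the
#     print(json.load(resp))                  # server running
```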
Uses the standard apply_chat_template path in Transformers – no custom formatting required.
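For reference, the template Yi's chat models apply is ChatML-style. A hand-rolled equivalent for illustration only (verify against the tokenizer's own `apply_chat_template` output before relying on it):

```python
# Illustrative ChatML-style formatting as used by Yi chat models.
# In practice, always use tokenizer.apply_chat_template instead.

def yi_chatml(messages, add_generation_prompt=True):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)

print(yi_chatml([{"role": "user", "content": "你好"}]))
```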
Yi vs Peers at the 9-12B Tier
| Model | MMLU | HumanEval | Bilingual (zh/en) | Licence |
|---|---|---|---|---|
| Yi-1.5-9B-Chat | 69.5 | 57.3 | Very strong | Yi License (permissive) |
| Gemma 2 9B-it | 71.3 | 40.2 | English-first | Gemma Terms |
| Mistral Nemo 12B | 68.0 | 40.0 | European focus | Apache 2.0 |
| Qwen 2.5 7B-Instruct | 74.8 | 84.8 | Strong CJK | Qwen License |
When to Pick Yi
- Bilingual Chinese + English products where you want a permissive licence
- General chat where you want quality comparable to Gemma 2 9B at similar VRAM
- Long-context tasks – the 32k variant is a clean fit at AWQ INT4 + FP8 KV
- As a backup / alternative to Gemma 2 9B if its licence terms don’t suit you
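The 32k long-context claim checks out arithmetically. A sketch combining the AWQ weight figure from the table with the FP8 KV cost implied by the architecture (48 layers, 4 KV heads, head dim 128):

```python
# Fit check for 32k context: AWQ INT4 weights + full FP8 KV cache.
GIB = 2**30
weights_gib = 5.8                        # AWQ INT4, from the table
kv_per_token = 2 * 48 * 4 * 128 * 1      # bytes/token at FP8

kv_32k = 32768 * kv_per_token / GIB      # 1.5 GiB
total = weights_gib + kv_32k
print(f"{total:.1f} GiB of 16 GiB used")  # 7.3 GiB -- ample headroom
```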
For English-only workloads, Gemma 2 9B or Qwen 2.5 7B-Instruct usually wins. For stronger coding, pick Qwen 2.5 Coder 14B AWQ.
Yi 9B on Blackwell 16GB
Bilingual 9B at ~95 t/s FP8. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Gemma 2 9B benchmark, Mistral Nemo 12B, Qwen 2.5 7B, FP8 deployment, AWQ guide.