Yi 34B from 01.ai is a bilingual (English/Chinese) model with solid reasoning and up to 200k context in its extended variant. On an RTX 6000 Pro 96GB from our dedicated GPU hosting, it runs natively at FP16 with comfortable serving headroom.
VRAM
| Precision | Weights | Fits On 96 GB |
|---|---|---|
| FP16 | ~68 GB | Yes, with KV cache room |
| FP8 | ~34 GB | Yes, lots of room |
| AWQ INT4 | ~19 GB | Yes, with very high concurrency |
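The table's weight figures follow directly from the parameter count. A quick sketch, assuming Yi-34B's published ~34.4B parameters; note that raw bytes-per-parameter math slightly undershoots real quantized files, which carry scale/zero-point overhead (hence ~19 GB for AWQ INT4 rather than the raw ~17 GB):

```python
# Rough weight-memory math behind the table above.
# Assumption: ~34.4B parameters (Yi-34B's published count).
params = 34.4e9
GB = 10**9

for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("AWQ INT4", 0.5)]:
    # Raw weight bytes only; quantized formats add scales/zero points on top.
    print(f"{name}: ~{params * bytes_per_param / GB:.0f} GB")
```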
Deployment
python -m vllm.entrypoints.openai.api_server \
--model 01-ai/Yi-1.5-34B-Chat \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching
bfloat16 is preferable to float16 for Yi – the model was trained in bf16, and matching that dtype avoids the numerical instability float16's narrower exponent range can introduce.
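To see what `--gpu-memory-utilization 0.92` leaves for the KV cache, here is a back-of-envelope sketch. The architecture numbers (60 layers, 8 KV heads, head_dim 128) are assumptions taken from Yi-34B's published config; verify them against your checkpoint, and note this ignores a few GB of activation overhead:

```python
# KV-cache budget for the serving command above (assumed Yi-34B config:
# 60 layers, GQA with 8 KV heads of dim 128, bf16 weights and cache).
GB = 10**9

layers, kv_heads, head_dim = 60, 8, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V, 2 bytes each

budget = 96 * GB * 0.92      # --gpu-memory-utilization 0.92
weights = 34.4e9 * 2         # ~34.4B params at bf16
kv_pool = budget - weights   # simplification: ignores activation overhead

tokens = kv_pool / kv_bytes_per_token
print(f"{kv_bytes_per_token / 1e6:.2f} MB/token, KV pool ≈ {tokens / 1000:.0f}k tokens")
print(f"≈ {tokens / 32768:.1f} concurrent full 32k contexts")
```

In practice most requests use far less than the full 32k window, so real concurrency is considerably higher than the worst-case figure.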
Long-Context Variant
Yi-34B-200K extends context to 200,000 tokens. The KV cache at 200k is enormous – with Yi's GQA layout (8 KV heads across 60 layers), a single full-context sequence needs on the order of 50 GB of FP16 KV cache, which on top of ~69 GB of weights does not fit in 96 GB at all. Use --kv-cache-dtype fp8 to halve the cache to roughly 25 GB. Even then, multi-user 200k on a 96 GB card is unrealistic; treat it as a single-request-at-a-time workload.
python -m vllm.entrypoints.openai.api_server \
--model 01-ai/Yi-34B-200K \
--max-model-len 200000 \
--kv-cache-dtype fp8 \
--max-num-seqs 2
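The arithmetic behind the single-request recommendation, again assuming Yi-34B's published GQA layout (60 layers, 8 KV heads, head_dim 128):

```python
# Why 200k context is a one-sequence-at-a-time workload on 96 GB.
# Assumed config: 60 layers, 8 KV heads, head_dim 128, ~34.4B params.
GB = 10**9
kv_per_token_fp16 = 2 * 60 * 8 * 128 * 2   # K+V, 2 bytes each
ctx = 200_000

fp16_kv = ctx * kv_per_token_fp16 / GB     # full-context cache at FP16
fp8_kv = fp16_kv / 2                       # with --kv-cache-dtype fp8
weights = 34.4e9 * 2 / GB                  # FP16 weights

print(f"FP16 KV: {fp16_kv:.0f} GB, FP8 KV: {fp8_kv:.0f} GB")
print(f"weights + FP8 KV ≈ {weights + fp8_kv:.0f} GB of 96 GB")
```

One full-context sequence with an fp8 cache just squeezes in; a second cannot, which is why `--max-num-seqs 2` only helps when requests are well short of the 200k maximum.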
When Yi Fits
Pick Yi 34B when:
- You need bilingual English/Chinese performance
- You need genuinely long context (use the 200K variant)
- You want a 34B-class model that is not Qwen or Gemma
For pure English workloads Qwen 2.5 72B and Llama 3.3 70B typically edge Yi out on reasoning benchmarks.
See Mistral Nemo 12B for a smaller long-context option.