01.AI’s Yi-34B-Chat sits in an awkward but valuable spot in the open-weight landscape: too large for FP16 on any consumer card, too small to need tensor parallel, and one of the strongest bilingual English/Chinese models you can self-host. On a single RTX 4090 24GB dedicated server from Gigagpu UK hosting, AWQ INT4 weights occupy about 19 GB, FP8 KV cache is non-negotiable, and decode lands at roughly 55 tokens per second. The fit is genuinely tight – tighter than any other model in this guide series – and the operational margin between a stable serving config and an OOM at the first long prompt is measured in single gigabytes. This guide walks through the VRAM math, the throughput envelope, the configuration that actually holds up under load, and the production gotchas that bite operators who treat Yi-34B as if it were a 14B model with extra parameters.
Contents
- Why Yi-34B on a 4090
- Architecture and VRAM math
- Throughput, latency and context budget
- Bilingual quality benchmarks
- Deployment configuration
- Production gotchas
- Cost per million tokens
- Verdict and alternatives
Why Yi-34B on a 4090
What 01.AI built and why it matters
Yi-34B-Chat is a 34B-parameter dense transformer with 60 layers, 56 attention heads, 8 KV heads under grouped-query attention, 128-dimensional head_dim, and a native 4k context window (extended editions go further but at quality cost). 01.AI trained it on roughly 3.1 trillion tokens of bilingual English/Chinese data, with the chat variant tuned through a carefully balanced SFT mixture and DPO pass. The Yi license is permissive for commercial use with a registration step. Architecturally, the model is closer to Llama 2 than Llama 3 – no RoPE-theta tricks, no extended GQA group sizing, no native function-calling tokens – which makes it predictable to deploy but also means the prefill rate is meaningfully lower than what a comparable Qwen 2.5 32B build will give you.
Why the 4090 is the only consumer card that works
The Ada AD102 die in the RTX 4090 brings 16,384 CUDA cores, 24 GB of GDDR6X at 1,008 GB/s, 72 MB of L2 cache, native 4th-generation FP8 tensor cores at roughly 660 TFLOPS dense, and a 450 W TDP. For Yi-34B AWQ INT4, the bandwidth figure is the headline number: the model has to stream 19 GB of weights through the memory bus once per token, so the theoretical decode ceiling is 1,008 / 19 ≈ 53 tokens/sec, within a few percent of the measured 55 t/s. No 16 GB consumer card fits the weights, which rules out the 5080. The 3090 fits the weights but lacks native FP8 KV (FP16 KV alone consumes 1.2 GB at 4k and overflows at 8k). Only the 4090, the 5090, and the 24-32 GB workstation cards have the headroom.
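To keep that arithmetic honest, here is a minimal sketch of the bandwidth-bound estimate; the 19 GB and 1,008 GB/s figures come from the spec and tables in this guide, and the calculation deliberately ignores KV-cache reads and kernel launch overhead.

# Back-of-envelope decode ceiling for a bandwidth-bound model.
# Figures taken from the text above: 19 GB of AWQ INT4 weights,
# 1,008 GB/s of GDDR6X bandwidth on the RTX 4090.
weights_gb = 19.0        # AWQ INT4 weight footprint
bandwidth_gbs = 1008.0   # RTX 4090 memory bandwidth

# Every decoded token streams the full weight set through the memory bus once,
# so the ceiling is simply bandwidth divided by weight size.
ceiling_tps = bandwidth_gbs / weights_gb
print(f"theoretical decode ceiling: {ceiling_tps:.0f} tokens/sec")  # ~53 t/s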
Architecture and VRAM math
Format-by-format footprint
| Precision | Weights | KV @ 4k | KV @ 8k | KV @ 16k | Activations | Total @ 8k | Fits 24 GB? |
|---|---|---|---|---|---|---|---|
| FP16 / BF16 | 68.0 GB | 0.6 GB | 1.2 GB | 2.4 GB | 1.5 GB | 70.7 GB | No |
| FP8 W8A8 | 34.0 GB | 0.6 GB | 1.2 GB | 2.4 GB | 1.5 GB | 36.7 GB | No |
| AWQ INT4 + FP16 KV | 19.0 GB | 1.2 GB | 2.4 GB | 4.8 GB | 1.5 GB | 22.9 GB | Yes (no headroom) |
| AWQ INT4 + FP8 KV | 19.0 GB | 0.6 GB | 1.2 GB | 2.4 GB | 1.5 GB | 21.7 GB | Yes |
| AWQ INT4 + FP8 KV @ 12k | 19.0 GB | — | — | 1.8 GB (@ 12k) | 1.5 GB | 22.3 GB (@ 12k) | Yes (tight) |
| GPTQ INT4 + FP8 KV | 19.4 GB | 0.6 GB | 1.2 GB | 2.4 GB | 1.5 GB | 22.1 GB | Yes (tighter) |
| GGUF Q4_K_M + FP16 KV | 20.5 GB | 1.2 GB | 2.4 GB | — | 1.0 GB | 23.9 GB | Yes (no headroom) |
What the table tells you operationally
The FP16 KV path leaves no room for activations once you pass an 8k prompt; OOM hits on the first long-context request. AWQ INT4 with FP8 KV is the only configuration with realistic operational headroom for a serving deployment, and even that is tight enough that you cannot also run a Whisper sidecar or a small embedding model on the same card. For comparison, see how the much larger Llama 70B INT4 fits with 17 GB of weights, and how Qwen 2.5 32B at 18 GB AWQ has fractionally more headroom thanks to a tighter KV layout.
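Those KV figures fall straight out of the architecture constants quoted earlier (60 layers, 8 KV heads, head_dim 128). A minimal sketch of the arithmetic follows; it slightly undershoots the table because paged-KV block allocation and per-request overhead in vLLM add a margin on top of the raw tensor size.

# Raw KV-cache tensor size for Yi-34B from the architecture constants above.
layers, kv_heads, head_dim = 60, 8, 128

def kv_gb(context_tokens, bytes_per_elem):
    # 2x for K and V; one head_dim vector per layer per KV head per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 1024**3

print(f"FP8 KV @ 8k:  {kv_gb(8192, 1):.2f} GB")   # ~0.94 GB raw
print(f"FP16 KV @ 8k: {kv_gb(8192, 2):.2f} GB")   # ~1.88 GB raw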
Throughput, latency and context budget
Aggregate batch sweep, AWQ INT4 + FP8 KV
| Batch | Aggregate t/s | Per-user t/s | p50 TTFT (1k prompt) | p99 TTFT | p50 inter-token | p99 inter-token |
|---|---|---|---|---|---|---|
| 1 | 55 | 55 | 280 ms | 340 ms | 18.2 ms | 22.1 ms |
| 2 | 95 | 47 | 340 ms | 460 ms | 21.0 ms | 26.4 ms |
| 4 | 148 | 37 | 520 ms | 720 ms | 27.1 ms | 36.0 ms |
| 6 | 175 | 29 | 740 ms | 1,080 ms | 34.5 ms | 48.0 ms |
| 8 | OOM at 8k | — | — | — | — | — |
TTFT vs prompt length, batch 1
| Prompt tokens | TTFT | Prefill rate | VRAM at end of prefill | Notes |
|---|---|---|---|---|
| 256 | 110 ms | 2,330 t/s | 20.6 GB | Short turn |
| 1,024 | 280 ms | 3,660 t/s | 20.8 GB | Typical RAG |
| 2,048 | 540 ms | 3,790 t/s | 21.0 GB | Bilingual brief |
| 4,096 | 1,180 ms | 3,470 t/s | 21.4 GB | Long doc |
| 8,192 | 2,650 ms | 3,090 t/s | 22.0 GB | Use chunked prefill |
| 12,288 | 4,400 ms | 2,790 t/s | 22.6 GB | Practical ceiling |
| 16,384 | OOM | — | — | Activations push past 24 GB |
Cross-card decode comparison, Yi-34B AWQ INT4
| GPU | VRAM | Decode b=1 | Max workable context | Aggregate b=4 |
|---|---|---|---|---|
| RTX 5090 32GB | 32 GB GDDR7 | 92 t/s | 32k | 320 t/s |
| RTX 4090 24GB | 24 GB GDDR6X | 55 t/s | 12k | 148 t/s |
| RTX 3090 24GB | 24 GB GDDR6X | 42 t/s (FP16 KV) | 4k | OOM at 4 |
| H100 80GB | 80 GB HBM3 | 140 t/s | 32k+ | 520 t/s |
| A100 80GB | 80 GB HBM2e | 78 t/s | 32k+ | 290 t/s |
The 4090 is the price-per-token sweet spot for self-hosted Yi-34B in the UK, but per-user inter-token latency is roughly 7 ms higher than on the 5090 (18.2 ms vs ~11 ms at batch 1) because of the bandwidth differential. For a side-by-side analysis see 4090 vs 5090 and the 4090 vs H100 cost-per-token comparison.
Bilingual quality benchmarks
Public benchmark scores
| Benchmark | Yi-34B-Chat | Qwen 2.5 32B Inst | Llama 3.1 70B Inst | Mistral Small 24B |
|---|---|---|---|---|
| MMLU (English) | 76.3 | 83.3 | 83.6 | 72.0 |
| C-Eval (Chinese) | 81.4 | 87.7 | 52.3 | 51.5 |
| CMMLU (Chinese) | 83.7 | 89.0 | 61.5 | 54.0 |
| GSM8K | 67.6 | 95.9 | 95.1 | 74.5 |
| HumanEval (Python) | 60.4 | 86.0 | 80.5 | 72.0 |
| MT-Bench (avg) | 8.4 | 8.7 | 9.0 | 8.3 |
Why people still pick Yi-34B in 2026
On raw scores Qwen 2.5 32B has overtaken Yi-34B on every benchmark. Three reasons people still pick Yi: a different stylistic register on Chinese output that some markets prefer, a rich and stable ecosystem of fine-tunes (Yi-VL, Yi-Coder, Nous-Yi, dolphin-yi), and a permissive license that some legal teams find easier to clear than Qwen’s. If you have an existing Yi-tuned LoRA stack the migration cost to Qwen is non-trivial, so Yi remains a defensible choice for many production deployments.
Deployment configuration
vLLM launch (AWQ INT4, 8k context, 4 concurrent users)
python -m vllm.entrypoints.openai.api_server \
--model 01-ai/Yi-34B-Chat \
--quantization awq_marlin \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 8192 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--enable-chunked-prefill \
--enforce-eager \
--swap-space 0
vLLM launch (12k context, single user, max quality)
python -m vllm.entrypoints.openai.api_server \
--model 01-ai/Yi-34B-Chat \
--quantization awq_marlin \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 12288 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.96 \
--enforce-eager
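Either launch exposes vLLM's OpenAI-compatible API (port 8000 by default), so any OpenAI client can talk to it. A minimal smoke test, assuming a local deployment and a placeholder API key:

# Smoke test against the vLLM OpenAI-compatible server started above.
# localhost:8000 is vLLM's default bind; the api_key value is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="01-ai/Yi-34B-Chat",
    messages=[{"role": "user", "content": "Introduce yourself in one English sentence and one Chinese sentence."}],
    max_tokens=128,
    temperature=0.7,
)
print(resp.choices[0].message.content)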
Test rig and methodology
All numbers above were captured on a single-tenant Gigagpu node: RTX 4090 24GB Founders Edition at stock 450 W, Ryzen 9 7950X with 64 GB DDR5-5600, Samsung 990 Pro 2TB Gen 4 NVMe with the model files mounted on a dedicated dataset; Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6, vLLM 0.6.4, PyTorch 2.5, FlashAttention 2.6. Throughput numbers are sustained means over 60-second windows after warm-up; concurrency was driven by a custom asyncio harness wrapped around vLLM’s benchmark_throughput.py with prompts drawn from a 50/50 English/Chinese mix sampled from ShareGPT v3 and BELLE. KV pressure was tracked via nvidia-smi dmon -s u at 1 Hz to validate the headroom claims. See our vLLM setup guide for the full installation walkthrough and our AWQ quantisation guide for calibration notes that apply specifically to Yi-34B.
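For readers who want to reproduce the concurrency numbers without the full harness, the shape of the measurement is simple: fire N requests at once and divide the completion tokens by wall-clock time. The sketch below illustrates that approach only; it is not the exact harness used for the tables above, and the endpoint, prompt, and token budget are placeholder choices.

# Illustrative concurrency driver: N simultaneous requests, aggregate tokens/sec.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request(prompt):
    resp = await client.chat.completions.create(
        model="01-ai/Yi-34B-Chat",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def sweep(batch):
    prompts = ["Summarise the advantages of grouped-query attention."] * batch
    start = time.perf_counter()
    tokens = sum(await asyncio.gather(*(one_request(p) for p in prompts)))
    elapsed = time.perf_counter() - start
    print(f"batch={batch}: {tokens / elapsed:.0f} aggregate tokens/sec")

asyncio.run(sweep(4))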
Production gotchas
- FP8 KV is mandatory, not optional. The FP16 KV path will OOM on the first 4k+ prompt because activations and prefix-cache blocks need ~3 GB of working memory that FP16 KV does not leave. People test with short prompts in dev, ship to production, and discover this on day one.
- Disable CUDA graphs with --enforce-eager. Yi-34B at the VRAM ceiling is sensitive to allocator pressure; CUDA graphs reserve a working set that intermittently evicts the kernel cache and causes 30-50 ms latency spikes. The cost of --enforce-eager is roughly 0.8 t/s; the benefit is stability.
- Pre-warm the model before opening to traffic. The first few prompts will hit cold Marlin kernels; expected TTFT for warm runs is 280 ms but the first prompt can be 1,200 ms. A 1k-token warm-up loop in your readiness probe avoids client-visible spikes (a sketch follows this list).
- Prefix caching across bilingual prompts has lower hit rates than monolingual. If you mix English and Chinese system prompts heavily, the prefix cache buys less than it does for an English-only Llama deployment. Pin one system prompt per worker and route by header.
- The Yi tokeniser is not the Llama tokeniser. Yi uses its own SentencePiece BPE with a 64,000-token vocabulary tuned for Chinese density. Token counts will be ~30% lower than Llama for the same Chinese text and ~5% higher for the same English text. Update your accounting accordingly.
- Spread multi-tenant workloads across cards rather than batching. The KV pressure ceiling is so close that a single misbehaving multi-turn user can starve other tenants. Treat one 4090 as one customer slot for production-critical Yi-34B work.
- Watch sustained VRAM in nvidia-smi. Steady-state usage of 23+ GB is normal for this configuration. If you see usage drop below 21 GB it usually means vLLM has shrunk the KV pool because of fragmentation; restart the worker.
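The warm-up item above can live directly in the worker's readiness probe. A minimal sketch, assuming the same local vLLM endpoint as the launch commands and a filler prompt of roughly 1k tokens (both placeholder choices):

# Readiness-probe warm-up: one ~1k-token prompt so the first real request
# does not land on cold Marlin kernels. Exit code 0 = ready, 1 = not ready.
import sys
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def warm_up():
    filler = "The quick brown fox jumps over the lazy dog. " * 110  # roughly 1k tokens
    try:
        client.chat.completions.create(
            model="01-ai/Yi-34B-Chat",
            messages=[{"role": "user", "content": filler + "Reply with OK."}],
            max_tokens=8,
        )
        return True
    except Exception:
        return False  # keep the probe failing until the server answers

if __name__ == "__main__":
    sys.exit(0 if warm_up() else 1)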
Cost per million tokens
| Workload | Sustained t/s | Tokens / 24h | Energy / Mtok | £ / Mtok (UK power 28p/kWh) |
|---|---|---|---|---|
| Single-user chat (b=1) | 55 | 4.75M | 2.27 kWh | £0.64 |
| Light concurrency (b=2) | 95 | 8.21M | 1.31 kWh | £0.37 |
| Steady multi-user (b=4) | 148 | 12.79M | 0.84 kWh | £0.24 |
| Burst load (b=6) | 175 | 15.12M | 0.71 kWh | £0.20 |
Compared to API alternatives at roughly £8-12 per million output tokens for similar 30B-class quality, the 4090 path pays back within days for any sustained workload. See the broader monthly hosting cost and tokens per watt analyses for the wider economics, and vs OpenAI API cost for the break-even calculation.
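The £/Mtok column is plain arithmetic from sustained throughput, the 450 W card power quoted earlier, and the 28p/kWh tariff. The sketch below reproduces it, with the assumption that the node draws the GPU's TDP and nothing more (CPU and PSU overhead are ignored).

# Reproduce the cost table from sustained throughput and a 450 W draw.
POWER_KW = 0.450   # RTX 4090 TDP only; host overhead ignored (assumption)
TARIFF = 0.28      # UK power, GBP per kWh

for label, tps in [("b=1", 55), ("b=2", 95), ("b=4", 148), ("b=6", 175)]:
    mtok_per_day = tps * 86_400 / 1e6
    kwh_per_mtok = POWER_KW * 24 / mtok_per_day
    print(f"{label}: {mtok_per_day:.2f} Mtok/day, "
          f"{kwh_per_mtok:.2f} kWh/Mtok, £{kwh_per_mtok * TARIFF:.2f}/Mtok")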
Verdict and alternatives
Yi-34B-Chat AWQ on a 4090 is the right call when you need bilingual EN/CN at 30B-class quality, your traffic profile is one to four concurrent users with sub-12k contexts, and you have a deployment reason (existing fine-tunes, license preferences, stylistic register) to choose Yi over Qwen. If you are starting from a clean slate, Qwen 2.5 32B AWQ wins on raw quality at almost identical VRAM. If you can handle slightly less bilingual capability for much higher throughput, Mixtral 8x7B at 85 t/s is meaningfully snappier. For 70B-class reasoning on the same card see the Llama 3 70B INT4 model guide. If you outgrow the 24 GB envelope, the natural next step is the 5090 32GB upgrade which lets Yi-34B breathe with a 32k context.
Deploy Yi-34B on a UK RTX 4090
AWQ INT4 with FP8 KV – bilingual EN/CN at 55 t/s decode, 12k workable context, single-tenant UK dedicated hosting.
Order the RTX 4090 24GB
See also: Qwen 2.5 32B model guide, Qwen 32B benchmark, Llama 70B INT4 deployment, AWQ quantisation guide, vLLM setup, prefill/decode benchmark, concurrent users.