DeepSeek-Coder-V2-Lite-Instruct is the smaller MoE sibling of DeepSeek’s flagship 236B coder, and it is one of the most interesting models in the open-coder ecosystem because of its architectural extremism: 15.7B total parameters but only 2.4B active per token thanks to top-6-of-64 expert routing plus 2 always-on shared experts. FP8 weights at roughly 16 GB fit the RTX 4090 24GB easily, and the tiny active set means 120 t/s decode at FP8, 145 t/s at AWQ – faster than dense 7B models on the same card while delivering coding quality competitive with 14B dense coders. Apache 2.0 licence. This guide on our UK dedicated GPU hosting covers the MoE economics, deployment, and the seven gotchas that catch teams who treat it like a dense model.
Contents
- MoE economics: 16B total, 2.4B active
- Architecture, routing and KV math
- VRAM accounting across precisions
- Throughput, latency and concurrency
- Quality benchmarks vs alternatives
- vLLM deployment and code
- Production gotchas (seven items)
- When to pick this over alternatives
- Verdict
MoE economics: 16B total, 2.4B active
DeepSeek-Coder-V2-Lite has 64 routed experts plus 2 shared (always-on) experts per layer, with top-6 routing per token. Total parameters are 15.7B; active parameters per forward pass are 2.4B. The card holds the full weight set in VRAM, but decode bandwidth – the bottleneck on a 1008 GB/s 4090 – scales with the active set. Net effect: a 16B-class model runs at the speed of a 2-3B dense model, which is dramatically faster than any dense coder of comparable quality.
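A back-of-envelope sketch of why the active set, not the total, bounds decode speed. It treats decode as purely bandwidth-limited and ignores KV-cache traffic, attention and kernel efficiency, so the figures are ceilings, not predictions of the measured 120 t/s:
# Rough decode ceiling if each token only had to stream the active weights once.
# Ignores KV-cache reads, attention and kernel overheads -- an upper bound only.
BANDWIDTH_GB_S = 1008                     # RTX 4090 memory bandwidth
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1, "int4": 0.5}

def decode_ceiling_tps(active_params_billion: float, precision: str) -> float:
    bytes_per_token = active_params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return BANDWIDTH_GB_S * 1e9 / bytes_per_token

print(round(decode_ceiling_tps(2.4, "fp8")))    # ~420 t/s ceiling -- MoE active set
print(round(decode_ceiling_tps(15.7, "fp8")))   # ~64 t/s ceiling -- if all 15.7B were read per token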
The Apache 2.0 licence is the second key advantage. Where Codestral 22B requires a paid commercial licence and Qwen Coder uses the Qwen Community License, DeepSeek-Coder-V2-Lite has clean Apache 2.0 with no commercial restrictions. For startups and SaaS builders, this combination – MoE speed, dense-equivalent quality, permissive licence – is hard to beat for high-RPS coding workloads.
Architecture, routing and KV math
DeepSeek-Coder-V2-Lite is a 27-layer transformer (fewer layers than Qwen 14B’s 48) with 16 query heads and 16 KV heads (no GQA – 1:1 ratio, which is unusual; the lite version skipped GQA where the 236B parent uses MLA). Hidden dim is 2,048, with each MoE layer expanding to 64 routed experts of intermediate dim 1,408 each, plus 2 shared experts of intermediate dim 1,408. Native context is 128k via DeepSeek’s custom YaRN extension; we recommend 32k for production use to keep KV memory bounded. Vocabulary is 102,400 tokens.
KV cache per token at FP16: 16 KV heads x 128 dim x 27 layers x 2 (K+V) x 2 bytes = 221 KB. At FP8 KV: 110.5 KB. So 32k context costs roughly 3.5 GB of FP8 KV per sequence – moderate. The lack of GQA means KV is heavier per token than Qwen Coder 14B, which uses grouped-query attention to compress its KV cache.
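The same arithmetic in a few lines, using the head and layer figures above; these results feed the VRAM and concurrency tables below:
# KV-cache accounting as described above: 16 KV heads x 128 head dim x 27 layers,
# K and V stored separately, no GQA compression.
KV_HEADS, HEAD_DIM, LAYERS = 16, 128, 27

def kv_bytes_per_token(dtype_bytes: int) -> int:
    return KV_HEADS * HEAD_DIM * LAYERS * 2 * dtype_bytes    # x2 for K and V

print(kv_bytes_per_token(2) / 1e3)            # ~221 KB per token at FP16
print(kv_bytes_per_token(1) / 1e3)            # ~110.5 KB per token at FP8
print(kv_bytes_per_token(1) * 32_768 / 1e9)   # ~3.6 GB per 32k sequence at FP8 (rounded to 3.5 GB in the tables)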
VRAM accounting across precisions
| Component | FP16 | FP8 (W8A8) | AWQ INT4 |
|---|---|---|---|
| Weights (full expert set) | 32 GB | 16 GB | 9 GB |
| KV @ 32k FP8, 1 seq | 3.5 GB | 3.5 GB | 3.5 GB |
| CUDA graphs + workspace | 1.6 GB | 1.6 GB | 1.6 GB |
| vLLM scheduler + activations | 1.4 GB | 1.4 GB | 1.4 GB |
| Total at batch 1, 32k | OOM (38.5 GB) | 22.5 GB | 15.5 GB |
| Headroom under 24 GB | n/a | 1.5 GB | 8.5 GB |
FP8 is the natural fit – the model is small enough that you do not need INT4 just to fit, and FP8 quality is essentially lossless on coding benchmarks. For higher concurrency, drop to AWQ, which leaves 8.5 GB free. The lack of GQA punishes long-context concurrency: 4 sequences at 32k use 14 GB of KV alone at FP8, which forces concurrency caps lower than on dense GQA models – see the budget table and the sketch after it.
| Concurrency budget (FP8 + FP8 KV) | Avg context | KV total | Verdict |
|---|---|---|---|
| 1 seq | 32k | 3.5 GB | Comfortable, 1.5 GB free |
| 4 seqs | 4k | 1.75 GB | Comfortable |
| 8 seqs | 4k | 3.5 GB | Comfortable |
| 16 seqs | 2k | 3.5 GB | Tight – cap at 12 |
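A hypothetical sanity check on the budget table above. The ~5 GB KV budget is an assumption (the 3.5 GB of KV in the VRAM table plus the 1.5 GB of headroom at FP8 weights); check vLLM’s startup log for the real KV-cache allocation on your box:
# How many sequences of a given average context fit in the KV budget alone.
# KV_BUDGET_GB is an assumption (~5 GB at FP8 weights); scheduler and activation
# overheads lower the practical cap, which is why the table stops at 12.
KV_PER_TOKEN_FP8 = 16 * 128 * 27 * 2 * 1      # ~110.5 KB, from the math above
KV_BUDGET_GB = 5.0

def max_seqs(avg_context_tokens: int) -> int:
    return int(KV_BUDGET_GB * 1e9 // (KV_PER_TOKEN_FP8 * avg_context_tokens))

for ctx in (32_768, 4_096, 2_048):
    print(f"{ctx:>6} tokens avg context -> {max_seqs(ctx)} sequences (KV only)")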
Throughput, latency and concurrency
vLLM 0.6.3, FP8 KV, prompt 1024, generate 256:
| Quant | Batch 1 t/s | Batch 4 | Batch 8 | TTFT b1 (ms) |
|---|---|---|---|---|
| FP8 W8A8 | 120 | 410 | 620 | 95 |
| AWQ INT4 (Marlin) | 145 | 490 | 720 | 80 |
| FP16 (will not fit at 32k) | n/a | n/a | n/a | n/a |
The combination of MoE sparsity and the 4090’s bandwidth gives this model some of the best latency-per-quality ratios in the open ecosystem – 145 t/s at AWQ against Qwen Coder 14B’s 135 t/s (at only slightly lower quality than Qwen), and against Mixtral 8x7B’s 85 t/s (at far higher coding quality than Mixtral). Prefill at 8k, batch 1: FP8 measures 3,800 t/s, AWQ 4,200 t/s.
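To reproduce the decode numbers against your own server, a minimal client-side timing sketch – it assumes the FP8 launch command from the deployment section below is running on localhost with the default port, and it approximates token count by streamed chunk count:
# Rough client-side throughput check against a local vLLM server (assumptions:
# server from the FP8 command below, default port, roughly one token per streamed chunk).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first = None
chunks = 0
for chunk in client.chat.completions.create(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=256,
    temperature=0.0,
    stream=True,
):
    first = first or time.perf_counter()
    chunks += 1

print(f"TTFT ~{(first - start) * 1000:.0f} ms")
print(f"decode ~{chunks / (time.perf_counter() - first):.0f} t/s (chunk count ~ tokens)")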
Quality benchmarks vs alternatives
| Benchmark | DeepSeek Coder V2 Lite | Qwen 2.5 Coder 14B | Codestral 22B | Qwen 2.5 Coder 32B |
|---|---|---|---|---|
| HumanEval | 81.1 | 83.5 | 81.1 | 92.7 |
| MBPP | 78.4 | 78.4 | 78.2 | 87.0 |
| MultiPL-E avg (18 langs) | 71.4 | 71.2 | 61.8 | 78.5 |
| LiveCodeBench | 24.3 | 22.8 | 20.2 | 31.4 |
| Active params per token | 2.4B | 14B | 22B | 32B |
| 4090 single-card t/s (best quant) | 145 | 135 | 80 | 65 |
| Licence | Apache 2.0 | Apache equiv | MNPL paid | Apache equiv |
For raw quality the dense Qwen Coder 14B is slightly ahead on HumanEval (83.5 vs 81.1) but DeepSeek wins MultiPL-E and LiveCodeBench. The headline is the throughput-per-quality ratio: DeepSeek delivers 145 t/s at quality close to Qwen Coder 14B’s 135 t/s, with Apache 2.0 licensing and lower KV cost from a smaller model.
vLLM deployment and code
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--max-num-seqs 8 \
--enable-chunked-prefill --enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--trust-remote-code
--trust-remote-code is mandatory – DeepSeek ships custom MoE routing code in the model repository that vLLM must execute. The native context is 128k via DeepSeek YaRN, but 32k is plenty for nearly all coding work and saves KV memory. For higher concurrency, switch to AWQ:
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/DeepSeek-Coder-V2-Lite-Instruct-AWQ \
--quantization awq_marlin \
--kv-cache-dtype fp8 \
--max-model-len 16384 --max-num-seqs 16 \
--enable-chunked-prefill --enable-prefix-caching \
--gpu-memory-utilization 0.93 \
--trust-remote-code
Chat template uses DeepSeek’s specific format; tokenizer.apply_chat_template handles it correctly:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    trust_remote_code=True,
)
msgs = [
{"role":"user","content":"Write a Python function to detect a palindrome."}
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# <|begin▁of▁sentence|>User: Write a Python...\n\nAssistant:
FIM uses the standard <|fim▁begin|> / <|fim▁hole|> / <|fim▁end|> tokens (note the special-character delimiters – copy directly, do not retype). See the vLLM setup guide for trust-remote-code prerequisites and the coding assistant guide for editor integration.
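A sketch of a FIM request through vLLM’s raw completions endpoint, assuming the prefix-before-hole ordering of DeepSeek’s FIM format; the token strings must carry the exact U+FF5C and U+2581 characters, so copy them from the tokenizer (or this snippet) rather than retyping:
# Hypothetical FIM call via the completions (not chat) endpoint. Token strings use
# U+FF5C and U+2581; prefix-before-hole ordering is assumed from DeepSeek's FIM format.
from openai import OpenAI

FIM_BEGIN, FIM_HOLE, FIM_END = "<｜fim▁begin｜>", "<｜fim▁hole｜>", "<｜fim▁end｜>"

prefix = "def is_palindrome(s: str) -> bool:\n    "
suffix = "\n\nprint(is_palindrome('level'))\n"
prompt = f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
out = client.completions.create(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    prompt=prompt,
    max_tokens=64,
    temperature=0.0,
)
print(out.choices[0].text)    # the code that fills the hole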
Production gotchas (seven items)
- Custom routing code required: --trust-remote-code is mandatory because DeepSeek’s MoE router uses custom Python that ships with the model. Pin the model commit hash for reproducibility – upstream changes can subtly affect routing.
- No GQA hurts long-context concurrency: the 1:1 KV ratio means KV memory is heavier per token than Qwen’s GQA models. Cap concurrency below what you would expect for a 16B model.
- FIM token delimiters are special chars: <|fim▁begin|> uses U+FF5C and U+2581, not regular ASCII. Editor plugins that retype these tokens fail silently. Always copy the exact bytes.
- Expert routing imbalance under adversarial inputs: at batch 1 with very repetitive prompts, tokens can collapse onto the same few experts, serialising the expert pipeline. Not a problem in normal coding workloads.
- YaRN to 128k is not production-grade: long-context retrieval quality drops past 32k. Run at 32k and chunk-summarise larger files.
- Model commit pinning: DeepSeek occasionally pushes minor updates to the routing weights. Pin --revision to a known-good commit for production stability – a pre-fetch sketch follows this list.
- Power and thermal at sustained batch 16: the 4090 draws 440-450 W. Cap to 410 W with nvidia-smi -pl 410 – see the power draw analysis.
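One way to make the commit pin concrete: pre-fetch a known-good snapshot with huggingface_hub so the serving host never silently picks up new routing weights, then pass the same hash to vLLM’s --revision flag. The revision value below is a placeholder:
# Pre-fetch a pinned snapshot; pass the same commit to vLLM via --revision.
# PINNED_REVISION is a placeholder -- record the commit you actually validated.
from huggingface_hub import snapshot_download

PINNED_REVISION = "<known-good-commit-sha>"

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    revision=PINNED_REVISION,
)
print("weights cached at:", local_path)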
When to pick this over alternatives
Pick DeepSeek-Coder-V2-Lite over Qwen 2.5 Coder 14B when raw throughput is the headline (high-RPS code SaaS) and Apache 2.0 licensing matters. Pick it over Codestral 22B on every dimension – faster, comparable quality, free commercial licence. Pick it over Mixtral 8x7B for any code workload – vastly better at coding while running faster.
Step up to Qwen 2.5 Coder 32B when raw quality on hard tasks matters more than throughput. Step laterally to Qwen Coder 14B when FIM quality (RepoBench-style) is the focus rather than raw HumanEval. Step down only if you need extreme RPS – DeepSeek-Coder-V2-Lite is already the fastest competitive coder.
Verdict
DeepSeek-Coder-V2-Lite-Instruct on the RTX 4090 24GB is the highest-throughput competitive open coder on a single card in 2026. 145 t/s AWQ at quality close to Qwen Coder 14B, 32k context, Apache 2.0 licence, MoE sparsity that makes it feel like a 2-3B dense model. The trade-offs are manageable: trust-remote-code requirement, no GQA limiting long-context concurrency, FIM token delimiters that bite naive integrations. For high-RPS code SaaS, batch refactoring pipelines, or any team where throughput per dollar drives the decision, this is a default 4090 choice. For absolute peak quality pick Qwen Coder 32B; for the strongest 14B alternative pick Qwen Coder 14B. Compare with the RTX 5090 32GB if you need higher concurrency.
Deploy DeepSeek-Coder-V2-Lite on a UK RTX 4090
16B MoE, 2.4B active, 145 t/s AWQ. Apache 2.0. UK dedicated hosting from Manchester.
Order the RTX 4090 24GB
See also: Qwen 2.5 Coder 14B, Qwen 2.5 Coder 32B, Codestral 22B, Mixtral 8x7B, coding assistant deployment, FP8 deployment, monthly cost.