DeepSeek-Coder-V2-Lite-Instruct is the smaller MoE sibling of DeepSeek’s flagship 236B coder, and it is one of the most interesting models in the open-coder ecosystem because of its architectural extremism: 15.7B total parameters but only 2.4B active per token thanks to top-6-of-64 expert routing plus 2 always-on shared experts. FP8 weights at roughly 16 GB fit the RTX 4090 24GB easily, and the tiny active set means 120 t/s decode at FP8, 145 t/s at AWQ – faster than dense 7B models on the same card while delivering coding quality competitive with 14B dense coders. Apache 2.0 licence. This guide on our UK dedicated GPU hosting covers the MoE economics, deployment, and the seven gotchas that catch teams who treat it like a dense model.
Contents
- MoE economics: 16B total, 2.4B active
- Architecture, routing and KV math
- VRAM accounting across precisions
- Throughput, latency and concurrency
- Quality benchmarks vs alternatives
- vLLM deployment and code
- Production gotchas (seven items)
- When to pick this over alternatives
- Verdict
MoE economics: 16B total, 2.4B active
DeepSeek-Coder-V2-Lite has 64 routed experts plus 2 shared (always-on) experts per layer, with top-6 routing per token. Total parameters are 15.7B; active parameters per forward pass are 2.4B. The card holds the full weight set in VRAM, but decode bandwidth – the bottleneck on a 1008 GB/s 4090 – scales with the active set. Net effect: a 16B-class model runs at the speed of a 2-3B dense model, which is dramatically faster than any dense coder of comparable quality.
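A back-of-envelope sketch of why the active set, not the total, bounds decode speed. It treats decode as purely bandwidth-limited and ignores KV-cache traffic, attention and kernel efficiency, so the figures are ceilings, not predictions of the measured 120 t/s:
# Rough decode ceiling if each token only had to stream the active weights once.
# Ignores KV-cache reads, attention and kernel overheads -- an upper bound only.
BANDWIDTH_GB_S = 1008                     # RTX 4090 memory bandwidth
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1, "int4": 0.5}

def decode_ceiling_tps(active_params_billion: float, precision: str) -> float:
    bytes_per_token = active_params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return BANDWIDTH_GB_S * 1e9 / bytes_per_token

print(round(decode_ceiling_tps(2.4, "fp8")))    # ~420 t/s ceiling -- MoE active set
print(round(decode_ceiling_tps(15.7, "fp8")))   # ~64 t/s ceiling -- if all 15.7B were read per token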
The Apache 2.0 licence is the second key advantage. Where Codestral 22B requires a paid commercial licence and Qwen Coder uses the Qwen Community License, DeepSeek-Coder-V2-Lite has clean Apache 2.0 with no commercial restrictions. For startups and SaaS builders, this combination – MoE speed, dense-equivalent quality, permissive licence – is hard to beat for high-RPS coding workloads.
Architecture, routing and KV math
DeepSeek-Coder-V2-Lite is a 27-layer transformer (fewer layers than Qwen 14B’s 48) with 16 query heads and 16 KV heads (no GQA – 1:1 ratio, which is unusual; the lite version skipped GQA where the 236B parent uses MLA). Hidden dim is 2,048, with each MoE layer expanding to 64 routed experts of intermediate dim 1,408 each, plus 2 shared experts of intermediate dim 1,408. Native context is 128k via DeepSeek’s custom YaRN extension; we recommend 32k for production use to keep KV memory bounded. Vocabulary is 102,400 tokens.
KV cache per token at FP16: 16 KV heads x 128 dim x 27 layers x 2 (K+V) x 2 bytes = 221 KB. At FP8 KV: 110.5 KB. So 32k context costs roughly 3.5 GB of FP8 KV per sequence – moderate. The lack of GQA means KV is heavier per token than Qwen Coder 14B, which uses grouped-query attention to compress its KV cache.
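The same arithmetic in a few lines, using the head and layer figures above; these results feed the VRAM and concurrency tables below:
# KV-cache accounting as described above: 16 KV heads x 128 head dim x 27 layers,
# K and V stored separately, no GQA compression.
KV_HEADS, HEAD_DIM, LAYERS = 16, 128, 27

def kv_bytes_per_token(dtype_bytes: int) -> int:
    return KV_HEADS * HEAD_DIM * LAYERS * 2 * dtype_bytes    # x2 for K and V

print(kv_bytes_per_token(2) / 1e3)            # ~221 KB per token at FP16
print(kv_bytes_per_token(1) / 1e3)            # ~110.5 KB per token at FP8
print(kv_bytes_per_token(1) * 32_768 / 1e9)   # ~3.6 GB per 32k sequence at FP8 (rounded to 3.5 GB in the tables)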
VRAM accounting across precisions
| Component | FP16 | FP8 (W8A8) | AWQ INT4 |
|---|---|---|---|
| Weights (full expert set) | 32 GB | 16 GB | 9 GB |
| KV @ 32k FP8, 1 seq | 3.5 GB | 3.5 GB | 3.5 GB |
| CUDA graphs + workspace | 1.6 GB | 1.6 GB | 1.6 GB |
| vLLM scheduler + activations | 1.4 GB | 1.4 GB | 1.4 GB |
| Total at batch 1, 32k | OOM (38.5 GB) | 22.5 GB | 15.5 GB |
| Headroom under 24 GB | n/a | 1.5 GB | 8.5 GB |
FP8 is the natural fit – the model is small enough that you do not need INT4 just to fit, and FP8 quality is essentially lossless on coding benchmarks. For higher concurrency, drop to AWQ, which leaves 8.5 GB free. The lack of GQA punishes long-context concurrency: 4 sequences at 32k use 14 GB of KV alone at FP8, which forces concurrency caps lower than on dense GQA models – see the budget table and the sketch after it.
| Concurrency budget (FP8 + FP8 KV) | Avg context | KV total | Verdict |
|---|---|---|---|
| 1 seq | 32k | 3.5 GB | Comfortable, 1.5 GB free |
| 4 seqs | 4k | 1.75 GB | Comfortable |
| 8 seqs | 4k | 3.5 GB | Comfortable |
| 16 seqs | 2k | 3.5 GB | Tight – cap at 12 |
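A hypothetical sanity check on the budget table above. The ~5 GB KV budget is an assumption (the 3.5 GB of KV in the VRAM table plus the 1.5 GB of headroom at FP8 weights); check vLLM’s startup log for the real KV-cache allocation on your box:
# How many sequences of a given average context fit in the KV budget alone.
# KV_BUDGET_GB is an assumption (~5 GB at FP8 weights); scheduler and activation
# overheads lower the practical cap, which is why the table stops at 12.
KV_PER_TOKEN_FP8 = 16 * 128 * 27 * 2 * 1      # ~110.5 KB, from the math above
KV_BUDGET_GB = 5.0

def max_seqs(avg_context_tokens: int) -> int:
    return int(KV_BUDGET_GB * 1e9 // (KV_PER_TOKEN_FP8 * avg_context_tokens))

for ctx in (32_768, 4_096, 2_048):
    print(f"{ctx:>6} tokens avg context -> {max_seqs(ctx)} sequences (KV only)")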
Throughput, latency and concurrency
vLLM 0.6.3, FP8 KV, prompt 1024, generate 256:
| Quant | Batch 1 t/s | Batch 4 | Batch 8 | TTFT b1 (ms) |
|---|---|---|---|---|
| FP8 W8A8 | 120 | 410 | 620 | 95 |
| AWQ INT4 (Marlin) | 145 | 490 | 720 | 80 |
| FP16 (will not fit at 32k) | n/a | n/a | n/a | n/a |
The combination of MoE sparsity and the 4090’s bandwidth gives this model some of the best latency-per-quality ratios in the open ecosystem – 145 t/s at AWQ against Qwen Coder 14B’s 135 t/s (at only slightly lower quality than Qwen), and against Mixtral 8x7B’s 85 t/s (at far higher coding quality than Mixtral). Prefill at 8k, batch 1: FP8 measures 3,800 t/s, AWQ 4,200 t/s.
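To reproduce the decode numbers against your own server, a minimal client-side timing sketch – it assumes the FP8 launch command from the deployment section below is running on localhost with the default port, and it approximates token count by streamed chunk count:
# Rough client-side throughput check against a local vLLM server (assumptions:
# server from the FP8 command below, default port, roughly one token per streamed chunk).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first = None
chunks = 0
for chunk in client.chat.completions.create(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=256,
    temperature=0.0,
    stream=True,
):
    first = first or time.perf_counter()
    chunks += 1

print(f"TTFT ~{(first - start) * 1000:.0f} ms")
print(f"decode ~{chunks / (time.perf_counter() - first):.0f} t/s (chunk count ~ tokens)")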
Quality benchmarks vs alternatives
| Benchmark | DeepSeek Coder V2 Lite | Qwen 2.5 Coder 14B | Codestral 22B | Qwen 2.5 Coder 32B |
|---|---|---|---|---|
| HumanEval | 81.1 | 83.5 | 81.1 | 92.7 |
| MBPP | 78.4 | 78.4 | 78.2 | 87.0 |
| MultiPL-E avg (18 langs) | 71.4 | 71.2 | 61.8 | 78.5 |
| LiveCodeBench | 24.3 | 22.8 | 20.2 | 31.4 |
| Active params per token | 2.4B | 14B | 22B | 32B |
| 4090 single-card t/s (best quant) | 145 | 135 | 80 | 65 |
| Licence | Apache 2.0 | Apache equiv | MNPL paid | Apache equiv |
For raw quality the dense Qwen Coder 14B is slightly ahead on HumanEval (83.5 vs 81.1) but DeepSeek wins MultiPL-E and LiveCodeBench. The headline is the throughput-per-quality ratio: DeepSeek delivers 145 t/s at quality close to Qwen Coder 14B’s 135 t/s, with Apache 2.0 licensing and lower KV cost from a smaller model.
vLLM deployment and code
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--max-num-seqs 8 \
--enable-chunked-prefill --enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--trust-remote-code
--trust-remote-code is mandatory – DeepSeek ships custom MoE routing code in the model repository that vLLM must execute. The native context is 128k via DeepSeek YaRN, but 32k is plenty for nearly all coding work and saves KV memory. For higher concurrency, switch to AWQ:
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/DeepSeek-Coder-V2-Lite-Instruct-AWQ \
--quantization awq_marlin \
--kv-cache-dtype fp8 \
--max-model-len 16384 --max-num-seqs 16 \
--enable-chunked-prefill --enable-prefix-caching \
--gpu-memory-utilization 0.93 \
--trust-remote-code
Chat template uses DeepSeek’s specific format; tokenizer.apply_chat_template handles it correctly:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    trust_remote_code=True,
)
msgs = [
{"role":"user","content":"Write a Python function to detect a palindrome."}
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# <|begin▁of▁sentence|>User: Write a Python...\n\nAssistant:
FIM uses the standard <|fim▁begin|> / <|fim▁hole|> / <|fim▁end|> tokens (note the special-character delimiters – copy directly, do not retype). See the vLLM setup guide for trust-remote-code prerequisites and the coding assistant guide for editor integration.
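A sketch of a FIM request through vLLM’s raw completions endpoint, assuming the prefix-before-hole ordering of DeepSeek’s FIM format; the token strings must carry the exact U+FF5C and U+2581 characters, so copy them from the tokenizer (or this snippet) rather than retyping:
# Hypothetical FIM call via the completions (not chat) endpoint. Token strings use
# U+FF5C and U+2581; prefix-before-hole ordering is assumed from DeepSeek's FIM format.
from openai import OpenAI

FIM_BEGIN, FIM_HOLE, FIM_END = "<｜fim▁begin｜>", "<｜fim▁hole｜>", "<｜fim▁end｜>"

prefix = "def is_palindrome(s: str) -> bool:\n    "
suffix = "\n\nprint(is_palindrome('level'))\n"
prompt = f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
out = client.completions.create(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    prompt=prompt,
    max_tokens=64,
    temperature=0.0,
)
print(out.choices[0].text)    # the code that fills the hole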
Production gotchas (seven items)
- Custom routing code required: --trust-remote-code is mandatory because DeepSeek’s MoE router uses custom Python that ships with the model. Pin the model commit hash for reproducibility – upstream changes can subtly affect routing.
- No GQA hurts long-context concurrency: the 1:1 KV ratio means KV memory is heavier per token than Qwen’s GQA models. Cap concurrency below what you would expect for a 16B model.
- FIM token delimiters are special chars: <|fim▁begin|> uses U+FF5C and U+2581, not regular ASCII. Editor plugins that retype these tokens fail silently. Always copy the exact bytes.
- Expert routing imbalance under adversarial inputs: at batch 1 with very repetitive prompts, tokens can collapse onto the same few experts, serialising the expert pipeline. Not a problem in normal coding workloads.
- YaRN to 128k is not production-grade: long-context retrieval quality drops past 32k. Run at 32k and chunk-summarise larger files.
- Model commit pinning: DeepSeek occasionally pushes minor updates to the routing weights. Pin --revision to a known-good commit for production stability – a pre-fetch sketch follows this list.
- Power and thermal at sustained batch 16: the 4090 draws 440-450 W. Cap to 410 W with nvidia-smi -pl 410 – see the power draw analysis.
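One way to make the commit pin concrete: pre-fetch a known-good snapshot with huggingface_hub so the serving host never silently picks up new routing weights, then pass the same hash to vLLM’s --revision flag. The revision value below is a placeholder:
# Pre-fetch a pinned snapshot; pass the same commit to vLLM via --revision.
# PINNED_REVISION is a placeholder -- record the commit you actually validated.
from huggingface_hub import snapshot_download

PINNED_REVISION = "<known-good-commit-sha>"

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    revision=PINNED_REVISION,
)
print("weights cached at:", local_path)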
When to pick this over alternatives
Pick DeepSeek-Coder-V2-Lite over Qwen 2.5 Coder 14B when raw throughput is the headline (high-RPS code SaaS) and Apache 2.0 licensing matters. Pick it over Codestral 22B on every dimension – faster, comparable quality, free commercial licence. Pick it over Mixtral 8x7B for any code workload – vastly better at coding while running faster.
Step up to Qwen 2.5 Coder 32B when raw quality on hard tasks matters more than throughput. Step laterally to Qwen Coder 14B when FIM quality (RepoBench-style) is the focus rather than raw HumanEval. Step down only if you need extreme RPS – DeepSeek-Coder-V2-Lite is already the fastest competitive coder.
Verdict
DeepSeek-Coder-V2-Lite-Instruct on the RTX 4090 24GB is the highest-throughput competitive open coder on a single card in 2026. 145 t/s AWQ at quality close to Qwen Coder 14B, 32k context, Apache 2.0 licence, MoE sparsity that makes it feel like a 2-3B dense model. The trade-offs are manageable: trust-remote-code requirement, no GQA limiting long-context concurrency, FIM token delimiters that bite naive integrations. For high-RPS code SaaS, batch refactoring pipelines, or any team where throughput per dollar drives the decision, this is a default 4090 choice. For absolute peak quality pick Qwen Coder 32B; for the strongest 14B alternative pick Qwen Coder 14B. Compare with the RTX 5090 32GB if you need higher concurrency.
Deploy DeepSeek-Coder-V2-Lite on a UK RTX 4090
16B MoE, 2.4B active, 145 t/s AWQ. Apache 2.0. UK dedicated hosting from Manchester.
Order the RTX 4090 24GB
See also: Qwen 2.5 Coder 14B, Qwen 2.5 Coder 32B, Codestral 22B, Mixtral 8x7B, coding assistant deployment, FP8 deployment, monthly cost.