
RTX 4090 24GB for DeepSeek-Coder-V2-Lite: 16B MoE, 2.4B Active, 145 t/s

DeepSeek-Coder-V2-Lite-Instruct on the RTX 4090 24GB - a 16B MoE with just 2.4B active params; FP8 fits in 16 GB, 120 t/s decode, Apache 2.0 - the best throughput-per-quality open coder on a single card.

DeepSeek-Coder-V2-Lite-Instruct is the smaller MoE sibling of DeepSeek’s flagship 236B coder, and one of the most interesting models in the open-coder ecosystem because of its extreme sparsity: 15.7B total parameters but only 2.4B active per token, thanks to top-6-of-64 expert routing plus 2 always-on shared experts. FP8 weights at roughly 16 GB fit the RTX 4090 24GB easily, and the tiny active set means 120 t/s decode at FP8 and 145 t/s at AWQ – faster than dense 7B models on the same card, while delivering coding quality competitive with 14B dense coders. The licence is Apache 2.0. This guide, part of our UK dedicated GPU hosting series, covers the MoE economics, deployment, and the seven gotchas that catch teams who treat it like a dense model.


MoE economics: 16B total, 2.4B active

DeepSeek-Coder-V2-Lite has 64 routed experts plus 2 shared (always-on) experts per layer, with top-6 routing per token. Total parameters are 15.7B; active parameters per forward pass are 2.4B. The card holds the full weight set in VRAM (GDDR6X – the 4090 has no HBM), but decode bandwidth – the bottleneck on a 1008 GB/s 4090 – scales with the active set. Net effect: a 16B-class model runs at the speed of a 2-3B dense model, dramatically faster than any dense coder of comparable quality.
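Why the active set, not the total, governs decode speed can be sketched with back-of-envelope arithmetic. This is a rough model that ignores KV-cache reads and kernel overhead, which is why measured decode lands at 120 t/s rather than at the ceiling:

```python
# Back-of-envelope decode ceiling for a bandwidth-bound MoE, using the
# figures quoted in this guide (a sketch, not a benchmark).
TOTAL_PARAMS = 15.7e9       # full expert set resident in VRAM
ACTIVE_PARAMS = 2.4e9       # top-6-of-64 routed + 2 shared experts per token
BANDWIDTH = 1008e9          # RTX 4090 memory bandwidth, bytes/s
BYTES_PER_PARAM_FP8 = 1

# Decode streams only the *active* weights per token, so the ceiling
# scales with the 2.4B active set, not the 15.7B total.
bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM_FP8
ceiling_tps = BANDWIDTH / bytes_per_token

print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")   # 15.3%
print(f"FP8 decode ceiling: {ceiling_tps:.0f} t/s")             # 420 t/s
```

A dense 16B model at FP8 would have a ceiling of 1008/15.7 ≈ 64 t/s by the same arithmetic, which is why the MoE runs several times faster at the same total size.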

The Apache 2.0 licence is the second key advantage. Where Codestral 22B requires a paid commercial licence and Qwen Coder uses the Qwen Community License, DeepSeek-Coder-V2-Lite has clean Apache 2.0 with no commercial restrictions. For startups and SaaS builders, this combination – MoE speed, dense-equivalent quality, permissive licence – is hard to beat for high-RPS coding workloads.

Architecture, routing and KV math

DeepSeek-Coder-V2-Lite is a 27-layer transformer (fewer layers than Qwen 14B’s 48) with 16 query heads and 16 KV heads (no GQA – 1:1 ratio, which is unusual; the lite version skipped GQA where the 236B parent uses MLA). Hidden dim is 2,048, with each MoE layer expanding to 64 routed experts of intermediate dim 1,408 each, plus 2 shared experts of intermediate dim 1,408. Native context is 128k via DeepSeek’s custom YaRN extension; we recommend 32k for production use to keep KV memory bounded. Vocabulary is 102,400 tokens.

KV cache per token at FP16: 16 KV heads x 128 dim x 27 layers x 2 (K+V) x 2 bytes = 221 KB. At FP8 KV: 110.5 KB, so a 32k-context sequence costs roughly 3.5 GB at FP8 – moderate. The lack of GQA means KV is heavier per token than in Qwen Coder 14B, which uses GQA to compress its KV cache several-fold.
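The arithmetic above as a checkable snippet; note the unrounded 32k figure lands slightly above the ~3.5 GB used in the tables:

```python
# KV-cache sizing from the head/layer counts above (no GQA: 16 KV heads).
KV_HEADS = 16
HEAD_DIM = 128
LAYERS = 27

def kv_bytes_per_token(bytes_per_elem):
    # K and V tensors per layer, across all KV heads.
    return KV_HEADS * HEAD_DIM * LAYERS * 2 * bytes_per_elem

fp16 = kv_bytes_per_token(2)   # 221,184 bytes ~= 221 KB
fp8 = kv_bytes_per_token(1)    # 110,592 bytes ~= 110.5 KB
seq_32k = 32_768 * fp8         # one 32k sequence at FP8 KV

print(f"FP16 per token: {fp16 / 1e3:.1f} KB")
print(f"FP8 per token:  {fp8 / 1e3:.1f} KB")
print(f"32k seq @ FP8:  {seq_32k / 1e9:.2f} GB")
```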

VRAM accounting across precisions

| Component | FP16 | FP8 (W8A8) | AWQ INT4 |
|---|---|---|---|
| Weights (full expert set) | 32 GB | 16 GB | 9 GB |
| KV @ 32k FP8, 1 seq | 3.5 GB | 3.5 GB | 3.5 GB |
| CUDA graphs + workspace | 1.6 GB | 1.6 GB | 1.6 GB |
| vLLM scheduler + activations | 1.4 GB | 1.4 GB | 1.4 GB |
| Total at batch 1, 32k | OOM (38.5 GB) | 22.5 GB | 15.5 GB |
| Headroom under 24 GB | n/a | 1.5 GB | 8.5 GB |

FP8 is the natural fit – the model is small enough that you do not need INT4 for the fit, and FP8 quality is essentially lossless on coding benchmarks. For higher concurrency, drop to AWQ which leaves 8.5 GB free. The lack of GQA punishes you on long-context concurrency: 4 sequences at 32k uses 14 GB of KV alone at FP8, which forces concurrency limits below dense GQA models.

| Concurrency budget (FP8 + FP8 KV) | Avg context | KV total | Verdict |
|---|---|---|---|
| 1 seq | 32k | 3.5 GB | Comfortable, 1.5 GB free |
| 4 seqs | 4k | 1.75 GB | Comfortable |
| 8 seqs | 4k | 3.5 GB | Comfortable |
| 16 seqs | 2k | 3.5 GB | Tight – cap at 12 |
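The concurrency rows can be reproduced from the per-token KV figure derived earlier; the exact values land a few percent above the table’s rounded entries:

```python
# Reproduce the KV totals in the concurrency budget (FP8 KV, no GQA),
# using the ~110.6 KB/token figure from the KV-math section.
KV_PER_TOKEN = 16 * 128 * 27 * 2  # bytes per token at FP8

for seqs, ctx in [(1, 32_768), (4, 4_096), (8, 4_096), (16, 2_048)]:
    total_gb = seqs * ctx * KV_PER_TOKEN / 1e9
    print(f"{seqs:2d} seqs @ {ctx // 1024}k context: {total_gb:.2f} GB KV")
```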

Throughput, latency and concurrency

vLLM 0.6.3, FP8 KV, prompt 1024, generate 256:

| Quant | Batch 1 t/s | Batch 4 | Batch 8 | TTFT b1 (ms) |
|---|---|---|---|---|
| FP8 W8A8 | 120 | 410 | 620 | 95 |
| AWQ INT4 (Marlin) | 145 | 490 | 720 | 80 |
| FP16 (will not fit at 32k) | n/a | n/a | n/a | n/a |

The combination of MoE sparsity and the 4090’s bandwidth gives this model some of the best latency-per-quality ratios in the open ecosystem – 145 t/s at AWQ versus Qwen Coder 14B’s 135 t/s (at only slightly lower HumanEval quality), or versus Mixtral 8x7B’s 85 t/s (at vastly higher coding quality). Prefill at 8k, batch 1: FP8 measures 3,800 t/s, AWQ 4,200 t/s.
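From the batch-1 numbers, end-to-end request latency is simply TTFT plus decode time; a quick sketch using the table’s figures:

```python
# End-to-end latency for a single request from the batch-1 table above:
# time-to-first-token plus decode time for the generated tokens.
def request_latency_s(ttft_ms, decode_tps, new_tokens):
    return ttft_ms / 1000 + new_tokens / decode_tps

fp8 = request_latency_s(95, 120, 256)   # ~2.23 s
awq = request_latency_s(80, 145, 256)   # ~1.85 s
print(f"FP8: {fp8:.2f} s, AWQ: {awq:.2f} s for a 256-token completion")
```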

Quality benchmarks vs alternatives

| Benchmark | DeepSeek Coder V2 Lite | Qwen 2.5 Coder 14B | Codestral 22B | Qwen 2.5 Coder 32B |
|---|---|---|---|---|
| HumanEval | 81.1 | 83.5 | 81.1 | 92.7 |
| MBPP | 78.4 | 78.4 | 78.2 | 87.0 |
| MultiPL-E avg (18 langs) | 71.4 | 71.2 | 61.8 | 78.5 |
| LiveCodeBench | 24.3 | 22.8 | 20.2 | 31.4 |
| Active params per token | 2.4B | 14B | 22B | 32B |
| 4090 single-card t/s (best quant) | 145 | 135 | 80 | 65 |
| Licence | Apache 2.0 | Apache equiv | MNPL paid | Apache equiv |

For raw quality the dense Qwen Coder 14B is slightly ahead on HumanEval (83.5 vs 81.1), but DeepSeek wins MultiPL-E and LiveCodeBench. The headline is the throughput-per-quality ratio: DeepSeek delivers 145 t/s at quality close to Qwen Coder 14B’s, with Apache 2.0 licensing and far fewer active parameters per token.

vLLM deployment and code

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code

--trust-remote-code is mandatory – DeepSeek ships custom MoE routing code in the model repository that vLLM must execute. The native context is 128k via DeepSeek YaRN, but 32k is plenty for nearly all coding work and saves KV memory. For higher concurrency, switch to AWQ:

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/DeepSeek-Coder-V2-Lite-Instruct-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 --max-num-seqs 16 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.93 \
  --trust-remote-code
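Once either server is up, any OpenAI-compatible client works against it. A stdlib-only sketch – the port, path, and model name assume vLLM’s defaults from the commands above; adjust to your deployment:

```python
# Minimal request against vLLM's OpenAI-compatible chat endpoint.
# URL, port, and model name assume the launch commands in this guide.
import json
from urllib import request

body = {
    "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    "messages": [
        {"role": "user", "content": "Write a Python function to detect a palindrome."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with the server running:
# with request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
print(req.full_url)
```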

The chat template uses DeepSeek’s own format; tokenizer.apply_chat_template handles it correctly:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
                                     trust_remote_code=True)
msgs = [
  {"role":"user","content":"Write a Python function to detect a palindrome."}
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# <|begin▁of▁sentence|>User: Write a Python...\n\nAssistant:

FIM uses the standard <|fim▁begin|> / <|fim▁hole|> / <|fim▁end|> tokens (note the special-character delimiters – copy directly, do not retype). See the vLLM setup guide for trust-remote-code prerequisites and the coding assistant guide for editor integration.
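Because retyped delimiters fail silently, it is safer to construct them from codepoints than to copy-paste through editors that normalise characters. A sketch assuming the delimiter spelling given above – verify the exact strings against the model’s tokenizer files before relying on them:

```python
# Build the FIM delimiters from codepoints rather than retyping them -
# the bars are U+FF5C and the low underscores U+2581, as warned above.
BAR = "\uff5c"   # fullwidth vertical bar, not ASCII "|"
LOW = "\u2581"   # lower one-eighth block, not ASCII "_"

FIM_BEGIN = f"<{BAR}fim{LOW}begin{BAR}>"
FIM_HOLE = f"<{BAR}fim{LOW}hole{BAR}>"
FIM_END = f"<{BAR}fim{LOW}end{BAR}>"

# Assemble a fill-in-the-middle prompt: prefix, hole, suffix.
prefix = "def is_palindrome(s: str) -> bool:\n    "
suffix = "\n"
fim_prompt = f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"
print(repr(fim_prompt))
```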

Production gotchas (seven items)

  • Custom routing code required: --trust-remote-code is mandatory because DeepSeek’s MoE router uses custom Python that ships with the model. Pin the model commit hash for reproducibility – upstream changes can subtly affect routing.
  • No GQA hurts long-context concurrency: 1:1 KV ratio means KV memory is heavier per token than Qwen GQA models. Cap concurrency below where you would expect for a 16B model.
  • FIM token delimiters are special chars: <|fim▁begin|> uses U+FF5C (fullwidth vertical bar) and U+2581 (lower one-eighth block), not regular ASCII. Editor plugins that retype these tokens fail silently. Always copy the exact bytes.
  • Expert routing imbalance under adversarial inputs: very repetitive prompts can route most tokens to the same few experts, serialising expert computation at batch 1. Not a problem in normal coding workloads.
  • YaRN to 128k is not production-grade: long-context retrieval drops past 32k. Use 32k native; chunk-summarise larger files.
  • Model commit pinning: DeepSeek occasionally pushes minor updates to the routing weights. Pin --revision to a known-good commit for production stability.
  • Power and thermal at sustained batch 16: 4090 draws 440-450 W. Cap to 410 W with nvidia-smi -pl 410, see power draw analysis.

When to pick this over alternatives

Pick DeepSeek-Coder-V2-Lite over Qwen 2.5 Coder 14B when raw throughput is the headline (high-RPS code SaaS) and Apache 2.0 licensing matters. Pick it over Codestral 22B on every dimension – faster, comparable quality, free commercial licence. Pick it over Mixtral 8x7B for any code workload – vastly better at coding while running faster.

Step up to Qwen 2.5 Coder 32B when raw quality on hard tasks matters more than throughput. Step laterally to Qwen Coder 14B when FIM quality (RepoBench-style) is the focus rather than raw HumanEval. Step down only if you need extreme RPS – DeepSeek-Coder-V2-Lite is already the fastest competitive coder.

Verdict

DeepSeek-Coder-V2-Lite-Instruct on the RTX 4090 24GB is the highest-throughput competitive open coder on a single card in 2026. 145 t/s AWQ at quality close to Qwen Coder 14B, 32k context, Apache 2.0 licence, MoE sparsity that makes it feel like a 2-3B dense model. The trade-offs are manageable: trust-remote-code requirement, no GQA limiting long-context concurrency, FIM token delimiters that bite naive integrations. For high-RPS code SaaS, batch refactoring pipelines, or any team where throughput per dollar drives the decision, this is a default 4090 choice. For absolute peak quality pick Qwen Coder 32B; for the strongest 14B alternative pick Qwen Coder 14B. Compare with the RTX 5090 32GB if you need higher concurrency.

Deploy DeepSeek-Coder-V2-Lite on a UK RTX 4090

16B MoE, 2.4B active, 145 t/s AWQ. Apache 2.0. UK dedicated hosting from Manchester.

Order the RTX 4090 24GB

See also: Qwen 2.5 Coder 14B, Qwen 2.5 Coder 32B, Codestral 22B, Mixtral 8x7B, coding assistant deployment, FP8 deployment, monthly cost.
