Table of Contents
DeepSeek-Coder V2 Lite is one of the strongest open-weight code models that actually fits on a 24 GB GPU. The Mixture-of-Experts architecture (16B params, 2.4B active per token) makes it fast to run while scoring close to a dense 30B model on coding benchmarks. The RTX 4090’s 24 GB is the budget pick for hosting it.
DeepSeek-Coder V2 Lite at AWQ-INT4 (10 GB) fits comfortably on a 24 GB RTX 4090 with room for KV cache and an embedding model. Expect ~410 tok/s aggregate, ~28 tok/s single-stream. £289/mo at GigaGPU; cheaper per dev than DeepSeek's API once you hit ~10 active developers.
Does it fit?
| Precision | Weight VRAM | + KV cache (8K, 16 streams) | Total | Fits 24 GB? |
|---|---|---|---|---|
| FP16 | 32 GB | +5 GB | 37 GB | No |
| FP8 | 16 GB | +5 GB | 21 GB | Tight |
| AWQ-INT4 | 10 GB | +5 GB | 15 GB | Yes, comfortable |
| GGUF Q5_K_M | 12 GB | +5 GB | 17 GB | Yes |
vLLM config
vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
--quantization awq_marlin \
--max-model-len 32768 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--kv-cache-dtype fp8_e4m3 \
--served-model-name deepseek-coder \
--host 0.0.0.0 --port 8000
Performance
| Metric | RTX 4090 result |
|---|---|
| Aggregate tok/s @ 16 concurrent | ~410 |
| Single-stream tok/s | ~28 |
| Median TTFT (1K-token prompt) | ~280 ms |
| p99 TTFT | ~720 ms |
| Cost per 1M tokens (60% util) | £0.30 |
vs the alternatives
| Option | Aggregate tok/s | Cost per 1M | Verdict |
|---|---|---|---|
| RTX 4090 AWQ-INT4 | 410 | £0.30 | Reference |
| RTX 5090 AWQ-INT4 | 780 | £0.30 | Same cost-per-token, ~2× capacity |
| RTX 5090 FP8 | 950 | £0.24 | Best cost-per-token |
| DeepSeek API | n/a | £0.18 (output) | Cheapest at low volume |
Verdict
The RTX 4090 24 GB is a credible host for DeepSeek-Coder V2 Lite — comfortable INT4 fit, reasonable throughput, predictable cost. It loses on cost-per-token to the 5090 + FP8 path; consider it the right pick if 4090 stock is meaningfully cheaper than 5090 stock at the time you order.
Bottom line
For DeepSeek-Coder V2 Lite, the RTX 4090 is a solid mid-tier host. The 5090 is the better choice if FP8 is available; the API is cheaper at low volume. See best GPU for DeepSeek.