
RTX 4090 24 GB for DeepSeek-Coder V2 Lite: A Concrete Deployment Guide

DeepSeek-Coder V2 Lite (16B MoE, 2.4B active) on a single RTX 4090 24 GB — VRAM math, vLLM config, real benchmark numbers.

DeepSeek-Coder V2 Lite is one of the strongest open-weight code models that actually fits on a 24 GB GPU. Its Mixture-of-Experts architecture (16B total parameters, 2.4B active per token) makes it fast to run while scoring close to a dense 30B model on coding benchmarks. With 24 GB of VRAM, the RTX 4090 is the budget pick for hosting it.

TL;DR

DeepSeek-Coder V2 Lite at AWQ-INT4 (10 GB) fits comfortably on a 24 GB RTX 4090 with room for KV cache and an embedding model. Expect ~410 tok/s aggregate, ~28 tok/s single-stream. £289/mo at GigaGPU; cheaper per dev than DeepSeek's API once you hit ~10 active developers.

Does it fit?

| Precision | Weight VRAM | + KV cache (8K, 16 streams) | Total | Fits 24 GB? |
|---|---|---|---|---|
| FP16 | 32 GB | +5 GB | 37 GB | No |
| FP8 | 16 GB | +5 GB | 21 GB | Tight |
| AWQ-INT4 | 10 GB | +5 GB | 15 GB | Yes, comfortable |
| GGUF Q5_K_M | 12 GB | +5 GB | 17 GB | Yes |
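The "does it fit" arithmetic above can be sketched in a few lines. The 16B parameter count and the ~5 GB KV-cache figure come from the table; the per-precision bit widths are standard, and the 25% overhead factor for the quantized format (scales, zero-points, layers left unquantized) is an assumption chosen to roughly match the table:

```python
# Back-of-the-envelope VRAM estimate for DeepSeek-Coder V2 Lite (16B total params).

PARAMS_B = 16      # total parameters, billions (MoE: only 2.4B active per token)
KV_CACHE_GB = 5    # from the table: 8K context x 16 streams, FP8 KV dtype

def weight_vram_gb(bits_per_param: float, overhead: float = 1.0) -> float:
    """Weights-only VRAM in GB: params x bits/8, times a format overhead."""
    return PARAMS_B * bits_per_param / 8 * overhead

configs = {
    "FP16":     weight_vram_gb(16),        # ~32 GB
    "FP8":      weight_vram_gb(8),         # ~16 GB
    "AWQ-INT4": weight_vram_gb(4, 1.25),   # ~10 GB incl. scales/outlier layers
}

for name, weights in configs.items():
    total = weights + KV_CACHE_GB
    verdict = "fits" if total <= 24 else "does not fit"
    print(f"{name:9s} {weights:5.1f} GB weights + {KV_CACHE_GB} GB KV "
          f"= {total:.1f} GB -> {verdict} in 24 GB")
```

The same formula explains why FP8 is listed as "tight": 21 GB of the 24 GB budget leaves little headroom for activations, CUDA context, or a sidecar embedding model.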

vLLM config

```shell
vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --quantization awq_marlin \
  --max-model-len 32768 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8_e4m3 \
  --served-model-name deepseek-coder \
  --host 0.0.0.0 --port 8000
```
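Once the server is up, any OpenAI-compatible client can talk to it. A minimal standard-library sketch; the host/port and the `deepseek-coder` alias match the flags above, while the prompt, token budget, and temperature are illustrative:

```python
import json
import urllib.request

# vLLM exposes OpenAI-compatible routes; /v1/completions suits raw code completion.
BASE_URL = "http://localhost:8000"  # matches --host/--port in the serve command

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": "deepseek-coder",   # matches --served-model-name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.2,          # low temperature suits code generation
    }

def complete(prompt: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Usage (requires the server above to be running):
# print(complete("def fibonacci(n: int) -> int:\n"))
```

Because the endpoint speaks the OpenAI wire format, the official `openai` Python SDK or any editor plugin that accepts a custom base URL works just as well.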

Performance

| Metric | RTX 4090 result |
|---|---|
| Aggregate tok/s @ 16 concurrent | ~410 |
| Single-stream tok/s | ~28 |
| Median TTFT (1K-token prompt) | ~280 ms |
| p99 TTFT | ~720 ms |
| Cost per 1M tokens (60% util) | £0.30 |

vs the alternatives

| Option | Aggregate tok/s | Cost per 1M | Verdict |
|---|---|---|---|
| RTX 4090 AWQ-INT4 | 410 | £0.30 | Reference |
| RTX 5090 AWQ-INT4 | 780 | £0.30 | Same cost-per-token, ~2× capacity |
| RTX 5090 FP8 | 950 | £0.24 | Best cost-per-token |
| DeepSeek API | n/a | £0.18 (output) | Cheapest at low volume |
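The "cheapest at low volume" verdict for the API reduces to a break-even calculation. A sketch of that arithmetic; the £289/mo server price and £0.18-per-1M API output price come from this guide, while the tokens-per-developer figure is a pure assumption you should replace with your own usage telemetry:

```python
# Break-even point between a flat-rate dedicated server and per-token API pricing.
# Simplification: counts output tokens only, ignoring input-token charges.

SERVER_MONTHLY_GBP = 289.0    # RTX 4090 at GigaGPU (from this guide)
API_GBP_PER_1M_OUT = 0.18     # DeepSeek API output-token price (from this guide)

def breakeven_developers(tokens_per_dev_per_month: float) -> float:
    """Number of developers at which the flat server fee beats the API."""
    api_cost_per_dev = tokens_per_dev_per_month / 1e6 * API_GBP_PER_1M_OUT
    return SERVER_MONTHLY_GBP / api_cost_per_dev

# Assumption: a heavy completion user generates ~150M output tokens/month.
print(f"break-even: ~{breakeven_developers(150e6):.1f} developers")
```

Under that (assumed) per-developer volume the crossover lands near ten developers, consistent with the TL;DR; halve the volume and the crossover doubles, which is why the API wins for small teams.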

Verdict

The RTX 4090 24 GB is a credible host for DeepSeek-Coder V2 Lite — comfortable INT4 fit, reasonable throughput, predictable cost. It loses on cost-per-token to the 5090 + FP8 path; consider it the right pick if 4090 stock is meaningfully cheaper than 5090 stock at the time you order.

Bottom line

For DeepSeek-Coder V2 Lite, the RTX 4090 is a solid mid-tier host. The 5090 is the better choice if FP8 is available; the API is cheaper at low volume. See best GPU for DeepSeek.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
