DeepSeek VRAM Requirements Overview
DeepSeek offers a wide range of models from 1.5B distillations to the massive 671B V3 MoE. The VRAM you need depends entirely on which model you are running and at what precision. This guide covers every variant to help you choose the right dedicated GPU server for your DeepSeek deployment.
The key thing to understand about the full DeepSeek V3 and R1 models is that they use a Mixture-of-Experts (MoE) architecture with 671B total parameters but only ~37B active per token. That sparsity saves compute, not memory: all 671B parameters must still be resident in VRAM.
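The table figures below follow directly from parameter count times bytes per weight. A minimal sketch of that arithmetic (the helper name and the omitted runtime overhead are illustrative, not measured values):

```python
# Rough weight-memory estimate: parameter count x bytes per weight.
# Back-of-envelope figures only; real allocators add runtime buffers on top.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Approximate decimal GB of VRAM needed just to hold the weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

# All 671B MoE parameters must be resident, even though only ~37B are active per token.
print(f"DeepSeek V3/R1 @ FP16: ~{weight_vram_gb(671, 'fp16'):.0f} GB")  # ~1342 GB
print(f"R1-Distill-14B @ FP16: ~{weight_vram_gb(14, 'fp16'):.0f} GB")   # ~28 GB
```

This is why MoE efficiency doesn't help with GPU selection: the per-token compute is that of a ~37B model, but the memory bill is for all 671B weights.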
Complete VRAM Table (All Models)
DeepSeek R1 Distillations
| Model | Parameters | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | ~6 GB | ~3 GB | ~1.6 GB | ~1 GB |
| R1-Distill-Qwen-7B | 7B | ~28 GB | ~14 GB | ~7 GB | ~4.5 GB |
| R1-Distill-Qwen-14B | 14B | ~56 GB | ~28 GB | ~14 GB | ~9 GB |
| R1-Distill-Qwen-32B | 32B | ~128 GB | ~64 GB | ~32 GB | ~20 GB |
| R1-Distill-LLaMA-8B | 8B | ~32 GB | ~16 GB | ~8.5 GB | ~5.5 GB |
| R1-Distill-LLaMA-70B | 70B | ~280 GB | ~140 GB | ~70 GB | ~38 GB |
DeepSeek Full Models
| Model | Parameters | FP32 | FP16 | INT8/FP8 | INT4 |
|---|---|---|---|---|---|
| DeepSeek V2 Lite | 16B MoE | ~64 GB | ~32 GB | ~16 GB | ~10 GB |
| DeepSeek V2 | 236B MoE | ~944 GB | ~472 GB | ~236 GB | ~125 GB |
| DeepSeek V3 | 671B MoE | ~2,684 GB | ~1,342 GB | ~671 GB | ~350 GB |
| DeepSeek R1 (full) | 671B MoE | ~2,684 GB | ~1,342 GB | ~671 GB | ~350 GB |
| DeepSeek Coder V2 Lite | 16B MoE | ~64 GB | ~32 GB | ~16 GB | ~10 GB |
| DeepSeek Coder V2 | 236B MoE | ~944 GB | ~472 GB | ~236 GB | ~125 GB |
Note: FP32 is shown for reference but never used in practice for inference. FP16 is the standard full-precision inference format. For related model comparisons, see our LLaMA 3 VRAM requirements and Qwen VRAM requirements pages.
Which GPU Do You Need?
| GPU | VRAM | Best DeepSeek Model | Precision | Use Case |
|---|---|---|---|---|
| RTX 3050 | 8 GB | R1-Distill-7B | 4-bit | Dev / testing |
| RTX 4060 | 8 GB | R1-Distill-7B | 4-bit | Dev / light API |
| RTX 4060 Ti | 16 GB | R1-Distill-7B / 14B | FP16 / 4-bit | Small production |
| RTX 3090 | 24 GB | R1-Distill-14B / 32B | FP16 / 4-bit | Production |
| 2x RTX 3090 | 48 GB | R1-Distill-32B | FP16 | High quality |
| 8x RTX 6000 Pro 96 GB | 768 GB | DeepSeek V3 / R1 | FP8 | Full model |
For most users, the R1 distillations are the practical choice. The 14B and 32B distillations retain strong reasoning capability from the full R1 model at a fraction of the VRAM cost. See our RTX 3090 DeepSeek V3 analysis for the full model discussion.
Context Length Impact on VRAM
DeepSeek models support long context windows, but longer context means more KV cache VRAM:
| Model | Context Length | KV Cache (FP16) | Total VRAM (FP16 weights) |
|---|---|---|---|
| R1-Distill-7B | 4,096 | ~0.5 GB | ~14.5 GB |
| R1-Distill-7B | 16,384 | ~2 GB | ~16 GB |
| R1-Distill-7B | 32,768 | ~4 GB | ~18 GB |
| R1-Distill-14B | 4,096 | ~1 GB | ~29 GB |
| R1-Distill-14B | 16,384 | ~4 GB | ~32 GB |
| R1-Distill-32B | 4,096 | ~2 GB | ~66 GB |
| R1-Distill-32B | 16,384 | ~8 GB | ~72 GB |
For DeepSeek R1’s chain-of-thought reasoning, longer context is often needed since the model generates lengthy reasoning chains. Budget extra VRAM for this. Use our LLM cost calculator to estimate costs at your target context length.
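The KV cache grows linearly with context length, which is why the totals in the table climb with the window. A minimal sizing sketch; the formula is standard, but the layer/head/dim numbers below are assumed for illustration, and models with grouped-query attention cache far less than the conservative table budgets above:

```python
# Per-request KV-cache size: 2 tensors (K and V) per layer, each holding
# kv_heads x head_dim values per token, at the cache precision.

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate GiB of KV cache for one request at the given context length."""
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 2**30

# Example: a 7B-class model with grouped-query attention (assumed shape)
print(f"4K context:  {kv_cache_gib(28, 4, 128, 4096):.2f} GiB")
print(f"32K context: {kv_cache_gib(28, 4, 128, 32768):.2f} GiB")
```

The linear dependence on `seq_len` is the key takeaway: doubling the context window doubles the per-request cache, so long reasoning chains are paid for in VRAM, not just latency.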
Batch Size Impact on VRAM
Serving multiple concurrent requests multiplies KV cache usage:
| Model (4-bit) | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| R1-Distill-7B (4K ctx) | ~5 GB | ~7 GB | ~9 GB | ~13 GB |
| R1-Distill-14B (4K ctx) | ~10 GB | ~14 GB | ~18 GB | ~26 GB |
| R1-Distill-32B (4K ctx) | ~22 GB | ~30 GB | ~38 GB | ~54 GB |
For production APIs serving multiple users, the KV cache quickly becomes the dominant VRAM consumer. Plan your GPU choice around peak concurrent requests, not just single-request VRAM.
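Capacity planning then reduces to: pay for the weights once, and for one KV cache per concurrent request. A rough sketch of that budget; the 9 GB weight figure, 1 GB-per-request cache, and 1 GB overhead are assumed round numbers, not measurements:

```python
# Peak-VRAM planning: weights are paid once; KV cache scales with
# concurrent requests. All inputs here are illustrative assumptions.

def peak_vram_gb(weights_gb: float, kv_per_request_gb: float,
                 concurrent_requests: int, overhead_gb: float = 1.0) -> float:
    """Approximate peak VRAM for a serving workload."""
    return weights_gb + kv_per_request_gb * concurrent_requests + overhead_gb

# R1-Distill-14B, 4-bit weights (~9 GB), ~1 GB KV per 4K-context request (assumed)
for batch in (1, 4, 8, 16):
    print(f"batch {batch:2d}: ~{peak_vram_gb(9, 1, batch):.0f} GB")
```

At high concurrency the `kv_per_request_gb * concurrent_requests` term dominates the weight term, which is the arithmetic behind sizing for peak concurrent requests rather than single-request VRAM.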
Practical Deployment Recommendations
- Personal/dev use: R1-Distill-7B on an RTX 4060 (4-bit) or RTX 4060 Ti (FP16). Fast, cheap, good quality for coding and general tasks.
- Small team (2-5 users): R1-Distill-14B on an RTX 3090 (4-bit). Strong reasoning at 25-30 tok/s with room for concurrent requests.
- Production API: R1-Distill-32B on 2x RTX 3090 or higher. Best quality from the distillation family.
- Maximum capability: Full DeepSeek R1/V3 on multi-GPU clusters (8x RTX 6000 Pro 96 GB minimum).
For cost analysis of self-hosting versus the DeepSeek API, see our cost per 1M tokens comparison. Also check the deploy DeepSeek server tutorial for step-by-step instructions.
Quick Setup Commands
Ollama
```bash
# R1 distillation (auto-selects quantization)
curl -fsSL https://ollama.com/install.sh | sh
ollama run deepseek-r1:7b
ollama run deepseek-r1:14b
ollama run deepseek-r1:32b
```
vLLM
```bash
# Serve R1-Distill-14B with AWQ
pip install vllm
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --quantization awq --max-model-len 8192
```
For full deployment guides, see our Ollama hosting and vLLM hosting pages. Compare with other models in our best GPU for LLM inference guide and use our benchmark tool for performance comparisons.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers