
DeepSeek VRAM Requirements (All Model Sizes)

Complete DeepSeek VRAM requirements for every model — R1, V3, Coder, and distillations. FP32, FP16, INT8, and INT4 tables plus GPU recommendations.

DeepSeek VRAM Requirements Overview

DeepSeek offers a wide range of models from 1.5B distillations to the massive 671B V3 MoE. The VRAM you need depends entirely on which model you are running and at what precision. This guide covers every variant to help you choose the right dedicated GPU server for your DeepSeek deployment.

The key thing to understand about DeepSeek V3 and R1 (full versions) is that they use a Mixture-of-Experts architecture with 671B total parameters but only ~37B active per token. Despite this efficiency during computation, all 671B parameters must be loaded into VRAM.
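The figures in the tables below follow a simple rule of thumb: weight VRAM ≈ parameters × bytes per parameter, plus roughly 10–20% overhead for activations and runtime buffers. A minimal sketch of that arithmetic (the 1.2 overhead factor is our own assumption; real overhead varies by inference runtime):

```python
def weight_vram_gb(params_billion, bits_per_param, overhead=1.2):
    """Rough weight-only VRAM estimate in GB.

    overhead approximates activation/runtime buffers (assumption: 20%).
    """
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param * overhead

# DeepSeek V3/R1 at FP16, weights only (no overhead): matches ~1,342 GB below
print(weight_vram_gb(671, 16, overhead=1.0))  # → 1342.0
# A 7B distillation at 4-bit, weights only
print(weight_vram_gb(7, 4, overhead=1.0))     # → 3.5
```

Quantized runtimes add their own bookkeeping on top of the raw weight size, which is why the 4-bit column below shows ~4.5 GB for the 7B model rather than 3.5 GB.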

Complete VRAM Table (All Models)

DeepSeek R1 Distillations

| Model | Parameters | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | ~6 GB | ~3 GB | ~1.6 GB | ~1 GB |
| R1-Distill-Qwen-7B | 7B | ~28 GB | ~14 GB | ~7 GB | ~4.5 GB |
| R1-Distill-Qwen-14B | 14B | ~56 GB | ~28 GB | ~14 GB | ~9 GB |
| R1-Distill-Qwen-32B | 32B | ~128 GB | ~64 GB | ~32 GB | ~20 GB |
| R1-Distill-LLaMA-8B | 8B | ~32 GB | ~16 GB | ~8.5 GB | ~5.5 GB |
| R1-Distill-LLaMA-70B | 70B | ~280 GB | ~140 GB | ~70 GB | ~38 GB |

DeepSeek Full Models

| Model | Parameters | FP32 | FP16 | INT8/FP8 | INT4 |
|---|---|---|---|---|---|
| DeepSeek V2 Lite | 16B MoE | ~64 GB | ~32 GB | ~16 GB | ~10 GB |
| DeepSeek V2 | 236B MoE | ~944 GB | ~472 GB | ~236 GB | ~125 GB |
| DeepSeek V3 | 671B MoE | ~2,684 GB | ~1,342 GB | ~671 GB | ~350 GB |
| DeepSeek R1 (full) | 671B MoE | ~2,684 GB | ~1,342 GB | ~671 GB | ~350 GB |
| DeepSeek Coder V2 Lite | 16B MoE | ~64 GB | ~32 GB | ~16 GB | ~10 GB |
| DeepSeek Coder V2 | 236B MoE | ~944 GB | ~472 GB | ~236 GB | ~125 GB |

Note: FP32 is shown for reference only; it is effectively never used for inference. FP16 is the standard format for unquantized inference. For related model comparisons, see our LLaMA 3 VRAM requirements and Qwen VRAM requirements pages.

Which GPU Do You Need?

| GPU | VRAM | Best DeepSeek Model | Precision | Use Case |
|---|---|---|---|---|
| RTX 3050 | 8 GB | R1-Distill-7B | 4-bit | Dev / testing |
| RTX 4060 | 8 GB | R1-Distill-7B | 4-bit | Dev / light API |
| RTX 4060 Ti | 16 GB | R1-Distill-7B / 14B | FP16 / 4-bit | Small production |
| RTX 3090 | 24 GB | R1-Distill-14B / 32B | FP16 / 4-bit | Production |
| 2x RTX 3090 | 48 GB | R1-Distill-32B | FP16 | High quality |
| 8x RTX 6000 Pro 96 GB | 768 GB | DeepSeek V3 / R1 | FP8 | Full model |

For most users, the R1 distillations are the practical choice. The 14B and 32B distillations retain strong reasoning capability from the full R1 model at a fraction of the VRAM cost. See our RTX 3090 DeepSeek V3 analysis for the full model discussion.
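The GPU pairings above can be sanity-checked with a toy fit test. This sketch uses a bare params × bytes estimate with 20% headroom (our own assumption; real runtimes differ, and it ignores KV cache):

```python
def highest_precision_that_fits(params_billion, vram_gb):
    """Return the highest precision whose estimated weight footprint fits,
    or None if the model does not fit even at 4-bit.

    Assumes weights * 1.2 headroom; KV cache is not included.
    """
    for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        if params_billion * (bits / 8) * 1.2 <= vram_gb:
            return name
    return None

print(highest_precision_that_fits(7, 24))    # 7B on an RTX 3090 → FP16
print(highest_precision_that_fits(32, 24))   # 32B on an RTX 3090 → INT4
print(highest_precision_that_fits(671, 24))  # full R1 on one 3090 → None
```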

Context Length Impact on VRAM

DeepSeek models support long context windows, but longer context means more KV cache VRAM:

| Model | Context Length | KV Cache (FP16) | Total VRAM (FP16 weights) |
|---|---|---|---|
| R1-Distill-7B | 4,096 | ~0.5 GB | ~14.5 GB |
| R1-Distill-7B | 16,384 | ~2 GB | ~16 GB |
| R1-Distill-7B | 32,768 | ~4 GB | ~18 GB |
| R1-Distill-14B | 4,096 | ~1 GB | ~29 GB |
| R1-Distill-14B | 16,384 | ~4 GB | ~32 GB |
| R1-Distill-32B | 4,096 | ~2 GB | ~66 GB |
| R1-Distill-32B | 16,384 | ~8 GB | ~72 GB |

For DeepSeek R1’s chain-of-thought reasoning, longer context is often needed since the model generates lengthy reasoning chains. Budget extra VRAM for this. Use our LLM cost calculator to estimate costs at your target context length.
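The KV cache column above scales linearly with context length because the cache stores one K and one V tensor per layer per token. A generic estimator (the architecture values in the example are illustrative assumptions, not exact DeepSeek configs):

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch=1, bytes_per_elem=2):
    """FP16 KV cache size in GB: 2 tensors (K and V) per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch / 1024**3

# Illustrative: 28 layers, full multi-head attention, head_dim 128, 4K context
print(round(kv_cache_gb(layers=28, kv_heads=28, head_dim=128, context_len=4096), 2))
```

Models using grouped-query attention shrink `kv_heads` well below the attention head count, which is why real distillations cache far less than a naive multi-head estimate suggests.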

Batch Size Impact on VRAM

Serving multiple concurrent requests multiplies KV cache usage:

| Model (4-bit) | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| R1-Distill-7B (4K ctx) | ~5 GB | ~7 GB | ~9 GB | ~13 GB |
| R1-Distill-14B (4K ctx) | ~10 GB | ~14 GB | ~18 GB | ~26 GB |
| R1-Distill-32B (4K ctx) | ~22 GB | ~30 GB | ~38 GB | ~54 GB |

For production APIs serving multiple users, the KV cache quickly becomes the dominant VRAM consumer. Plan your GPU choice around peak concurrent requests, not just single-request VRAM.
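That planning rule reduces to a one-line budget: fixed weight cost plus a per-request KV cost multiplied by peak concurrency. A sketch using approximate numbers from the tables above:

```python
def serving_vram_gb(weight_gb, kv_per_request_gb, concurrent_requests):
    """Total serving VRAM = model weights + per-request KV cache x peak concurrency."""
    return weight_gb + kv_per_request_gb * concurrent_requests

# 14B at 4-bit (~9 GB weights) with ~1 GB of KV cache per 4K-context request:
print(serving_vram_gb(9, 1.0, 16))  # → 25.0, in line with the ~26 GB batch-16 figure
```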

Practical Deployment Recommendations

  • Personal/dev use: R1-Distill-7B on an RTX 4060 (4-bit) or RTX 4060 Ti (FP16). Fast, cheap, good quality for coding and general tasks.
  • Small team (2-5 users): R1-Distill-14B on an RTX 3090 (4-bit). Strong reasoning at 25-30 tok/s with room for concurrent requests.
  • Production API: R1-Distill-32B on 2x RTX 3090 or higher. Best quality from the distillation family.
  • Maximum capability: Full DeepSeek R1/V3 on multi-GPU clusters (8x RTX 6000 Pro 96 GB minimum).

For cost analysis of self-hosting versus the DeepSeek API, see our cost per 1M tokens comparison. Also check the deploy DeepSeek server tutorial for step-by-step instructions.

Quick Setup Commands

Ollama

# R1 distillation (auto-selects quantization)
curl -fsSL https://ollama.com/install.sh | sh
ollama run deepseek-r1:7b
ollama run deepseek-r1:14b
ollama run deepseek-r1:32b

vLLM

# Serve R1-Distill-14B with AWQ
pip install vllm
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --quantization awq --max-model-len 8192
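Once vLLM is serving, it exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch using only the Python standard library; the model name matches the serve command above, and the port is the vLLM default:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM default; adjust if you changed the port
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"

def build_payload(prompt, model=MODEL, max_tokens=256):
    """OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    """POST a chat completion to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("Summarise the KV cache in one sentence.")  # requires the server above to be running
```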

For full deployment guides, see our Ollama hosting and vLLM hosting pages. Compare with other models in our best GPU for LLM inference guide and use our benchmark tool for performance comparisons.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
