
DeepSeek V3 Distilled Models – Self-Hosted Options

The distilled variants of DeepSeek V3 (and R1) fit on single GPUs and carry most of the reasoning quality – the practical way to self-host DeepSeek capability.

DeepSeek V3 and R1 at full size require datacenter infrastructure. The distilled variants – where DeepSeek’s reasoning behaviour is trained into smaller base models like Llama 3 and Qwen – are the realistic self-hosting path on our dedicated GPU hosting.


Variants

The practical distilled models you can host:

  • DeepSeek-R1-Distill-Qwen-1.5B
  • DeepSeek-R1-Distill-Qwen-7B
  • DeepSeek-R1-Distill-Qwen-14B
  • DeepSeek-R1-Distill-Qwen-32B – best quality-to-size ratio
  • DeepSeek-R1-Distill-Llama-8B
  • DeepSeek-R1-Distill-Llama-70B

VRAM

Variant | FP16    | INT4    | Fits On
1.5B    | ~3 GB   | ~1 GB   | Any card
7B      | ~14 GB  | ~4.5 GB | 16 GB+ card
14B     | ~28 GB  | ~9 GB   | 24 GB+ card at FP16; 8 GB+ at INT4
32B     | ~64 GB  | ~18 GB  | 96 GB at FP16; 24 GB+ at INT4
70B     | ~140 GB | ~40 GB  | Multi-GPU at FP16; 48 GB+ at INT4
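The FP16 column follows directly from parameter count × 2 bytes, and INT4 from parameter count × 0.5 bytes; the table's INT4 figures run a little higher because quantization metadata (scales, zero-points) and the KV cache add overhead. A minimal sketch of that weights-only estimate (the helper is illustrative, not a vLLM or DeepSeek utility):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Estimate VRAM for model weights alone: params x (bits / 8) bytes.

    Excludes KV cache, activations, and quantization metadata,
    so real-world usage is somewhat higher than this figure.
    """
    return params_billion * bits / 8  # 1B params ~ 1 GB per 8 bits

print(weight_vram_gb(32, 16))  # 64.0 GB, matching the 32B FP16 row
print(weight_vram_gb(7, 16))   # 14.0 GB, matching the 7B FP16 row
print(weight_vram_gb(7, 4))    # 3.5 GB; the table's ~4.5 GB includes overhead
```

This is why the 32B distill at INT4 (~16 GB of weights plus overhead) lands on 24 GB cards, while FP16 needs a 96 GB-class GPU.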

Quality vs Full Model

Distilled models retain 70-90% of the reasoning quality of the teacher on benchmarks like MATH and GPQA. The 32B distill punches well above its weight on math and logic tasks. For most teams it is the right target – enough quality for production, small enough for a single card.

Deployment

Serving the 32B Qwen distill with vLLM on an RTX 5090 (32 GB). Note that `--quantization awq` expects an AWQ-quantized checkpoint, so point `--model` at an AWQ build of the model; the base FP16 weights need ~64 GB and will not fit a single 5090:

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92

Note that reasoning models emit long “thinking” traces before answers. Budget 2-4x the output tokens you would plan for a non-reasoning model.
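Once the server is up, it speaks the OpenAI-compatible chat completions API, so the 2-4x budget above translates directly into the `max_tokens` you send. A sketch of applying it (the helper function and the 3x default are illustrative assumptions, not part of vLLM):

```python
def reasoning_max_tokens(answer_tokens: int, multiplier: int = 3) -> int:
    """Pad the answer budget to leave room for the "thinking" trace
    that R1 distill models emit before the final answer."""
    return answer_tokens * multiplier

# Request body for the vLLM OpenAI-compatible endpoint
# (POST http://localhost:8000/v1/chat/completions)
request_body = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [
        {"role": "user", "content": "Prove that sqrt(2) is irrational."}
    ],
    # ~512 tokens of answer, x3 to cover the thinking trace
    "max_tokens": reasoning_max_tokens(512),
}
print(request_body["max_tokens"])  # 1536
```

Undersizing `max_tokens` on a reasoning model often truncates mid-thought, so err toward the high end of the 2-4x range for math and logic prompts.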

DeepSeek Reasoning on Dedicated GPUs

R1 distill variants preconfigured on UK hosting, any size that fits your workload.

Browse GPU Servers

For the 32B variant specifically see DeepSeek R1 Distill Qwen 32B. For coding see DeepSeek Coder V2.



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
