DeepSeek V3 and R1 at full size require datacenter infrastructure. The distilled variants – where DeepSeek’s reasoning behaviour is trained into smaller base models like Llama 3 and Qwen – are the realistic self-hosting path on our dedicated GPU hosting.
Variants
The practical distilled models you can host:
- DeepSeek-R1-Distill-Qwen-1.5B
- DeepSeek-R1-Distill-Qwen-7B
- DeepSeek-R1-Distill-Qwen-14B
- DeepSeek-R1-Distill-Qwen-32B – best quality-to-size ratio
- DeepSeek-R1-Distill-Llama-8B
- DeepSeek-R1-Distill-Llama-70B
VRAM
| Variant | FP16 | INT4 | Fits On |
|---|---|---|---|
| 1.5B | ~3 GB | ~1 GB | Any card |
| 7B | ~14 GB | ~4.5 GB | 16 GB+ card |
| 14B | ~28 GB | ~9 GB | 24 GB+ card at FP16; 8 GB+ at INT4 |
| 32B | ~64 GB | ~18 GB | 96 GB at FP16; 24 GB+ at INT4 |
| 70B | ~140 GB | ~40 GB | Multi-GPU FP16; 48 GB+ at INT4 |
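The FP16 and INT4 columns above follow directly from parameter count: weights alone take roughly (params in billions) × (bits per weight) ÷ 8 gigabytes. A minimal sketch of that arithmetic (`weight_vram_gb` is an illustrative helper, not part of any library; KV cache, activations, and quantization scales add on top, which is why the table's INT4 figures run a little above the raw number):

```python
def weight_vram_gb(params_b: float, bits_per_weight: int) -> float:
    """Approximate GB of VRAM needed for the weights of a params_b-billion model."""
    return params_b * bits_per_weight / 8

print(weight_vram_gb(32, 16))  # 64.0 -- matches the table's ~64 GB for 32B FP16
print(weight_vram_gb(7, 4))    # 3.5  -- the table's ~4.5 GB adds quantization overhead
```

Leave headroom beyond the weights figure for the KV cache, which grows with context length and batch size.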
Quality vs Full Model
Distilled models retain 70-90% of the reasoning quality of the teacher on benchmarks like MATH and GPQA. The 32B distill punches well above its weight on math and logic tasks. For most teams it is the right target – enough quality for production, small enough for a single card.
Deployment
32B Qwen distill on RTX 5090:
```shell
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92
```
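Once the server is up, it speaks the OpenAI-compatible chat API. A minimal stdlib-only sketch of a request (assumes vLLM's default bind of `localhost:8000`; adjust host, port, and `max_tokens` to your setup):

```python
import json
import urllib.request

# Chat-completions payload for the vLLM server launched above.
payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "max_tokens": 2048,  # generous budget: the reasoning trace counts as output
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))  # uncomment with the server running
```

Any OpenAI-compatible client library works the same way by pointing its base URL at the server.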
Note that reasoning models emit long “thinking” traces before answers. Budget 2-4x the output tokens you would plan for a non-reasoning model.
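R1-style distills wrap that trace in `<think>…</think>` markers before the final answer. If you only want the answer in your application, a small post-processing step suffices (a sketch; the exact markers depend on the chat template your serving stack applies):

```python
import re

def strip_reasoning(text: str) -> str:
    """Drop the <think>...</think> trace an R1-style model emits before its answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>17*23 = 17*20 + 17*3 = 340 + 51</think>The answer is 391."
print(strip_reasoning(raw))  # The answer is 391.
```

Keep the trace in logs, though: it is often the most useful artifact when debugging a wrong answer.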
DeepSeek Reasoning on Dedicated GPUs
R1 distill variants preconfigured on UK hosting, any size that fits your workload.
Browse GPU Servers

For the 32B variant specifically, see DeepSeek R1 Distill Qwen 32B. For coding, see DeepSeek Coder V2.