DeepSeek V3 and R1 at full size require datacenter infrastructure. The distilled variants – where DeepSeek’s reasoning behaviour is trained into smaller base models like Llama 3 and Qwen – are the realistic self-hosting path on our dedicated GPU hosting.
Variants
The practical distilled models you can host:
- DeepSeek-R1-Distill-Qwen-1.5B
- DeepSeek-R1-Distill-Qwen-7B
- DeepSeek-R1-Distill-Qwen-14B
- DeepSeek-R1-Distill-Qwen-32B – best quality-to-size ratio
- DeepSeek-R1-Distill-Llama-8B
- DeepSeek-R1-Distill-Llama-70B
VRAM
| Variant | FP16 | INT4 | Fits On |
|---|---|---|---|
| 1.5B | ~3 GB | ~1 GB | Any card |
| 7B | ~14 GB | ~4.5 GB | 16 GB+ card |
| 14B | ~28 GB | ~9 GB | 24 GB+ card at FP16; 8 GB+ at INT4 |
| 32B | ~64 GB | ~18 GB | 96 GB at FP16; 24 GB+ at INT4 |
| 70B | ~140 GB | ~40 GB | Multi-GPU FP16; 48 GB+ at INT4 |
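The FP16 and INT4 columns above follow directly from parameter count: weights alone take roughly (params in billions) × (bits per weight) ÷ 8 gigabytes. A minimal sketch of that arithmetic (`weight_vram_gb` is an illustrative helper, not part of any library; KV cache, activations, and quantization scales add on top, which is why the table's INT4 figures run a little above the raw number):

```python
def weight_vram_gb(params_b: float, bits_per_weight: int) -> float:
    """Approximate GB of VRAM needed for the weights of a params_b-billion model."""
    return params_b * bits_per_weight / 8

print(weight_vram_gb(32, 16))  # 64.0 -- matches the table's ~64 GB for 32B FP16
print(weight_vram_gb(7, 4))    # 3.5  -- the table's ~4.5 GB adds quantization overhead
```

Leave headroom beyond the weights figure for the KV cache, which grows with context length and batch size.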
Quality vs Full Model
Distilled models retain 70-90% of the reasoning quality of the teacher on benchmarks like MATH and GPQA. The 32B distill punches well above its weight on math and logic tasks. For most teams it is the right target – enough quality for production, small enough for a single card.
Deployment
32B Qwen distill on RTX 5090:
```shell
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92
```
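Once the server is up, it speaks the OpenAI-compatible chat API. A minimal stdlib-only sketch of a request (assumes vLLM's default bind of `localhost:8000`; adjust host, port, and `max_tokens` to your setup):

```python
import json
import urllib.request

# Chat-completions payload for the vLLM server launched above.
payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "max_tokens": 2048,  # generous budget: the reasoning trace counts as output
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))  # uncomment with the server running
```

Any OpenAI-compatible client library works the same way by pointing its base URL at the server.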
Note that reasoning models emit long “thinking” traces before answers. Budget 2-4x the output tokens you would plan for a non-reasoning model.
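R1-style distills wrap that trace in `<think>…</think>` markers before the final answer. If you only want the answer in your application, a small post-processing step suffices (a sketch; the exact markers depend on the chat template your serving stack applies):

```python
import re

def strip_reasoning(text: str) -> str:
    """Drop the <think>...</think> trace an R1-style model emits before its answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>17*23 = 17*20 + 17*3 = 340 + 51</think>The answer is 391."
print(strip_reasoning(raw))  # The answer is 391.
```

Keep the trace in logs, though: it is often the most useful artifact when debugging a wrong answer.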
DeepSeek Reasoning on Dedicated GPUs
R1 distill variants preconfigured on UK hosting, any size that fits your workload.
Browse GPU Servers

For the 32B variant specifically, see DeepSeek R1 Distill Qwen 32B. For coding, see DeepSeek Coder V2.