The DeepSeek R1 distill series puts R1’s reasoning behaviour into smaller base models. The 7B Qwen-distilled variant fits the RTX 5060 Ti 16GB comfortably at FP8 or AWQ on our hosting.
Fit
DeepSeek-R1-Distill-Qwen-7B:
- FP16: ~14 GB, tight
- FP8: ~7 GB, comfortable KV cache
- AWQ INT4: ~4 GB, room for many concurrent users
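The fit numbers above follow from simple weight-size arithmetic (weights only; KV cache and activations come on top). A quick sketch, using decimal GB to match the figures in the list:

```python
# Back-of-envelope weight memory for a 7B-parameter model.
# KV cache and activation memory are NOT included here.
PARAMS = 7e9

for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("AWQ INT4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.1f} GB of weights")
```

On a 16 GB card, the difference between ~14 GB and ~7 GB of weights is what turns the KV cache from "tight" into "comfortable".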
Deployment
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--quantization fp8 \
--max-model-len 32768 \
--enable-prefix-caching \
--gpu-memory-utilization 0.92
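Once the server is up, it speaks the standard OpenAI chat-completions protocol. A minimal request sketch in Python (the model name matches the launch command; port 8000 is vLLM's default, and the question text is purely illustrative):

```python
import json

# Hypothetical request against the vLLM OpenAI-compatible endpoint started above.
# Send it with e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
    # Leave generous headroom: the <think> trace counts against this limit.
    "max_tokens": 2048,
}
print(json.dumps(payload, indent=2))
```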
The --max-model-len 32768 flag matters: reasoning models emit long thinking traces that consume context, and a lower limit will truncate generations mid-thought.
Thinking Traces
R1 distills emit <think>...</think> wrapping their reasoning before the final answer. Two display patterns:
- Show thinking to user – builds trust, debugging-friendly
- Strip thinking client-side – cleaner UX for end users
Regex: /<think>[\s\S]*?<\/think>/g
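The strip pattern above translates directly to Python for server- or client-side post-processing (the sample response text is illustrative):

```python
import re

# Same pattern as the regex above; trailing \s* also swallows the
# whitespace that separates the trace from the final answer.
THINK_RE = re.compile(r"<think>[\s\S]*?</think>\s*")

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks, keeping only the final answer."""
    return THINK_RE.sub("", text).strip()

raw = "<think>9.9 = 9.90, and 9.90 > 9.11.</think>\nNo, 9.9 is larger."
print(strip_thinking(raw))  # -> No, 9.9 is larger.
```

The non-greedy `*?` matters: a greedy match would merge multiple think blocks in one response into a single span and delete the answer text between them.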
Latency Budget
Reasoning models emit 2-5x more output tokens than non-reasoning models for the same final answer. Typical math problem:
- Non-reasoning 7B: ~1.5 s, ~80 output tokens
- R1 Distill 7B: ~8 s, ~700 output tokens (mostly thinking)
Budget SLAs accordingly. For strict latency, route only reasoning-needed queries to R1 and default to a regular model.
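That routing idea can be sketched with a simple keyword heuristic. This is a placeholder illustration, not a recommendation: the hint list and the fallback model name are invented, and in practice you would likely use a classifier rather than substring matching:

```python
# Hypothetical latency-aware router: send reasoning-heavy queries to the
# R1 distill, everything else to a faster non-reasoning default.
REASONING_HINTS = ("prove", "step by step", "solve", "derive", "debug", "plan")

def pick_model(query: str) -> str:
    q = query.lower()
    if any(hint in q for hint in REASONING_HINTS):
        return "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
    return "fast-default-model"  # placeholder for your non-reasoning model

print(pick_model("Solve x^2 - 5x + 6 = 0"))         # -> the R1 distill
print(pick_model("What is the capital of France?"))  # -> fast-default-model
```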
Use Cases
- Math and logic problems
- Code generation with self-correction
- Multi-step planning
- Verification of other model outputs
See the 32B variant guide for larger deployments.
Reasoning Model at Mid-Tier
R1 distill on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: monthly cost analysis, all distilled variants.