DeepSeek-R1 is the open-weight reasoning breakthrough of the last cycle, and the distilled series makes that reasoning capability tractable on single-GPU hardware. On a Blackwell RTX 5060 Ti 16GB you can run DeepSeek-R1-Distill-Qwen-7B at 115 tokens/s in FP8 and Distill-Llama-8B at 108 t/s, with reasoning scores that beat GPT-4o on MATH-500. This post covers deployment, configuration, and how the distills compare to base Llama 3 8B on a Gigagpu UK GPU node.
Contents
- The distill series
- VRAM and precision
- Throughput numbers
- Reasoning vs base Llama 3 8B
- Deployment
- Reasoning-mode tips
The distill series
DeepSeek distilled R1’s reasoning traces into six smaller dense checkpoints. The two that fit comfortably in 16 GB with production-grade throughput are the 7B (Qwen2.5 base) and 8B (Llama 3.1 base) variants.
| Variant | Base | Params | MATH-500 | AIME 2024 | GPQA-D |
|---|---|---|---|---|---|
| Distill-Qwen-1.5B | Qwen2.5 | 1.5B | 83.9 | 28.9 | 33.8 |
| Distill-Qwen-7B | Qwen2.5 | 7B | 92.8 | 55.5 | 49.1 |
| Distill-Llama-8B | Llama 3.1 | 8B | 89.1 | 50.4 | 49.0 |
| Distill-Qwen-14B | Qwen2.5 | 14B | 93.9 | 69.7 | 59.1 |
VRAM and precision
Distill models think a lot – typical traces run 4k-16k reasoning tokens before the final answer – so KV cache sizing matters more than for ordinary chat.
| Variant | Precision | Weights | KV (16k) | Total |
|---|---|---|---|---|
| Distill-Qwen-1.5B | FP16 | 3.1 GB | 0.4 GB | 3.8 GB |
| Distill-Qwen-7B | FP8 | 7.6 GB | 1.8 GB | 9.9 GB |
| Distill-Llama-8B | FP8 | 8.1 GB | 2.6 GB | 11.3 GB |
| Distill-Qwen-14B | AWQ int4 | 8.4 GB | 3.6 GB | 12.5 GB |
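The KV figures above can be sanity-checked from the model architecture. A minimal sketch, assuming an FP16 KV cache and the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dim 128); the table's numbers run a little higher because they include allocator overhead:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """KV cache size for one sequence: a K and a V tensor, each of
    shape (layers, kv_heads, seq_len, head_dim)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 3.1 8B at a 16k context with an FP16 cache
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                      seq_len=16384, bytes_per_elem=2)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB before allocator overhead
```

The same formula explains why the 14B row balloons: AWQ shrinks the weights but the KV cache stays full-precision, so long reasoning traces dominate the budget.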
Throughput numbers
Measured with vLLM 0.6, reasoning-style output (8k tokens) on the 5060 Ti 16GB.
| Variant | BS=1 (t/s) | BS=4 agg. (t/s) | BS=8 agg. (t/s) | Full 8k-token trace |
|---|---|---|---|---|
| Distill-Qwen-1.5B FP16 | 285 | 1,010 | 1,640 | 28 s |
| Distill-Qwen-7B FP8 | 115 | 410 | 690 | 70 s |
| Distill-Llama-8B FP8 | 108 | 390 | 640 | 74 s |
| Distill-Qwen-14B AWQ | 68 | 230 | 390 | 118 s |
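The last column follows directly from trace length over single-stream decode speed (taking "8k" as 8,000 output tokens):

```python
def trace_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock decode time for a full reasoning trace at steady-state speed."""
    return output_tokens / tokens_per_sec

for name, tps in [("Qwen-1.5B FP16", 285), ("Qwen-7B FP8", 115),
                  ("Llama-8B FP8", 108), ("Qwen-14B AWQ", 68)]:
    print(f"{name}: {trace_seconds(8000, tps):.0f} s")
```

In other words, an interactive user waits over a minute for a single answer on the 7B/8B models; batch the requests if latency per trace is not critical, since aggregate throughput scales well.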
Reasoning vs base Llama 3 8B
DeepSeek-R1-Distill-Llama-8B shares its architecture with Meta’s Llama 3.1 8B, so comparing it against the Instruct model is a clean A/B test. Raw throughput is essentially identical (108 vs 112 t/s) because the architecture is the same – the difference is output quality on reasoning-heavy tasks.
| Benchmark | Llama 3 8B Instruct | Distill-Llama-8B | Delta |
|---|---|---|---|
| MATH-500 | 30.0 | 89.1 | +59.1 |
| AIME 2024 | 1.3 | 50.4 | +49.1 |
| GPQA-Diamond | 31.4 | 49.0 | +17.6 |
| HumanEval | 59.1 | 77.4 | +18.3 |
| MMLU | 68.4 | 69.1 | +0.7 |
| Throughput (t/s) | 112 | 108 | -4 |
The gap on general knowledge (MMLU) is small, but on any task that requires chain-of-thought the distill is transformative – and at essentially the same token cost.
Deployment
```bash
docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:v0.6.3 \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching
```
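Once the container is up, vLLM exposes an OpenAI-compatible endpoint on port 8000. A minimal client sketch; the HTTP call is left commented so the snippet runs without a live server, and the sampling settings follow the DeepSeek recommendations covered in the tips below:

```python
import json

# Chat-completions payload for the vLLM server started above.
# max_tokens is set high so the reasoning trace is not truncated.
payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
        {"role": "user", "content": "Prove that sqrt(2) is irrational."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 16384,
}

# POST to the server, e.g. with the requests library:
# r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```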
Reasoning-mode tips
- Set `max_tokens` generously (8k-16k) or reasoning traces get truncated before the final answer.
- Use temperature 0.6 with top_p 0.95, as the DeepSeek paper recommends; greedy decoding degrades reasoning quality.
- Strip the `<think>…</think>` block from user-facing responses but keep it server-side for debugging.
- Budget for roughly 10x more output tokens than a non-reasoning model; factor that into cost planning.
- Distill-Qwen-7B is usually the best VRAM/quality tradeoff on a single 16 GB card.
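Stripping the reasoning block is a one-liner. A sketch, assuming a single well-formed `<think>…</think>` prefix, which is the distills' standard output shape:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, final_answer).
    Returns an empty reasoning string if no <think> block is present."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", text.strip()

reasoning, answer = split_reasoning(
    "<think>Assume sqrt(2) = p/q in lowest terms...</think>"
    "Therefore sqrt(2) is irrational."
)
print(answer)  # Therefore sqrt(2) is irrational.
```

Log the `reasoning` half server-side; it is often the fastest way to diagnose why a trace went wrong.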
Deploy DeepSeek-R1 reasoning on a Blackwell GPU
92.8 MATH-500 on one card, FP8 native. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: Llama 3 8B benchmark, Qwen 14B benchmark, FP8 Llama deployment, vLLM setup, prefix caching.