Fine-Tuning DeepSeek Models
DeepSeek has released several model variants, and which one you should fine-tune on a dedicated GPU server depends on your VRAM budget and target use case. The full DeepSeek V3 (671B parameters) is impractical to fine-tune on most hardware, but the distilled variants — DeepSeek R1 Distill 7B and 8B (based on the Qwen and LLaMA architectures) — are excellent candidates that train efficiently with LoRA.
For inference requirements see our DeepSeek VRAM requirements guide. For a comparison of fine-tuning methods, check LoRA vs QLoRA vs full fine-tuning.
Which DeepSeek to Fine-Tune
| Model | Parameters | Base Architecture | Fine-Tuning Feasibility |
|---|---|---|---|
| DeepSeek V3 (full) | 671B (MoE) | Custom MoE | Impractical — requires 500+ GB VRAM for LoRA |
| DeepSeek R1 (full) | 671B (MoE) | Custom MoE | Impractical — same as V3 |
| DeepSeek R1 Distill 7B | 7B (dense) | Qwen 2.5 7B | Excellent — standard 7B fine-tuning requirements |
| DeepSeek R1 Distill 8B | 8B (dense) | LLaMA 3 8B | Excellent — identical to LLaMA 3 8B requirements |
| DeepSeek R1 Distill 14B | 14B (dense) | Qwen 2.5 14B | Good — needs 24+ GB for LoRA |
| DeepSeek R1 Distill 70B | 70B (dense) | LLaMA 3 70B | Possible — multi-GPU required |
The 7B and 8B distilled variants offer the best value: they inherit DeepSeek R1’s reasoning capabilities while being standard dense architectures that fine-tune identically to Qwen 2.5 7B and LLaMA 3 8B respectively.
VRAM Requirements
Estimated fine-tuning VRAM for the R1 Distill 7B/8B models at sequence length 512.
| Method | Precision | VRAM (batch=1) | VRAM (batch=4) | Minimum GPU |
|---|---|---|---|---|
| QLoRA (r=16) | INT4 base | ~10 GB | ~14 GB | RTX 4060 Ti (16 GB) |
| QLoRA (r=64) | INT4 base | ~12 GB | ~18 GB | RTX 4060 Ti or RTX 3090 |
| LoRA (r=16) | FP16 base | ~21 GB | ~27 GB | RTX 3090 (24 GB) |
| LoRA (r=64) | FP16 base | ~23 GB | ~31 GB | RTX 5090 (32 GB) |
| Full fine-tuning | FP16 | ~62 GB | ~78 GB | RTX 6000 Pro 96 GB |
For the 14B distill, double the base model VRAM; for the 70B distill, expect requirements similar to LLaMA 3 70B. Use our fine-tuning VRAM calculator for precise estimates.
Setup Guide
Fine-tuning the R1 Distill 8B (LLaMA-based) uses the standard Hugging Face PEFT workflow on a PyTorch GPU server.
```python
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the 8B base model in 4-bit NF4 so it fits comfortably in 16 GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters to every attention and MLP projection
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
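With r=16 adapters on all seven projection matrices, you can sanity-check the adapter size by hand: LoRA adds r·(d_in + d_out) trainable parameters per adapted matrix. A quick sketch using the published LLaMA 3 8B dimensions (hidden 4096, MLP 14336, grouped-query KV dim 1024, 32 layers); the helper name here is ours, not part of PEFT:

```python
def lora_param_count(r, shapes):
    """Trainable LoRA params: r * (d_in + d_out) for each adapted matrix."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Per-layer projection shapes for LLaMA 3 8B (hidden=4096, MLP=14336, GQA KV dim=1024)
layer_shapes = [
    (4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096),  # q, k, v, o
    (4096, 14336), (4096, 14336), (14336, 4096),             # gate, up, down
]
total = 32 * lora_param_count(16, layer_shapes)  # 32 transformer layers
print(f"{total / 1e6:.1f}M trainable parameters")  # ≈ 41.9M, about 0.5% of 8B
```

This is why LoRA's optimizer-state overhead is negligible next to the base model: roughly 42M trainable parameters against 8B frozen ones.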
Key considerations for DeepSeek distilled models:
- Chat format: the distilled checkpoints ship with DeepSeek's own chat template rather than the vanilla LLaMA 3 or Qwen format, so load it from the model's tokenizer (tokenizer.apply_chat_template) instead of hard-coding a prompt layout.
- Reasoning preservation: when fine-tuning for specific domains, include some general reasoning examples in your training data to avoid catastrophic forgetting of R1’s reasoning capabilities.
- Sequence length: R1 distill models support up to 128K context during inference, but training at 512-2048 tokens is typically sufficient and much more VRAM-efficient.
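On the reasoning-preservation point, one simple approach is to blend a slice of general reasoning examples into your domain dataset before training. A minimal sketch; the function name and the 20% default are illustrative choices of ours, not a DeepSeek recommendation:

```python
import random

def mix_datasets(domain_examples, reasoning_examples,
                 reasoning_fraction=0.2, seed=0):
    """Blend general reasoning examples into a domain set to limit forgetting."""
    rng = random.Random(seed)
    # Sample enough reasoning examples to make up the requested fraction
    n_reasoning = int(len(domain_examples) * reasoning_fraction)
    mixed = domain_examples + rng.sample(reasoning_examples, n_reasoning)
    rng.shuffle(mixed)  # interleave so each batch sees both kinds
    return mixed

domain = [{"text": f"domain example {i}"} for i in range(100)]
reasoning = [{"text": f"reasoning example {i}"} for i in range(50)]
train_set = mix_datasets(domain, reasoning)  # 100 domain + 20 reasoning
```

Tune the fraction empirically: too little and reasoning degrades, too much and the domain adaptation weakens.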
Training Time and Cost
Timings assume QLoRA r=16, sequence length 512, effective batch size 32, and 3 epochs.
| GPU | 1K Examples | 10K Examples | Cost / 10K |
|---|---|---|---|
| RTX 4060 Ti (16 GB) | ~42 min | ~7 hrs | ~£0.70 |
| RTX 3090 (24 GB) | ~24 min | ~4 hrs | ~£0.60 |
| RTX 5090 (32 GB) | ~11 min | ~1.8 hrs | ~£0.63 |
| RTX 6000 Pro 96 GB | ~7 min | ~1.2 hrs | ~£1.44 |
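The effective batch size of 32 in these benchmarks is typically reached via gradient accumulation (for example, micro-batch 4 with 8 accumulation steps), and the run length in optimizer steps follows directly. A small helper, with names of our own choosing:

```python
import math

def training_steps(num_examples, epochs=3, micro_batch=4, grad_accum=8):
    """Optimizer steps for a run; effective batch = micro_batch * grad_accum."""
    effective_batch = micro_batch * grad_accum  # 32, matching the benchmarks
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

print(training_steps(10_000))  # 939 optimizer steps for the 10K-example runs
```

Dividing a table entry's wall-clock time by this step count gives you a per-step time you can extrapolate to your own dataset size.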
For extended timing data, see our fine-tuning time by GPU benchmarks. Browse all tutorials in the Tutorials category.
Conclusion
Fine-tuning the full DeepSeek V3/R1 is impractical for most users, but the distilled 7B and 8B variants fine-tune identically to standard Qwen and LLaMA models. QLoRA makes this possible on 16 GB GPUs for under £1 per 10K training examples. Use the distilled models to get DeepSeek R1 reasoning quality with domain-specific fine-tuning, all on affordable DeepSeek hosting hardware.
Fine-Tune DeepSeek on Dedicated GPUs
GPU servers pre-loaded with PyTorch, CUDA, and PEFT. Ready for LoRA fine-tuning in minutes.
Browse GPU Servers