Knowledge distillation trains a small student model (for example, 7B parameters) to mimic a large teacher model (70B, or a frontier API). For production workloads where you can't afford the teacher's cost or latency, distillation captures most of its quality at a fraction of the cost.
Generate teacher outputs on representative inputs (~10K-100K examples). Train the student via SFT (supervised fine-tuning) on (input, teacher-output) pairs, or via DPO (direct preference optimisation) on pairs that rank the teacher's output above the untuned student's. Quality: typically 90-95% of the teacher's; cost: roughly 10× cheaper. Pattern: train once with a frontier API teacher, then deploy the self-hosted student indefinitely.
Why distil
- Teacher model (Claude 3.7 / GPT-4o / Llama 3.3 70B) is too expensive to run at production scale
- Student model (Mistral 7B / Llama 3.1 8B) is fast and cheap but base quality is lower
- Distillation transfers task-specific quality from teacher to student
- Works particularly well for narrow, well-defined tasks (your specific workload, not general-purpose assistance)
Recipe
- Curate ~10K-100K representative inputs (your production prompts)
- Run teacher on each input; capture (input, teacher-output)
- Fine-tune the student with standard SFT: train it to produce the teacher's outputs given the inputs
- Optional: run DPO on preference pairs, with the teacher's output as the preferred response and the base student's output as the rejected one
- Evaluate student vs teacher on a held-out test set
- Deploy student self-hosted; teacher used only for distillation
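The data-preparation steps of the recipe can be sketched as follows. This is a minimal illustration, not a definitive pipeline: `teacher` and `student_base` are hypothetical callables standing in for whatever API client or local inference you use, and the record field names (`prompt`, `completion`, `chosen`, `rejected`, `reference`) are assumptions to adapt to your fine-tuning tool's expected schema.

```python
import json
import random
from typing import Callable


def build_distillation_sets(
    prompts: list[str],
    teacher: Callable[[str], str],        # hypothetical: wraps the teacher API call
    student_base: Callable[[str], str],   # hypothetical: wraps the untuned student
    held_out_fraction: float = 0.1,
    seed: int = 0,
) -> dict[str, list[dict]]:
    """Build SFT records, optional DPO pairs, and a held-out test split."""
    shuffled = prompts[:]
    random.Random(seed).shuffle(shuffled)

    # Reserve test prompts first so they never leak into training data.
    n_test = max(1, int(len(shuffled) * held_out_fraction))
    test_prompts, train_prompts = shuffled[:n_test], shuffled[n_test:]

    sft, dpo = [], []
    for prompt in train_prompts:
        teacher_out = teacher(prompt)
        sft.append({"prompt": prompt, "completion": teacher_out})

        # A preference pair is only informative when the two outputs differ.
        student_out = student_base(prompt)
        if student_out != teacher_out:
            dpo.append(
                {"prompt": prompt, "chosen": teacher_out, "rejected": student_out}
            )

    test = [{"prompt": p, "reference": teacher(p)} for p in test_prompts]
    return {"sft": sft, "dpo": dpo, "test": test}


def to_jsonl(records: list[dict]) -> str:
    """Serialise records one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Splitting off the held-out prompts before any teacher calls keeps the later student-vs-teacher comparison honest, since the student never trains on those inputs.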
Cost: teacher API calls during distillation are a one-time expense (~£100-1000), and the student fine-tune costs ~£10-50. Net: pay for the teacher once, then run the student cheaply from then on.
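To see when the one-time spend pays for itself, a rough break-even calculation helps. The sketch below assumes you know a per-request cost for both models. The per-request prices in the usage example are hypothetical, and the student's figure should include amortised hosting, which the numbers above do not cover.

```python
import math


def break_even_requests(
    one_time_cost: float,
    teacher_cost_per_request: float,
    student_cost_per_request: float,
) -> int:
    """Number of production requests after which distillation has paid off."""
    saving = teacher_cost_per_request - student_cost_per_request
    if saving <= 0:
        raise ValueError("student must be cheaper per request than the teacher")
    return math.ceil(one_time_cost / saving)


# Hypothetical inputs: £500 teacher calls + £50 fine-tune,
# teacher at £0.01 per request, student 10x cheaper at £0.001.
requests_needed = break_even_requests(550.0, 0.01, 0.001)
print(requests_needed)
```

With those illustrative prices, the investment is recovered after roughly sixty thousand requests; beyond that point, every request served by the student is a net saving.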
Results
Typical quality preservation:
- Narrow tasks (extraction, classification, summarisation): 95-98% of teacher quality
- Multi-task chatbot: 90-93% of teacher
- Complex reasoning: ~85-90% (the harder the task, the bigger the gap)
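One plausible way to compute a "percent of teacher quality" figure is the ratio of student to teacher scores on the same held-out examples. The sketch below assumes gold labels are available and uses exact match as the scoring function; both are assumptions, since the metric behind the figures above is not specified. For free-text tasks such as summarisation you would swap in a task-appropriate scorer.

```python
from statistics import mean
from typing import Callable


def exact_match(prediction: str, gold: str) -> float:
    """Score 1.0 when the prediction equals the gold label, else 0.0."""
    return float(prediction.strip() == gold.strip())


def quality_retention(
    student_outputs: list[str],
    teacher_outputs: list[str],
    gold_labels: list[str],
    score: Callable[[str, str], float] = exact_match,
) -> float:
    """Student mean score divided by teacher mean score on the same examples."""
    if not (len(student_outputs) == len(teacher_outputs) == len(gold_labels)):
        raise ValueError("all three lists must have the same length")

    teacher_score = mean(score(t, g) for t, g in zip(teacher_outputs, gold_labels))
    student_score = mean(score(s, g) for s, g in zip(student_outputs, gold_labels))
    if teacher_score == 0:
        raise ValueError("teacher scored zero; retention ratio is undefined")
    return student_score / teacher_score
```

A result of 0.95 would correspond to "95% of teacher quality" under this definition. Note that the ratio says nothing about whether the teacher itself is good enough for the task.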
Verdict
For narrow production tasks where you currently use a frontier API, distillation is one of the highest-ROI investments available. Roughly £500 of teacher cost plus a few hours of fine-tuning produces a self-hosted model about 10× cheaper than the teacher with most of its quality. Pattern: distil once, run indefinitely.
Bottom line
Distil for narrow production tasks. See Anthropic migration.