Data quality is the dominant factor in fine-tuning outcomes. ~1K-5K high-quality examples often beat 100K mediocre ones. Curation discipline matters more than dataset size.
Principles: quality over quantity (~1K-5K curated > 100K noisy), diversity (cover the realistic input distribution), correctness (every example exhibits the right behaviour), brand-voice consistency (for tone / style work). Sources: production logs (with curation), expert-written, synthetic from teacher LLM (with review). Process: curate iteratively; eval after each round.
Principles
- Quality > quantity: 2K well-curated examples typically beat 50K noisy ones
- Diversity: cover the realistic input distribution; over-represented templates leak into model behaviour
- Correctness: every output should exhibit exactly the behaviour you want; each bad example is a training signal for the bad behaviour
- Brand-voice consistency: for tone / style work, every output should be written in the same target voice
- Avoid leakage: don't include test prompts in the training set
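The diversity and leakage principles can be checked mechanically before any training run. The following is a minimal sketch, assuming each example is a dict with hypothetical `prompt` and `response` keys. It catches only exact matches after normalisation, so paraphrased test prompts would still need a fuzzier comparison or a manual pass.

```python
import re
from collections import Counter


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", prompt.strip().lower())


def remove_leaked(train: list[dict], test: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split train into (clean, leaked) by exact match on the normalized prompt."""
    test_keys = {normalize(ex["prompt"]) for ex in test}
    clean, leaked = [], []
    for ex in train:
        (leaked if normalize(ex["prompt"]) in test_keys else clean).append(ex)
    return clean, leaked


def opening_counts(examples: list[dict], n_words: int = 3) -> Counter:
    """Count response openings; one dominant opening hints at an over-represented template."""
    return Counter(
        " ".join(ex["response"].split()[:n_words]).lower() for ex in examples
    )


# Illustrative data only.
train = [
    {"prompt": "How do I reset my password?", "response": "Sure, happy to help with that."},
    {"prompt": "Cancel my subscription", "response": "Sure, happy to help you cancel."},
    {"prompt": "Where is my invoice?", "response": "Your invoice is under Billing."},
]
test = [{"prompt": "  how do I reset my   password? ", "response": "..."}]

clean, leaked = remove_leaked(train, test)
openings = opening_counts(clean)
print(len(clean), len(leaked))
print(openings.most_common(1))
```

Running the leakage check before each fine-tune round, not just the first, keeps newly added examples from drifting into the held-out set.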
Sources
- Production logs: real prompts + curated good responses. Highest realism; needs filtering for quality.
- Expert-written: SMEs write ideal input/output pairs. Highest quality; lowest scale.
- Synthetic from teacher LLM: distillation from a frontier-model API; needs human review for quality.
- User feedback: responses that users edited, with the edited version kept as the "ideal" output
- Hybrid: synthetic generation + expert review + production curation
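When several sources are mixed, it helps to record where each example came from and whether it has been reviewed. The sketch below is one possible layout, not a required schema: the `Example` class, its `source` and `reviewed` fields, and the source labels are all assumptions made for illustration.

```python
import json
from collections import Counter
from dataclasses import dataclass, asdict


@dataclass
class Example:
    prompt: str
    response: str
    source: str     # hypothetical labels: "production", "expert", "synthetic", "user_edit"
    reviewed: bool  # True once a human has checked the response


def to_jsonl(examples: list[Example]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(ex)) for ex in examples)


# Illustrative data only.
examples = [
    Example("Reset my password", "Go to Settings > Security.", "production", True),
    Example("Explain the refund policy", "Refunds are available within 30 days.", "expert", True),
    Example("Change billing email", "Open Billing and edit the contact.", "synthetic", False),
]

# Keep only reviewed examples, then inspect the source mix.
trainable = [ex for ex in examples if ex.reviewed]
mix = Counter(ex.source for ex in trainable)
print(mix)
print(to_jsonl(trainable))
```

Keeping the provenance on every record makes it possible to trace a failure mode seen in eval back to the source that introduced it.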
Process
- Define: what behaviour are we training? Specific use case + tone + format.
- Initial curation: ~500 high-quality examples covering the use case
- First fine-tune (~£10-30 of GPU time)
- Eval on held-out test set
- Identify failure modes; add ~500 examples targeting them
- Re-fine-tune; re-eval
- Iterate until quality bar met (typically 3-5 rounds)
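One way to turn the "identify failure modes; add ~500 examples" step into a concrete plan is to tag each failed eval case with a failure mode and split the next round's budget in proportion. This is a sketch under that assumption; the failure-mode labels and the `allocate_budget` helper are hypothetical, and proportional allocation is just one reasonable policy.

```python
from collections import Counter


def allocate_budget(failures: list[str], budget: int = 500) -> dict[str, int]:
    """Split a budget of new examples across failure modes, proportional to frequency.

    Rounding means the allocations may not sum exactly to the budget.
    """
    counts = Counter(failures)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: round(budget * n / total) for mode, n in counts.most_common()}


# Illustrative tags from one eval round on the held-out test set.
failures = ["wrong_format"] * 6 + ["off_tone"] * 3 + ["hallucinated_fact"] * 1

plan = allocate_budget(failures)
print(plan)
```

The held-out test set itself should stay fixed across rounds, so that scores from successive fine-tunes remain comparable.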
Verdict
Fine-tune data curation is the highest-leverage skill in custom-model production. Spend more time on curation than on model architecture choices. Iterate for 3-5 rounds; quality improves much faster this way than with a one-shot dump-and-train approach. Use an eval harness to drive curation priorities.
Bottom line
Quality > quantity; iterate. See LoRA guide.