For production fine-tuning, dataset versioning is as important as model versioning — and most teams skip it until they need to reproduce a result and can't. The right tooling makes versioning trivial; the wrong moment to figure this out is during an audit.
Use DVC (Data Version Control) or HuggingFace datasets with commit pinning. Store datasets in S3-compatible storage with content-addressed paths. Pin the dataset version in the fine-tuning config, so that reproducing from the config always produces the same model. For SOC 2 and regulatory audits, this is mandatory.
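As a rough sketch of the content-addressed approach, the snippet below hashes a dataset file and uploads it under a sha256-prefixed key. The bucket name and key layout are illustrative, not a fixed convention, and it assumes S3 credentials are already configured:

```python
import hashlib

import boto3  # assumes AWS credentials are configured in the environment


def upload_content_addressed(local_path: str, bucket: str = "training-data") -> str:
    """Upload a dataset file under a sha256-prefixed key and return its S3 URI."""
    digest = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    key = f"datasets/{digest.hexdigest()}/{local_path.rsplit('/', 1)[-1]}"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

Because the key is derived from the file's bytes, identical data always resolves to the same path and a modified dataset can never silently overwrite an earlier version; the returned URI is what the fine-tuning config should pin.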
Why version
- Reproducibility: same dataset + same config = same model
- Audit trail: regulator asks "what data was the model trained on?" — answer with confidence
- Debugging: model regressed; was it the dataset change or the config change?
- Right to erasure: under GDPR, show that the subject's data was removed before the next fine-tune
Tools
- DVC: git-style version control for data; integrates with S3 / GCS / Azure
- HuggingFace datasets: built-in versioning via dataset commit SHAs on the Hub (revision pinning is sketched after this list)
- LakeFS: git-for-data-lakes; useful for very large datasets
- S3 object versioning: simplest; combine with content-addressed paths (sha256-prefixed)
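For the Hub and DVC routes, pinning looks roughly like this; the repo names, tag, and revision placeholder below are hypothetical:

```python
from datasets import load_dataset
import dvc.api

# HuggingFace Hub: pin to an immutable commit SHA, never a branch like "main".
train_ds = load_dataset(
    "my-org/support-tickets",        # hypothetical dataset repo on the Hub
    split="train",
    revision="<commit-sha>",         # full commit SHA copied from the Hub
)

# DVC: resolve a file from a specific git revision of the data repo.
url = dvc.api.get_url(
    "data/train.jsonl",
    repo="https://github.com/my-org/datasets",  # hypothetical data repo
    rev="v1.2.0",                               # git tag or commit pinning the dataset
)
```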
Workflow
- Initial dataset committed to DVC / HF Hub with version tag
- Fine-tuning config references dataset by version (commit SHA, not branch); a reproduce-from-config sketch follows this list
- Train run: fine-tuning logs include dataset version + base-model version + config
- Model artefact tagged with all three (dataset / base / config)
- When dataset updated: new version, new fine-tune, new model version
- When subject requests data deletion: scrub from dataset, new version, retrain on schedule
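A minimal reproduce-from-config sketch, assuming a JSON config with hypothetical field names (dataset_repo, dataset_revision, base_model, base_revision, config_sha256); the actual trainer call (e.g. TRL's SFTTrainer) is omitted:

```python
import json
from pathlib import Path

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pinned config produced by the workflow above (field names are illustrative).
cfg = json.loads(Path("finetune_config.json").read_text())

# Every input is loaded by immutable revision, so rerunning this script with
# the same config reproduces the same training inputs.
dataset = load_dataset(cfg["dataset_repo"], split="train", revision=cfg["dataset_revision"])
tokenizer = AutoTokenizer.from_pretrained(cfg["base_model"], revision=cfg["base_revision"])
model = AutoModelForCausalLM.from_pretrained(cfg["base_model"], revision=cfg["base_revision"])

# ... fine-tune here, then write the full lineage next to the model artefact.
manifest = {
    "dataset": {"repo": cfg["dataset_repo"], "revision": cfg["dataset_revision"]},
    "base_model": {"name": cfg["base_model"], "revision": cfg["base_revision"]},
    "config_sha256": cfg.get("config_sha256"),
}
out_dir = Path("output")
out_dir.mkdir(exist_ok=True)
(out_dir / "training_manifest.json").write_text(json.dumps(manifest, indent=2))
```

With the manifest stored alongside the weights, "what data was this model trained on?" becomes a lookup rather than an investigation.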
Verdict
Dataset versioning is non-negotiable for any production fine-tuning. The tooling is mature; the cost is a one-time setup. The benefit is reproducibility, auditability, and GDPR compliance. Skip it and you'll wish you hadn't the first time you can't answer a regulator's question.
Bottom line
Version datasets like code. See SFTTrainer.