Synthetic data generation is the quiet workhorse of modern fine-tuning. For most teams the bottleneck is not model training but training data. A good-sized dedicated GPU on our hosting can generate tens of thousands of training examples per day using a strong open-weights LLM.
Pattern
A typical pipeline:
- Seed prompts or documents (your source material)
- Generator LLM writes (question, answer) pairs or (instruction, response) pairs
- Optional: critic LLM scores and filters
- Optional: deduplicate by embedding similarity
- Use the resulting dataset for SFT or DPO
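The steps above can be sketched as a minimal Python skeleton. The `generate_pair` and `critic_score` functions are hypothetical stand-ins for calls to your self-hosted generator and critic models (the names and return shapes are illustrative, not a real API); the dedup step here is exact-match only, with embedding-based dedup covered under Quality below.

```python
import hashlib

# Hypothetical stand-in: a generator LLM would write an (instruction,
# response) pair from each seed. Here we return canned text.
def generate_pair(seed: str) -> dict:
    return {"instruction": f"Explain: {seed}", "response": f"{seed} means ..."}

# Hypothetical stand-in: a critic LLM would score quality in [0, 1].
# Here we derive a deterministic fake score from the instruction text.
def critic_score(sample: dict) -> float:
    h = int(hashlib.md5(sample["instruction"].encode()).hexdigest(), 16)
    return (h % 100) / 100

# Placeholder dedup: exact match on the instruction field.
def dedupe_exact(samples: list, key: str = "instruction") -> list:
    seen, out = set(), []
    for s in samples:
        if s[key] not in seen:
            seen.add(s[key])
            out.append(s)
    return out

seeds = ["gradient descent", "gradient descent", "LoRA"]
samples = dedupe_exact([generate_pair(s) for s in seeds])
scored = sorted(((critic_score(s), s) for s in samples),
                key=lambda t: t[0], reverse=True)
keep = [s for _, s in scored[: int(len(scored) * 0.7)]]  # drop bottom ~30%
```

The resulting `keep` list is what you would write out as SFT or DPO training data.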
Models
| Use Case | Recommended Generator |
|---|---|
| Code data | Qwen Coder 32B |
| Reasoning / math | R1 Distill 32B |
| General instruction | Llama 3.3 70B or Qwen 2.5 72B |
| Preference pairs (chosen/rejected) | Same strong model generates both, differing prompts |
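For the preference-pair row, the idea is to prompt the same strong model twice: once for a careful answer (chosen) and once for a deliberately weak one (rejected). A sketch, where the prompt wording is illustrative and the record shape matches what most DPO trainers (e.g. TRL's `DPOTrainer`) expect:

```python
# Illustrative prompt templates; tune the wording for your domain.
CHOSEN_PROMPT = (
    "You are a meticulous expert. Answer thoroughly and note caveats.\n"
    "Question: {q}"
)
REJECTED_PROMPT = (
    "Answer in one short, vague sentence with no explanation.\n"
    "Question: {q}"
)

def build_dpo_record(question: str, chosen: str, rejected: str) -> dict:
    # prompt/chosen/rejected is the column layout DPO trainers typically use.
    return {"prompt": question, "chosen": chosen, "rejected": rejected}

record = build_dpo_record(
    "What does LoRA do?",
    chosen="LoRA adds trainable low-rank adapter matrices to frozen weights ...",
    rejected="It changes the model a bit.",
)
```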
Quality
Three high-leverage quality controls:
- Use a critic model to score each sample. Drop the bottom 20-40%.
- Deduplicate by sentence embedding – near-duplicates teach nothing.
- Hand-audit a random 100-sample subset per 10,000 generated. You will find systematic failures fast.
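The embedding dedup step can be sketched in plain Python. The bag-of-words `embed` below is a toy stand-in for a real sentence-embedding model (which is an assumption; in practice you would use a proper embedding checkpoint), but the thresholded cosine-similarity filter is the same shape either way:

```python
import math
import re

def embed(text: str) -> dict:
    # Toy bag-of-words vector; a real pipeline would use a sentence
    # embedding model here instead.
    vec = {}
    for w in re.findall(r"\w+", text.lower()):
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe(texts: list, threshold: float = 0.9) -> list:
    # Keep a sample only if it is not too similar to anything already kept.
    kept, vecs = [], []
    for t in texts:
        v = embed(t)
        if all(cosine(v, u) < threshold for u in vecs):
            kept.append(t)
            vecs.append(v)
    return kept

samples = [
    "What is gradient descent?",
    "what is gradient descent",   # near-duplicate
    "How does LoRA work?",
]
unique = dedupe(samples)
```

The threshold is worth tuning on your hand-audited subset: too low and you discard legitimately similar-but-distinct samples, too high and near-duplicates slip through.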
Throughput
Llama 3.3 70B INT4 on a 6000 Pro generating (instruction, response) pairs:
- Typical sample: 200 input + 400 output tokens
- Batch 16 throughput: ~400 tokens/sec aggregate
- Per-hour sample generation: ~2,400 samples
- Per-day (24h): ~57,600 samples
Enough for a solid instruction-tuning dataset in a single weekend run.
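The arithmetic behind the figures above, for plugging in your own token counts and measured throughput:

```python
tokens_per_sample = 200 + 400      # input + output tokens per pair
aggregate_tps = 400                # tokens/sec across the batch of 16

samples_per_sec = aggregate_tps / tokens_per_sample
per_hour = samples_per_sec * 3600  # ~2,400 samples/hour
per_day = per_hour * 24            # ~57,600 samples/day
```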
Self-Hosted Synthetic Data Pipeline
Generate training data on your own UK dedicated GPU without API costs or data leaving your server.
Browse GPU Servers. See DPO training and SFTTrainer.