Synthetic data generation is one of the most token-hungry workloads in modern NLP: a single instruction-tuning run burns 100M-1B tokens, and a classifier distillation dataset is often ten times larger. The RTX 5060 Ti 16GB on UK dedicated GPU hosting lets you run Llama 3.1 8B FP8 or Qwen 2.5 14B AWQ as a teacher model at fixed monthly cost – turning unbounded generation jobs into an overnight batch rather than a budget decision.
Why self-host the teacher
| Job size | Tokens | OpenAI gpt-4o-mini | Self-hosted 5060 Ti |
|---|---|---|---|
| Small SFT set | 50M | £23 | Fixed monthly |
| Medium distillation | 500M | £225 | Fixed monthly |
| Large instruct corpus | 5B | £2,250 | Fixed monthly |
| Continuous pretraining feed | 50B/mo | £22,500/mo | Fixed monthly |
The economics flip around 500M tokens per month; above that, dedicated hardware wins outright and you also remove ToS restrictions on training with the outputs.
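The break-even point falls straight out of the table: the £0.45 per million tokens implied by the gpt-4o-mini column against a flat monthly fee. A minimal sketch, where `MONTHLY_SERVER_COST_GBP` is a placeholder figure, not a quoted price – substitute your actual hosting rate:

```python
# Break-even sketch: API cost scales linearly with tokens, a dedicated
# server is a flat monthly fee. The £0.45/M rate is implied by the table
# above; the server cost is a placeholder assumption.
API_RATE_GBP_PER_M = 0.45          # £ per million output tokens
MONTHLY_SERVER_COST_GBP = 225.0    # hypothetical flat fee

def api_cost(tokens_per_month: float) -> float:
    """API spend in £ for a given monthly token volume."""
    return tokens_per_month / 1e6 * API_RATE_GBP_PER_M

def break_even_tokens() -> float:
    """Monthly token volume at which the flat fee matches API spend."""
    return MONTHLY_SERVER_COST_GBP / API_RATE_GBP_PER_M * 1e6

print(f"Break-even: {break_even_tokens() / 1e6:.0f}M tokens/month")
```

At these assumed rates the crossover lands at 500M tokens per month; every token past that point is effectively free on the dedicated box.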
Generation throughput
With vLLM continuous batching, Llama 3.1 8B FP8 sustains an aggregate 720 tokens/second at batch 32, so a 500M-token dataset completes in roughly 193 wall-clock hours – about eight days continuous, or two weeks at weekday-only operation. Qwen 2.5 14B AWQ trades half the throughput for stronger reasoning quality, which matters for hard instruction-following tasks.
| Teacher model | Throughput (batch-1 / aggregate) | Time for 500M tokens | Best for |
|---|---|---|---|
| Mistral 7B FP8 | 122 t/s b1 / ~800 agg | 174 h | Short completions |
| Llama 3.1 8B FP8 | 112 t/s b1 / 720 agg | 193 h | General SFT |
| Qwen 2.5 14B AWQ | 70 t/s b1 / ~320 agg | 434 h | Reasoning, code |
| Phi-3 mini FP8 | 285 t/s b1 / ~1,600 agg | 87 h | Simple labels |
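The time column in the table above is just tokens divided by aggregate throughput. A quick sketch for sizing your own jobs, using the aggregate figures quoted above:

```python
# Wall-clock estimator: hours = total tokens / aggregate tokens-per-second.
# Throughput values are the measured aggregates from the table above.
def generation_hours(total_tokens: float, agg_tokens_per_sec: float) -> float:
    """Continuous wall-clock hours to generate `total_tokens`."""
    return total_tokens / agg_tokens_per_sec / 3600

for model, tps in [("Llama 3.1 8B FP8", 720), ("Qwen 2.5 14B AWQ", 320)]:
    print(f"{model}: {generation_hours(500e6, tps):.0f} h for 500M tokens")
```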
Task recipes
- Instruction pairs – seed with topic plus persona, generate user turn then assistant turn with self-critique.
- Classifier training data – few-shot prompt per class with diversity constraints; hardest-negatives sampled from neighbouring classes.
- NER – generate a sentence plus inline span tags using JSON-schema guided output.
- RAG eval sets – given a document, produce answerable and unanswerable question pairs.
- Code-completion – Qwen Coder with docstring-to-implementation prompts.
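The instruction-pairs recipe can be sketched as a prompt builder that folds topic, persona, and the self-critique step into one chat request. The three-stage wording below is an illustrative assumption, not a fixed template; the message format matches the OpenAI-compatible API that vLLM serves:

```python
# Sketch of the instruction-pair recipe: seed (topic + persona) -> user
# turn -> assistant turn -> self-critique. Prompt wording is illustrative.
def build_turns(topic: str, persona: str) -> list[dict]:
    """Return chat messages for one generation pass (OpenAI-style format)."""
    return [
        {"role": "system",
         "content": f"You are {persona}. Write one realistic user question "
                    f"about {topic}, then answer it. Finally, critique your "
                    "answer in one sentence and revise it if needed."},
        {"role": "user", "content": f"Topic: {topic}"},
    ]

msgs = build_turns("UK VAT registration", "a small-business accountant")
```

Feeding the critique back into the same request keeps it to one pass per pair; splitting critique into a second call roughly halves throughput but lets you filter on the critique text itself.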
Quality control
Pair the teacher with a BGE-base embedding deduplicator (10,200 texts/sec on the same card – see embedding throughput) and a BGE-reranker-base filter (3,200 pairs/sec) to drop near-duplicates and low-relevance outputs. Target a 5-8% rejection rate; if it exceeds 20%, your prompt is under-constrained.
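The dedup half of that pipeline reduces to a greedy cosine-similarity pass over embeddings. A minimal sketch – in practice the vectors come from BGE-base, but `cosine` and the filter itself are model-agnostic, and the 0.82 threshold mirrors the `diversity_threshold` in the config below:

```python
# Greedy near-duplicate filter: keep an item only if it sits below the
# similarity threshold against everything already kept. Embeddings are
# assumed to come from an external model (e.g. BGE-base).
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def drop_near_duplicates(embs: list[list[float]],
                         threshold: float = 0.82) -> list[int]:
    """Return indices of items kept after greedy dedup."""
    kept: list[int] = []
    for i, e in enumerate(embs):
        if all(cosine(e, embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

The greedy pass is O(n·k) in kept items; at dataset scale you would swap the inner loop for an approximate nearest-neighbour index, but the accept/reject logic stays the same.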
Example YAML config
```yaml
teacher:
  model: meta-llama/Meta-Llama-3.1-8B-Instruct
  quant: fp8
  backend: vllm
  batch: 32
  max_new_tokens: 512
  temperature: 0.8
  top_p: 0.95
task:
  type: instruction_pairs
  seed_file: seeds.jsonl
  target_count: 100000
  diversity_threshold: 0.82  # cosine distance
quality:
  dedup_embedder: BAAI/bge-base-en-v1.5
  reranker: BAAI/bge-reranker-base
  min_reranker_score: 0.55
```
Unlimited synthetic data on Blackwell 16GB
Llama and Qwen teachers at fixed monthly cost. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: vLLM setup, FP8 Llama deployment, Qwen 14B benchmark, classification.