RAG performs only as well as its source data allows: garbage in, garbage retrieved. Source data quality gets less attention than embeddings and models, yet it is the bottleneck for many production RAG systems.
Three fix categories matter: cleaning (boilerplate removal, format normalisation, OCR error correction), deduplication (near-duplicate detection to remove redundant chunks), and structure extraction (preserving hierarchy, tables, and code blocks via document-aware processing). Expect to invest 30-50% of RAG project time here; it yields the best quality returns.
Common issues
- Boilerplate: nav menus, footers, cookie banners scraped into chunks
- OCR errors: PDFs scanned poorly; characters misread
- Format inconsistency: markdown / HTML / plaintext mixed
- Near-duplicates: same content in multiple sources / drafts
- Stale content: outdated information that should be removed or marked
- Structure loss: tables flattened to text; headings ignored; code blocks broken
- Cross-language confusion: mixed-language docs without language tags
Fixes
- Boilerplate removal: trafilatura or readability for HTML; custom regex for known site-specific patterns (sketched below)
- OCR quality check: per-page confidence thresholds; reject low-confidence pages from the index (sketched below)
- Deduplication: MinHash or SimHash to detect and drop near-duplicates (sketched below)
- Structure preservation: PaddleOCR PP-Structure for PDFs; a markdown-aware splitter for tech docs (sketched below)
- Freshness flagging: mark content with a last-updated date; bias retrieval toward recent chunks (sketched below)
- Language detection: tag chunks by language; route queries to language-specific embeddings if needed (sketched below)
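
A minimal sketch of the boilerplate step, assuming trafilatura is installed; the KNOWN_BOILERPLATE patterns are hypothetical stand-ins for whatever recurs in your sources.

```python
# pip install trafilatura
import re
import trafilatura

# Hypothetical site-specific patterns; extend per source.
KNOWN_BOILERPLATE = [
    re.compile(r"Accept all cookies.*$", re.MULTILINE),
    re.compile(r"^Share on (Twitter|LinkedIn).*$", re.MULTILINE),
]

def clean_html(raw_html: str) -> str | None:
    # trafilatura strips nav menus, footers, and cookie banners,
    # returning the main article text (or None if extraction fails).
    text = trafilatura.extract(raw_html)
    if text is None:
        return None
    for pattern in KNOWN_BOILERPLATE:
        text = pattern.sub("", text)
    return text.strip()
```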
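One way to implement the OCR confidence gate, assuming Tesseract via pytesseract; the 60-point cutoff is an arbitrary starting point to tune against a labelled sample of pages.

```python
# pip install pytesseract pillow  (plus the Tesseract binary)
import pytesseract
from PIL import Image

def page_ocr_confidence(image_path: str) -> float:
    # image_to_data returns per-word confidences; -1 marks non-word boxes.
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    return sum(confs) / len(confs) if confs else 0.0

def accept_page(image_path: str, threshold: float = 60.0) -> bool:
    # Reject low-confidence pages from the index rather than
    # embedding garbled text.
    return page_ocr_confidence(image_path) >= threshold
```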
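A sketch of near-duplicate removal with datasketch's MinHash and MinHashLSH; the word-trigram shingling and 0.9 threshold are assumptions to tune per corpus.

```python
# pip install datasketch
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    # Shingle on word trigrams so near-duplicates with small edits
    # still hash close together.
    words = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf8"))
    return m

def dedupe(chunks: dict[str, str], threshold: float = 0.9) -> list[str]:
    # Keep the first chunk seen; drop later chunks whose estimated
    # Jaccard similarity to a kept chunk exceeds the threshold.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for chunk_id, text in chunks.items():
        m = minhash_of(text)
        if lsh.query(m):  # near-duplicate of something already kept
            continue
        lsh.insert(chunk_id, m)
        kept.append(chunk_id)
    return kept
```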
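A bare-bones markdown-aware splitter that keeps the heading path with each chunk, so retrieval sees "Install > Linux" context rather than orphaned paragraphs. It does not track code fences (a `# comment` line inside a fence would be misread as a heading), so a production version needs fence-state handling.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown(doc: str) -> list[dict]:
    # Split on headings; attach the full heading path to each chunk.
    path: list[str] = []
    chunks: list[dict] = []
    buf: list[str] = []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        buf.clear()

    for line in doc.splitlines():
        m = HEADING.match(line)
        if m:
            flush()
            level = len(m.group(1))
            # Truncate the path to the parent level, then append.
            path[:] = path[: level - 1] + [m.group(2).strip()]
        else:
            buf.append(line)
    flush()
    return chunks
```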
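A sketch of recency biasing at rerank time; the (id, score, last_updated) hit shape, the 180-day half-life, and the alpha blend are all assumptions, and timestamps are assumed timezone-aware.

```python
import math
from datetime import datetime, timezone

def freshness_weight(last_updated: datetime, half_life_days: float = 180.0) -> float:
    # Exponential decay: a chunk half_life_days old scores 0.5,
    # twice that old scores 0.25, and so on.
    age_days = (datetime.now(timezone.utc) - last_updated).days
    return 0.5 ** (age_days / half_life_days)

def rerank(hits: list[tuple[str, float, datetime]], alpha: float = 0.2):
    # Blend similarity with recency; alpha controls how strongly
    # retrieval is biased toward recent content.
    return sorted(
        hits,
        key=lambda h: (1 - alpha) * h[1] + alpha * freshness_weight(h[2]),
        reverse=True,
    )
```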
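Language tagging with langdetect as one option; very short or code-heavy chunks can fail detection, hence the fallback tag.

```python
# pip install langdetect
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def tag_language(chunk: str) -> str:
    # Tag each chunk so queries can be routed to language-specific
    # embeddings (or a multilingual model) at retrieval time.
    try:
        return detect(chunk)  # e.g. "en", "de", "fr"
    except LangDetectException:
        return "unknown"      # too short or no alphabetic text
```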
Ongoing
- Re-ingestion pipeline runs periodically to catch source updates (a change-detection sketch follows this list)
- Eval harness measures retrieval quality; regressions point at data issues
- Per-source quality dashboard helps prioritise cleaning effort
- Schedule full re-ingest quarterly with updated cleaning rules
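
A minimal change-detection sketch for the re-ingestion pipeline; `seen` stands in for whatever fingerprint store the pipeline keeps between runs.

```python
import hashlib

def content_fingerprint(text: str) -> str:
    # Stable hash of whitespace-normalised content, so formatting-only
    # edits do not trigger re-ingestion.
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf8")).hexdigest()

def changed_docs(sources: dict[str, str], seen: dict[str, str]) -> list[str]:
    # Compare current fingerprints against those stored at last ingest;
    # only re-clean and re-embed documents whose content changed.
    return [
        doc_id for doc_id, text in sources.items()
        if seen.get(doc_id) != content_fingerprint(text)
    ]
```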
Verdict
For RAG quality, source data hygiene is one of the highest-ROI investments. Clean before embedding; dedupe to reduce noise; preserve structure to enable better retrieval. Allocate 30-50% of project time to data quality; it pays back across every downstream metric.
Bottom line
Garbage in, garbage retrieved. See chunking.