Generic RAG pipelines chunk every document the same way. Retrieval quality drops sharply on documents whose structure carries meaning, such as tables, code, and numbered clauses.
Per-type strategies: PDF → text extraction (PaddleOCR for scans) + structure-aware splitting. HTML → readability + CSS selector cleanup. Code → tree-sitter chunking. Tables → row-as-doc or columns-as-keywords.
By document type
- PDF (text-native): pdftotext + recursive splitter at section boundaries
- PDF (scanned): PaddleOCR or Mistral OCR → text → split
- HTML: readability → markdown → recursive splitter
- Code (Python, JS, etc.): tree-sitter to chunk by function / class
- Tables (CSV, Excel): row-as-doc with column headers prepended; or LLM-summarised tables
- Conversation logs: turn-aware splitting with speaker context
- Legal contracts: clause-level splitting, preserve numbering
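Three of the entries above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the function names (`rows_as_docs`, `chunk_python_defs`, `split_turns`) and the `max_chars` budget are made up for this example, and Python's built-in `ast` module stands in for tree-sitter, so the code chunker only handles Python source.

```python
import ast
import csv
import io


def rows_as_docs(csv_text: str) -> list[str]:
    """Row-as-doc: one chunk per CSV row, column header prepended to each value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["; ".join(f"{col}: {val}" for col, val in row.items()) for row in reader]


def chunk_python_defs(source: str) -> list[str]:
    """One chunk per top-level function or class (ast used here as a stand-in for tree-sitter)."""
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return [ast.get_source_segment(source, node) for node in tree.body if isinstance(node, wanted)]


def split_turns(turns: list[tuple[str, str]], max_chars: int = 200) -> list[str]:
    """Turn-aware splitting: pack whole turns into chunks, keeping the speaker on every line."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for speaker, text in turns:
        line = f"{speaker}: {text}"
        # Start a new chunk rather than cutting a turn in half.
        if current and size + len(line) > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

The common thread is that each splitter cuts only at boundaries the format itself defines (a row, a definition, a turn), so no chunk loses the header, signature, or speaker that gives it meaning.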
Verdict
Don't use one chunking strategy for all docs. Match the strategy to the type.
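One way to wire this up is a small dispatch table keyed on file extension. The sketch below is illustrative only: `STRATEGIES`, `chunk_document`, and the two toy splitters are hypothetical names, and a real pipeline would register the per-type splitters in place of these placeholders.

```python
from pathlib import Path
from typing import Callable


def split_paragraphs(text: str) -> list[str]:
    """Placeholder fallback: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def split_lines(text: str) -> list[str]:
    """Placeholder for row-oriented data: one chunk per non-empty line."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


# Hypothetical registry: map each extension to the splitter suited to that type.
STRATEGIES: dict[str, Callable[[str], list[str]]] = {
    ".csv": split_lines,
    ".md": split_paragraphs,
}


def chunk_document(filename: str, text: str) -> list[str]:
    """Pick a splitter by file extension, falling back to paragraph splitting."""
    splitter = STRATEGIES.get(Path(filename).suffix.lower(), split_paragraphs)
    return splitter(text)
```

Keeping the mapping in one table makes the fallback explicit, so an unrecognised type gets the generic splitter instead of failing.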
Bottom line
Per-type strategies improve RAG quality. See chunking strategies.