RAG for Different Document Types: PDF, HTML, Code, Tables

Different document types need different RAG strategies. Scanned PDFs need OCR, HTML needs boilerplate cleanup, code needs syntax-aware chunking, and tables need their own embedding strategy.

Table of Contents

  1. By document type
  2. Verdict

Generic RAG pipelines chunk every document the same way. Retrieval quality drops sharply on structured documents, where fixed-size chunks cut across sections, functions, and table rows.

TL;DR

Per-type strategies: PDF → pdftotext for text-native files or PaddleOCR for scans, then structure-aware splitting. HTML → readability + CSS selector cleanup. Code → tree-sitter chunking. Tables → row-as-doc or columns-as-keywords.

By document type

  • PDF (text-native): pdftotext + recursive splitter at section boundaries
  • PDF (scanned): PaddleOCR or Mistral OCR → text → split
  • HTML: readability → markdown → recursive splitter
  • Code (Python, JS, etc.): tree-sitter to chunk by function / class
  • Tables (CSV, Excel): row-as-doc with column headers prepended; or LLM-summarised tables
  • Conversation logs: turn-aware splitting with speaker context
  • Legal contracts: clause-level splitting, preserve numbering
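
Two of the items above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: `rows_as_docs` and `chunk_python_source` are hypothetical helper names, the sample data is made up, and the built-in `ast` module stands in for tree-sitter, so it only handles Python source.

```python
import ast
import csv
import io


def rows_as_docs(csv_text: str) -> list[str]:
    """Turn each CSV row into one document, with column headers prepended."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row in reader:
        # "header: value" pairs keep each row meaningful on its own.
        docs.append("; ".join(f"{col}: {val}" for col, val in row.items()))
    return docs


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    chunks = []
    for node in tree.body:
        if isinstance(node, wanted):
            # get_source_segment recovers the exact source text of the node.
            chunks.append(ast.get_source_segment(source, node))
    return chunks


# Made-up sample inputs for illustration only.
table = "name,role\nAda,engineer\nLin,analyst\n"
code = (
    "import os\n\n"
    "def load(path):\n    return open(path).read()\n\n"
    "class Index:\n    pass\n"
)

print(rows_as_docs(table))
print(chunk_python_source(code))
```

Each row document repeats the column names, so a retrieved row still says what its values mean. The code chunker drops the top-level `import` line and never splits a function or class body; a tree-sitter version would follow the same idea across languages.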

Verdict

Don't use one chunking strategy for all docs. Match the strategy to the type.
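
One way to match the strategy to the type is a small dispatcher keyed on file extension. The sketch below is an assumption about how this could be wired up: `STRATEGY_BY_SUFFIX`, `pick_strategy`, and the strategy labels are hypothetical names, and the extension alone cannot tell a text-native PDF from a scanned one, so that check would still have to happen downstream.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a strategy label.
STRATEGY_BY_SUFFIX = {
    ".pdf": "pdf",      # still needs a text-native vs scanned check
    ".html": "html",
    ".htm": "html",
    ".py": "code",
    ".js": "code",
    ".csv": "table",
    ".xlsx": "table",
}


def pick_strategy(filename: str, default: str = "generic") -> str:
    """Pick a chunking strategy label from the file extension."""
    suffix = Path(filename).suffix.lower()
    return STRATEGY_BY_SUFFIX.get(suffix, default)


for name in ["report.PDF", "src/app.js", "notes.txt"]:
    print(name, "->", pick_strategy(name))
```

Unknown extensions fall back to a generic splitter, so new document types degrade gracefully instead of failing.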

Bottom line

Per-type strategies improve RAG quality. See chunking strategies.

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.