Most RAG failures are chunking failures: the embedder can only find what the chunker produced. On dedicated GPU hosting, chunking itself costs almost nothing (it runs on CPU, or at most one short LLM pass), so the choice of strategy dominates downstream retrieval quality.
Fixed-Size
Split every 512 tokens. Simple, fast, mediocre: sentences are cut mid-thought and context is lost at boundaries. Use it as a baseline; it is rarely optimal.
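A minimal sketch of the window logic on an already-tokenized sequence. The overlap parameter is a common variant (not stated above), and the names are illustrative; a real pipeline would tokenize with the embedder's own tokenizer.

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Slide a fixed window over the token list, overlapping by `overlap`
    so a sentence cut at one chunk boundary survives intact in the next."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

The overlap trades index size for boundary robustness; with `overlap=0` you get the plain 512-token baseline.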
Semantic
Split where adjacent sentence embeddings diverge beyond a threshold. Chunks become topically coherent. Cost: compute embeddings for every sentence, run a distance threshold pass. On a dedicated GPU this is sub-second per document.
```python
from llama_index.core.node_parser import SemanticSplitterNodeParser

parser = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences grouped per comparison window
    breakpoint_percentile_threshold=95,  # split at the top 5% of embedding distances
    embed_model=your_embedder,
)
nodes = parser.get_nodes_from_documents(documents)
```
Recursive Structure-Aware
For Markdown or code, use document structure – split on headings, then paragraphs, then sentences, recursing until chunks fit size limits. LangChain’s RecursiveCharacterTextSplitter does this. For code, dedicated splitters (Python AST, tree-sitter) preserve function boundaries.
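The recursion described above can be sketched without the library: try the coarsest separator first, greedily merge pieces up to the size limit, and recurse with finer separators only when a piece is still too large. Parameter names here are illustrative, not LangChain's API.

```python
def recursive_split(text, max_len=512, seps=("\n\n", "\n", " ")):
    """Split on coarse separators first (paragraphs, then lines, then words),
    recursing to finer ones only for pieces that still exceed max_len."""
    if len(text) <= max_len:
        return [text]
    if not seps:  # no separator left: hard character cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    out, buf = [], ""
    for piece in text.split(seps[0]):
        cand = buf + seps[0] + piece if buf else piece
        if len(cand) <= max_len:
            buf = cand  # greedily merge small pieces into one chunk
            continue
        if buf:
            out.append(buf)
        buf = ""
        if len(piece) > max_len:
            out.extend(recursive_split(piece, max_len, seps[1:]))
        else:
            buf = piece
    if buf:
        out.append(buf)
    return out
```

For Markdown you would put heading markers ahead of `"\n\n"` in the separator list, which is essentially what `RecursiveCharacterTextSplitter` does with its defaults.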
Contextual Chunking
Anthropic’s contextual retrieval pattern: prepend each chunk with a 1-2 sentence, LLM-generated blurb situating it within the full document. The retrieval embedding is computed over context + chunk, not the chunk alone. Recall improvement: often 20-40%. Cost: one LLM call per chunk at index time. See contextual retrieval pipeline.
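The index-time step amounts to one summarization call per chunk, then embedding the concatenation. In this sketch, `summarize` is a hypothetical stand-in for that LLM call (Anthropic's prompt passes the whole document plus the chunk); the function names are illustrative.

```python
def contextualize(chunks, doc_title, summarize):
    """Prepend an LLM-generated situating context to each chunk.
    `summarize(doc_title, chunk)` stands in for one LLM call per chunk."""
    out = []
    for chunk in chunks:
        ctx = summarize(doc_title, chunk)   # 1-2 sentence situating summary
        out.append(f"{ctx}\n\n{chunk}")     # embed context + chunk together
    return out
```

Store the raw chunk alongside the contextualized text so the generator sees the original content while retrieval benefits from the added context.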
| Strategy | Relative Recall | Index Cost |
|---|---|---|
| Fixed 512 | Baseline | Negligible |
| Semantic | +5-10% | Embedder pass |
| Recursive structured | +5-15% | Negligible |
| Contextual | +20-40% | LLM per chunk |
RAG Pipelines Built Right
Chunking and retrieval configured for your content type on UK dedicated GPUs.
Browse GPU Servers. See contextual retrieval and hybrid search.