Contextual retrieval is the technique of prefixing each document chunk with a 1-2 sentence summary of what the chunk is about, generated by an LLM that can see the full document. It is one of the highest-leverage RAG improvements available, and on dedicated GPU hosting the index-time cost is reasonable.
Why It Wins
A chunk taken out of context loses information. Paragraph 7 of a legal memo referring to “the respondent” has no meaning on its own. Prepending “This chunk is from a breach-of-contract memo discussing the respondent’s counterclaim” makes the chunk independently retrievable. Recall typically jumps 20-40% on documents with deictic references.
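The legal-memo example above can be made concrete with a toy lexical-overlap check (not a real retriever; the strings and query terms are illustrative). The raw chunk shares no terms with the query, while the contextualised version matches all of them:

```python
import re

raw_chunk = "The respondent argues that the delay was excusable."
context = ("This chunk is from a breach-of-contract memo discussing "
           "the respondent's counterclaim.")
query_terms = {"breach", "contract", "counterclaim"}

def overlap(text, terms):
    # Count query terms that appear in the text (crude lexical match).
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & terms)

print(overlap(raw_chunk, query_terms))                   # 0 terms match
print(overlap(context + " " + raw_chunk, query_terms))   # all 3 terms match
```

Dense embeddings fail more gracefully than exact term matching, but the underlying problem is the same: the signal the query needs simply is not in the chunk until the context puts it there.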
Pipeline
For each document:
- Chunk it (semantic or recursive)
- For each chunk, prompt an LLM: “Given this full document, produce a 1-2 sentence context for this chunk.”
- Prepend the context to the chunk
- Embed the combined text
- Store both the raw chunk (for display) and the embedded version (for search)
# `llm` and `embedder` are stand-ins for your serving client and embedding model.
prompt = f"""
Here is the full document:
{full_doc}

Here is one chunk from it:
{chunk}

Write 1-2 sentences describing what this chunk is about in the context of the full document.
"""
context = llm.complete(prompt)           # generate the chunk's context
combined = f"{context}\n\n{chunk}"       # prepend the context to the chunk
embedding = embedder.encode(combined)    # embed the combined text
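The final pipeline step keeps two representations per chunk. A minimal in-memory sketch (the field names and the `add_to_index` helper are illustrative, not a prescribed schema):

```python
index = []

def add_to_index(chunk, context, vector):
    index.append({
        "raw": chunk,                               # shown to the user
        "embedded_text": f"{context}\n\n{chunk}",   # what the vector encodes
        "vector": vector,
    })

add_to_index("The respondent argues...",
             "From a breach-of-contract memo.",
             [0.1, 0.2])
```

Serving the raw chunk keeps the displayed text clean; the prepended context only needs to exist at embedding time.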
Cost
Llama 3 8B INT8 on a 5080 generating contexts:
- ~100 input tokens (chunk) + ~500 input tokens (document summary) + ~40 output tokens
- Time per chunk: ~1-2 seconds
- 10,000 chunks: ~3-5 hours
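The 3-5 hour figure follows directly from the per-chunk latency; a quick back-of-envelope check:

```python
seconds_per_chunk = (1.0, 2.0)  # the ~1-2 s range quoted above
chunks = 10_000

low, high = (chunks * s / 3600 for s in seconds_per_chunk)
print(f"{low:.1f}-{high:.1f} hours")  # roughly the ~3-5 hour estimate
```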
For a one-time indexing run this is cheap. For continuously changing corpora, amortise the cost by only recomputing context when a document changes.
Tips
- Use a smaller, faster LLM (Llama 8B, Qwen 7B) – the task is easy
- Cache document summaries once, reuse for every chunk of that document
- Combine with prompt caching on your serving engine for 2-3x speed-up
- Evaluate on a held-out set – if gains are under 10%, your chunking already captures context
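The second tip (summarise each document once, reuse for every chunk) can be as simple as memoising the summary call. Here `doc_summary` is a stand-in for the real LLM request; the counter just demonstrates that the expensive call runs once per document, not once per chunk:

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the (expensive) summariser runs

@lru_cache(maxsize=None)
def doc_summary(doc_text):
    calls["n"] += 1          # stand-in for one LLM round-trip
    return doc_text[:200]    # pretend summary

doc = "A long breach-of-contract memo ..."
for chunk in ["chunk 1", "chunk 2", "chunk 3"]:
    ctx_source = doc_summary(doc)  # same document -> served from cache

print(calls["n"])  # 1: the summariser ran once for three chunks
```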