For workloads with input that exceeds practical context windows (long documents, multi-doc analysis, extended conversations), several patterns manage context efficiently. Picking the right pattern matters — naive long-context inference is expensive and quality drops at the longest tails.
Five patterns, plus the native long-context baseline: chunked summarisation (recursive summarise-then-summarise), context compression (LLMLingua-style removal of redundant tokens), sliding window (keep the most recent N tokens; suits ongoing conversation), hierarchical RAG (multi-level retrieval), and extract-then-answer (extract relevant facts, then answer from the facts). Pick by use case.
Approaches
- Chunked summarisation: split the long input, summarise each chunk, then summarise the summaries, recursing until the result fits the budget (sketched below). Works well for documents that compress cleanly.
- Context compression: LLMLingua and similar tools remove low-importance tokens, typically achieving ~50-70% compression with minor quality loss (usage sketched below).
- Sliding window: keep the last N tokens of the conversation (sketched below). Simple, but silently loses early context.
- Hierarchical RAG: retrieve at multiple granularities (paragraph + section + document) and pass only the relevant levels to the LLM (sketched below).
- Extract-then-answer: a small LLM extracts relevant facts from the long context; the main LLM answers from those facts (sketched below). Two-stage, but cheaper than long-context inference.
- Native long context: pass the full input to a long-context model, e.g. Llama 3.1 8B's 128K window. The most expensive option, but quality holds best.
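Minimal sketches of each pattern follow, in Python. First, chunked summarisation; `llm` and `count_tokens` are hypothetical stand-ins for your model client and tokenizer, not the API of any particular library.

```python
def count_tokens(text: str) -> int:
    # Whitespace proxy for illustration; swap in your real tokenizer.
    return len(text.split())

def llm(prompt: str) -> str:
    # Stand-in for your model call (OpenAI client, vLLM, etc.).
    raise NotImplementedError

def chunk(text: str, max_tokens: int = 2000) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_tokens."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and count_tokens(current + para) > max_tokens:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def summarise(text: str, target_tokens: int = 500) -> str:
    """Summarise each chunk, then recurse on the joined summaries
    until the result fits target_tokens."""
    if count_tokens(text) <= target_tokens:
        return text
    summaries = [
        llm(f"Summarise the following, preserving key facts:\n\n{c}")
        for c in chunk(text)
    ]
    return summarise("\n\n".join(summaries), target_tokens)
```

Each recursion level discards detail, which is why the comparison below rates this pattern lossy.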
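Context compression via LLMLingua; the snippet mirrors the usage shown in the project README (github.com/microsoft/LLMLingua), but the API has shifted between releases, so treat the exact signature as an assumption and check your installed version. The file path and question are placeholders.

```python
from llmlingua import PromptCompressor

# Loads a small LM that scores per-token importance (downloads on first use).
compressor = PromptCompressor()

long_context = open("report.txt").read()  # placeholder input
result = compressor.compress_prompt(
    long_context,
    instruction="Answer the question from the context.",
    question="What drove the change in Q3 revenue?",  # placeholder
    target_token=2000,  # token budget for the compressed context
)
prompt_for_main_model = result["compressed_prompt"]
```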
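Sliding window is near-trivial. A sketch for chat-style message lists, pinning the system prompt and dropping the oldest turns first; token counting is again a whitespace proxy.

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # proxy; use your tokenizer in production

def window(messages: list[dict], budget: int = 4000) -> list[dict]:
    """messages[0] is the system prompt and is always kept;
    the remaining turns are retained newest-first until the budget fills."""
    system, turns = messages[0], messages[1:]
    kept, used = [], count_tokens(system["content"])
    for msg in reversed(turns):  # walk newest-to-oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]  # restore chronological order
```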
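For hierarchical RAG, one way to get two granularities (a sketch under assumed data structures, not a full pipeline): score paragraphs against the question, then expand each hit to its parent section so the model sees local detail plus surrounding context. `embed` is a hypothetical embedding call, and a real index would precompute the paragraph vectors rather than embedding per query.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for your embedding model; assume unit-norm output vectors.
    raise NotImplementedError

def retrieve(question: str, sections: list[dict], k: int = 3) -> str:
    """sections: [{'title': str, 'paragraphs': [str, ...]}, ...]."""
    q = embed(question)
    # Fine granularity: score every paragraph against the question.
    scored = sorted(
        ((float(embed(p) @ q), si)
         for si, sec in enumerate(sections)
         for p in sec["paragraphs"]),
        reverse=True,
    )
    # Coarse granularity: expand the top-k hits to their parent sections.
    hits = sorted({si for _, si in scored[:k]})
    return "\n\n".join(
        sections[si]["title"] + "\n" + "\n".join(sections[si]["paragraphs"])
        for si in hits
    )
```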
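And extract-then-answer. `small_llm` and `main_llm` are hypothetical stand-ins for a cheap extraction model and the primary model; the NONE sentinel is an illustrative convention, not a standard.

```python
def small_llm(prompt: str) -> str:
    raise NotImplementedError  # cheap extraction model, e.g. a 1-3B

def main_llm(prompt: str) -> str:
    raise NotImplementedError  # the primary answering model

def answer(question: str, chunks: list[str]) -> str:
    # Stage 1: the small model distils question-relevant facts per chunk.
    facts = []
    for c in chunks:
        out = small_llm(
            f"Question: {question}\n\n"
            "List only facts from the text below that help answer it. "
            f"Reply NONE if there are none.\n\n{c}"
        )
        if out.strip().upper() != "NONE":
            facts.append(out)
    # Stage 2: the main model answers from the distilled facts alone.
    return main_llm(
        "Answer the question using only these facts:\n\n"
        + "\n".join(facts)
        + f"\n\nQuestion: {question}"
    )
```

Only the cheap model ever sees the full context; the expensive model sees a short list of facts, which is where the cost advantage comes from.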
Comparison
| Pattern | Cost | Quality on long context | Implementation effort |
|---|---|---|---|
| Chunked summarisation | Low | Lossy | Simple |
| Context compression | Medium | Moderate loss | Simple (library available) |
| Sliding window | Lowest | Loses early context | Trivial |
| Hierarchical RAG | Medium | Strong | Complex |
| Extract-then-answer | Medium | Strong | Moderate |
| Native long context | Highest | Best | Trivial |
Verdict
For long-context production workloads, hierarchical RAG and extract-then-answer typically beat naive long-context inference on cost/quality balance. Native long context is the simplest fallback, but the most expensive. Pick by the specific use case: documents compress differently than conversations.
Bottom line
Hierarchical RAG or extract-then-answer for cost; native long context for premium quality. See long-context VRAM.