For RAG with long retrieved context (10K+ tokens), distilling the context down to a focused 2-3K tokens before the final answer-generation LLM call often improves quality (a denser signal) and reduces cost (a smaller context for the premium model).
Pattern: a small LLM (Phi-3 Mini / Mistral 7B) reads the long retrieved context plus the question and outputs a focused 2-3K token distillation. A premium model (Mistral Small 3 / Llama 3.3 70B) then generates the final answer from the distilled context. Net: better quality on hard questions; lower premium-model cost. Two-stage, but still cheaper than long-context premium inference.
How it works
- Retrieve top-K chunks (typical RAG): 5K-15K tokens of context
- Stage 1: small LLM (Phi-3 Mini) takes (retrieved chunks + question); outputs focused distillation: relevant facts, key quotes, structure (~1-3K tokens)
- Stage 2: premium LLM (Mistral Small 3 / Llama 3.3 70B) takes (focused distillation + question); generates final answer
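A minimal sketch of these two stages, assuming an OpenAI-compatible endpoint serving both models; the base_url, model names, and prompt wording below are placeholders, not a fixed recipe:

```python
# Two-stage distill-then-answer sketch. Assumes an OpenAI-compatible
# endpoint serving both models; base_url and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

DISTILL_PROMPT = (
    "Extract only the facts, key quotes, and structure from the context "
    "that are relevant to the question. Do not answer the question.\n\n"
    "Question: {question}\n\nContext:\n{context}"
)

def distill(question: str, chunks: list[str]) -> str:
    """Stage 1: small LLM compresses 5K-15K tokens of retrieved context
    into a focused ~1-3K token distillation."""
    resp = client.chat.completions.create(
        model="phi-3-mini",  # placeholder name for the small model
        messages=[{
            "role": "user",
            "content": DISTILL_PROMPT.format(
                question=question, context="\n\n---\n\n".join(chunks)),
        }],
        max_tokens=3000,  # hard cap keeps the stage 2 context small
    )
    return resp.choices[0].message.content

def answer(question: str, distilled: str) -> str:
    """Stage 2: premium LLM answers from the distilled context only."""
    resp = client.chat.completions.create(
        model="mistral-small-3",  # placeholder name for the premium model
        messages=[{
            "role": "user",
            "content": (f"Answer using only this context:\n{distilled}"
                        f"\n\nQuestion: {question}"),
        }],
    )
    return resp.choices[0].message.content

def rag_answer(question: str, chunks: list[str]) -> str:
    return answer(question, distill(question, chunks))
```

The `max_tokens` cap in stage 1 is the load-bearing choice: it bounds the premium model's input regardless of how much was retrieved.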
When useful
- Long retrieved context (10K+ tokens)
- Premium-model final answer (the premium model anchors per-query cost)
- Multi-doc questions where signal-to-noise matters
- Cases where naive long-context premium inference is too expensive
Don't use it for short-context RAG; the added latency and complexity of two stages aren't earned.
Verdict
Context distillation is one of the highest-ROI patterns for production RAG on premium models. Better quality on hard questions; a meaningful cost saving on premium-model inference. The stage 1 small-LLM cost is negligible next to the stage 2 premium savings (see the arithmetic below). Worth implementing for long-context RAG at anything above modest scale.
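As a rough sanity check on that claim, back-of-envelope arithmetic with hypothetical per-million-token rates; the prices below are illustrative placeholders, not any provider's actual pricing:

```python
# Back-of-envelope cost comparison. All rates are hypothetical $/1M tokens.
SMALL_IN = SMALL_OUT = 0.10           # stage 1 small model (assumed rate)
PREMIUM_IN, PREMIUM_OUT = 2.00, 6.00  # stage 2 premium model (assumed rate)

ctx, distilled, ans = 12_000, 2_500, 500  # token counts per query

naive = (ctx * PREMIUM_IN + ans * PREMIUM_OUT) / 1e6
two_stage = (ctx * SMALL_IN + distilled * SMALL_OUT        # stage 1
             + distilled * PREMIUM_IN + ans * PREMIUM_OUT  # stage 2
             ) / 1e6
print(f"naive: ${naive:.4f}/query  two-stage: ${two_stage:.4f}/query")
# -> naive: $0.0270/query  two-stage: $0.0095/query (under these assumptions)
```

Under these assumed rates, stage 1 adds ~$0.0015 per query while cutting the premium input bill from $0.024 to $0.005; the exact ratio shifts with real prices, but the structure of the saving doesn't.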
Bottom line
Two-stage for long-context premium RAG. See context strategies.