Long documents (50+ pages) don't fit in a single LLM context window, so they can't be summarised in one call. The standard pattern is map-reduce: chunk the document, summarise each chunk, then summarise the summaries.
For a self-hosted summarisation API: Llama 3.1 8B FP8 on an RTX 5090. Chunk to 4K tokens with 200-token overlap, map-summarise each chunk in parallel, reduce-summarise the chunk summaries. Roughly 10 seconds end-to-end for a 100-page document.
The map-reduce pattern
- Split document into 4K-token chunks with 200-token overlap
- For each chunk, ask the LLM to extract key points
- Concatenate all chunk summaries
- Ask the LLM to write a final summary from the concatenated summaries (the full pipeline is sketched below)
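The whole loop fits in a short script. Here is a minimal sketch, assuming a local OpenAI-compatible server (vLLM and llama.cpp both expose one); the base URL, model name, prompts, and the word-count token approximation are illustrative assumptions, not part of any fixed API.

```python
"""Minimal map-reduce summarisation sketch.

Assumes a local OpenAI-compatible endpoint (e.g. vLLM serving
Llama 3.1 8B FP8). BASE_URL, MODEL, and the prompts are placeholders.
"""
import requests

BASE_URL = "http://localhost:8000/v1"  # hypothetical local endpoint
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
CHUNK_TOKENS, OVERLAP_TOKENS = 4096, 200

def chunk(text: str, size: int = CHUNK_TOKENS,
          overlap: int = OVERLAP_TOKENS) -> list[str]:
    # Words as a rough token proxy; swap in the model's real tokeniser
    # for exact 4K/200 boundaries.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def complete(prompt: str, max_tokens: int = 400) -> str:
    r = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0.2,
        },
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def summarise(document: str) -> str:
    # Map: extract key points from each chunk independently.
    partials = [complete(f"Extract the key points from this excerpt:\n\n{c}")
                for c in chunk(document)]
    # Reduce: one final pass over the concatenated chunk summaries.
    notes = "\n\n".join(partials)
    return complete(f"Write a coherent summary from these notes:\n\n{notes}",
                    max_tokens=800)
```

The map calls run sequentially here for clarity; the async variant under Hardware sizing shows the parallel version.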
Two parameters matter: chunk size (smaller means more chunks and more parallelism but a heavier reduce step; larger means fewer chunks but more detail compressed away per summary) and reduce strategy (single-pass below ~20 chunks, hierarchical at 20+).
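Past the single-pass threshold, the reduce step itself becomes hierarchical: fold groups of summaries, then recurse on the folded list. A hedged sketch reusing complete() from the pipeline above; the group size of 10 and the threshold of 20 are illustrative knobs, not tuned values:

```python
GROUP_SIZE = 10  # illustrative; size groups so each fits one reduce prompt

def reduce_summaries(summaries: list[str]) -> str:
    if len(summaries) == 1:
        return summaries[0]
    if len(summaries) < 20:  # single-pass reduce below the threshold
        notes = "\n\n".join(summaries)
        return complete(f"Write a coherent summary from these notes:\n\n{notes}",
                        max_tokens=800)
    # Hierarchical reduce: collapse each group of GROUP_SIZE summaries
    # into one, then recurse on the ~10x smaller list.
    groups = [summaries[i:i + GROUP_SIZE]
              for i in range(0, len(summaries), GROUP_SIZE)]
    folded = [complete("Merge these notes into one summary:\n\n" + "\n\n".join(g))
              for g in groups]
    return reduce_summaries(folded)
```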
Model picks
- Llama 3.1 8B: default, good balance
- Mistral 7B: stronger on extractive summarisation
- Qwen 2.5 14B: better on hard reasoning summaries
- Long-context Phi-3 (128K): skip map-reduce entirely if the document is under 128K tokens (branch sketched below)
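If you take the long-context route, the pipeline collapses to a single call behind a length check. A sketch under the same assumptions as the pipeline above; the word-based length check is approximate, and a real check should leave headroom for the prompt and output tokens:

```python
LONG_CONTEXT_LIMIT = 128_000  # matches the Phi-3 128K variant

def summarise_any(document: str) -> str:
    # Word count as a rough token proxy; assumes MODEL points at a
    # 128K-context model when this branch is taken.
    if len(document.split()) < LONG_CONTEXT_LIMIT:
        return complete(f"Summarise this document:\n\n{document}",
                        max_tokens=800)
    return summarise(document)  # fall back to map-reduce
```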
Hardware sizing
- ~5 chunks/second on a 5090 with Mistral 7B FP8
- 100-page doc (~70K tokens → ~20 chunks at 4K): ~4 seconds map at 5 chunks/second + ~5 seconds reduce
- For a high-volume summarisation API, batch chunks across requests for amortised throughput (async sketch below)
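Concurrency is what lets the server's continuous batching fill the GPU: fire the map requests in parallel and let the scheduler pack them into batches. A sketch using asyncio and httpx; the concurrency cap of 32 is an illustrative knob, and BASE_URL/MODEL are the same placeholders as in the pipeline sketch:

```python
import asyncio
import httpx

MAX_CONCURRENT = 32  # illustrative cap; tune to the server's batch capacity

async def map_chunks(chunks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with httpx.AsyncClient(base_url=BASE_URL, timeout=120) as client:
        async def one(c: str) -> str:
            async with sem:  # cap in-flight requests
                r = await client.post("/chat/completions", json={
                    "model": MODEL,
                    "messages": [{"role": "user",
                                  "content": f"Extract the key points:\n\n{c}"}],
                    "max_tokens": 400,
                    "temperature": 0.2,
                })
                r.raise_for_status()
                return r.json()["choices"][0]["message"]["content"]
        # gather preserves input order, so the reduce step sees
        # summaries in document order.
        return list(await asyncio.gather(*(one(c) for c in chunks)))
```

Run it with asyncio.run(map_chunks(chunk(document))) and feed the result to the reduce step.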
Verdict
Summarisation is one of the cheapest AI workloads to self-host — short outputs, predictable cost, well-suited to batch processing.
Bottom line
Map-reduce summarisation on a 5090 handles enterprise document workflows comfortably. See RAG architecture for the related retrieval pattern.