Summarisation is one of the highest-value LLM workloads: meetings, long documents, emails, research papers. On our hosting, the RTX 5060 Ti 16GB handles realistic input lengths.
Input Lengths That Fit
| Config | Max input (tokens) | Approx. words |
|---|---|---|
| Llama 3 8B FP8 + FP8 KV | 65,536 | ~49k |
| Qwen 2.5 14B AWQ + FP8 KV | 32,768 | ~25k |
| Mistral Nemo 12B FP8 | 24,576 | ~18k |
| Qwen 2.5 7B AWQ + YaRN | 128,000 | ~95k |
Most real documents (a 1-2 hour meeting transcript, a research paper, a long email thread) fit in 32k tokens. For books or full contract suites, 128k on Qwen 7B + YaRN is the right tool.
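A quick way to check whether a document fits a given config is to estimate tokens from the word count. This sketch assumes roughly 1.33 tokens per English word (the ~0.75 words-per-token ratio the table uses); a real tokenizer gives exact counts.

```python
# Rough fit check: estimate tokens from words and leave headroom
# for the prompt template and the generated summary.

TOKENS_PER_WORD = 1.33  # assumption; run the model's tokenizer for exact numbers

def fits(word_count: int, max_input_tokens: int, reserve: int = 1024) -> bool:
    """True if the document plus a reserve for prompt/output fits the window."""
    est_tokens = int(word_count * TOKENS_PER_WORD)
    return est_tokens + reserve <= max_input_tokens

# A 2-hour meeting at ~150 wpm is roughly 18,000 words:
print(fits(18_000, 32_768))    # fits the 32k configs
print(fits(95_000, 32_768))    # book-length: needs the 128k config
```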
Models
- Default: Llama 3.1 8B FP8 for 32k – fastest at good quality
- Quality priority: Qwen 2.5 14B AWQ – better reasoning on complex content
- Long context: Qwen 2.5 7B with YaRN – extends the native 32k window to 128k
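As a sketch of how these configs map to serve commands, assuming vLLM as the serving stack (flag names and the rope-scaling JSON keys vary between vLLM versions, so check yours):

```shell
# Default config: FP8 weights + FP8 KV cache, with prefix caching on.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-model-len 65536

# Long-context config: Qwen 2.5 7B AWQ with YaRN scaling the 32k window 4x.
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --rope-scaling '{"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}' \
  --enable-prefix-caching \
  --max-model-len 131072
```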
Long-Doc Strategies
- Single-shot: if it fits in context, easiest and best quality
- Map-reduce: chunk -> summarise each -> summarise the summaries
- Sliding window: fixed window of recent content, rolling summary
- RAG-style: retrieve most relevant chunks for a specific question
Single-shot wins on quality when it fits. Map-reduce is the right fallback for anything above your model’s context window.
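The map-reduce fallback above can be sketched in a few lines. The `summarise` callable is a placeholder for whatever client you use (e.g. an OpenAI-compatible endpoint); it is injected here so the control flow stands on its own.

```python
# Map-reduce summarisation: chunk -> summarise each -> summarise the summaries.
# Chunks overlap slightly so sentences at chunk edges keep their context.
from typing import Callable

def chunk_words(text: str, chunk_size: int = 6000, overlap: int = 200) -> list[str]:
    """Split on word boundaries into overlapping chunks."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

def map_reduce_summarise(text: str, summarise: Callable[[str], str],
                         chunk_size: int = 6000) -> str:
    chunks = chunk_words(text, chunk_size)
    if len(chunks) == 1:                        # fits: single-shot, best quality
        return summarise(chunks[0])
    partials = [summarise(c) for c in chunks]   # map step
    return summarise("\n\n".join(partials))     # reduce step
```

Pick `chunk_size` so each chunk plus the prompt stays well inside the model's window; the reduce step then sees only the short partial summaries.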
Prompt Templates
SYSTEM: You are a precise summariser. Output in structured Markdown with
sections: Key Points, Decisions, Action Items, Risks.
USER: Summarise the following text:
---
[document]
---
Add “Only include facts present in the source. Do not invent.” for factually tight domains.
Enable prefix caching – same template across many documents means the system prompt’s KV cache hits every time.
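The template above, expressed as chat messages for an OpenAI-compatible endpoint, might look like this sketch. Keeping the system prompt byte-identical across documents is what makes the prefix-cache hit reliable.

```python
# The system prompt is a constant so its KV cache is reused across requests.
SYSTEM = ("You are a precise summariser. Output in structured Markdown with "
          "sections: Key Points, Decisions, Action Items, Risks.")

def build_messages(document: str) -> list[dict]:
    """Wrap a document in the shared summarisation template."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user",
         "content": f"Summarise the following text:\n---\n{document}\n---"},
    ]
```

For factually tight domains, append the "Do not invent." instruction to `SYSTEM` itself rather than per-request, so the cached prefix stays identical.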
Summarisation on Blackwell 16GB
32k-128k context, fast and private. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: 128k context, long-context perf, document Q&A, webinar transcription.