Document summarisation has a different throughput profile from chat: longer inputs, smaller outputs, and batch-friendly requests. Here are the numbers.
For map-reduce summarisation of 50-page documents using Llama 3.1 8B FP8: the RTX 5060 Ti hits ~120 docs/hour, the 5090 ~280/hour, and the 6000 Pro ~340/hour. Cost per 1,000 documents ranges from ~£1.50 to ~£4.50.
## Setup
- Llama 3.1 8B FP8 via vLLM
- 50-page input documents (~25K tokens)
- 4K-token chunks with 200-token overlap
- Map step: 250-token summary per chunk
- Reduce step: 500-token final summary
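The chunking parameters above imply seven map calls per document: with 4K-token chunks and 200-token overlap, each chunk starts 3,800 tokens after the previous one. A minimal sketch of that chunker (a hypothetical illustration, not the benchmark harness; `chunk_tokens` is an assumed name):

```python
def chunk_tokens(tokens, size=4000, overlap=200):
    """Split a token list into overlapping chunks for the map step."""
    step = size - overlap  # each chunk starts 3,800 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk already covers the tail of the document
    return chunks

doc = list(range(25_000))       # stand-in for a ~50-page document's token ids
print(len(chunk_tokens(doc)))   # 7 map calls per document
```

Each of those seven chunks gets a 250-token map summary; the reduce step then condenses the concatenated map outputs into the 500-token final summary.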
## Results
| GPU | Docs/hour | Cost per 1,000 docs |
|---|---|---|
| RTX 5060 Ti | ~120 | £1.95 |
| RTX 3090 | ~145 | £1.71 |
| RTX 4090 | ~190 | £2.04 |
| RTX 5080 | ~210 | £1.51 |
| RTX 5090 | ~280 | £1.78 |
| RTX 6000 Pro | ~340 | £4.49 |
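The table's two columns pin down the hourly rental rate each cost figure assumes, since cost per 1,000 docs × docs/hour ÷ 1,000 = £/hour. A quick back-calculation (derived from the table, not quoted rental prices):

```python
# (docs/hour, £ per 1,000 docs) from the results table
gpus = {
    "RTX 5060 Ti": (120, 1.95),
    "RTX 3090":    (145, 1.71),
    "RTX 4090":    (190, 2.04),
    "RTX 5080":    (210, 1.51),
    "RTX 5090":    (280, 1.78),
    "RTX 6000 Pro": (340, 4.49),
}

for name, (docs_per_hour, cost_per_1000) in gpus.items():
    # Implied hourly rate: pounds per 1,000 docs scaled by hourly output
    hourly = cost_per_1000 * docs_per_hour / 1000
    print(f"{name}: ~£{hourly:.2f}/hour implied")
```

This makes the 5080's position easy to sanity-check: its implied rate (~£0.32/hour) buys more documents per pound than any other card in the table.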
## Verdict
For high-volume summarisation, the 5080 is the cost leader at £1.51 per 1,000 documents. For absolute throughput, the 6000 Pro is fastest (~340 docs/hour), but at £4.49 per 1,000 documents it is over-spec'd for this workload; the 5090 delivers most of that throughput at well under half the cost per document.
## Bottom line
Summarisation is one of the cheapest AI workloads to self-host. See summarisation pipeline guide.