
AI Summarisation Throughput by GPU: Documents Per Hour

How many documents per hour can each GPU summarise? Real numbers across the catalogue for typical map-reduce summarisation workloads.

Table of Contents

  1. Setup
  2. Results
  3. Verdict

Document summarisation has a different throughput profile from chat: longer inputs, smaller outputs, and batch-friendly requests. Here are the numbers.

TL;DR

For map-reduce summarisation of 50-page documents using Llama 3.1 8B FP8: the RTX 5060 Ti hits ~120 docs/hour, the RTX 5090 ~280/hour, and the RTX 6000 Pro ~340/hour. Cost per 1,000 documents ranges from ~£1.51 (RTX 5080) to ~£4.49 (RTX 6000 Pro).

Setup

  • Llama 3.1 8B FP8 via vLLM
  • 50-page input documents (~25K tokens)
  • 4K-token chunks with 200-token overlap
  • Map step: 250-token summary per chunk
  • Reduce step: 500-token final summary
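With a 200-token overlap, each 4K chunk advances 3,800 tokens, so a ~25K-token document splits into 7 map calls. A minimal sketch of that chunking, using a plain token list as a stand-in for the model tokenizer (an assumption for illustration):

```python
def chunk_tokens(tokens, chunk_size=4000, overlap=200):
    """Split a token sequence into overlapping chunks for the map step."""
    step = chunk_size - overlap  # each chunk advances 3,800 tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = list(range(25_000))      # stand-in for a ~25K-token document
print(len(chunk_tokens(doc)))  # 7 chunks, each summarised to ~250 tokens
```

The map step then summarises each chunk independently (which is what makes the workload batch-friendly), and the reduce step condenses the 7 chunk summaries into the final 500-token summary.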

Results

GPU            Docs/hour   Cost per 1,000 docs
RTX 5060 Ti    ~120        £1.95
RTX 3090       ~145        £1.71
RTX 4090       ~190        £2.04
RTX 5080       ~210        £1.51
RTX 5090       ~280        £1.78
RTX 6000 Pro   ~340        £4.49
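The cost column is just the server's hourly price spread over throughput. A back-of-envelope helper (the £/hour figure below is a hypothetical rate, not a quoted price):

```python
def cost_per_1000_docs(docs_per_hour: float, gbp_per_hour: float) -> float:
    """Hours needed to process 1,000 documents, priced at the hourly rate."""
    return round(1000 / docs_per_hour * gbp_per_hour, 2)

# e.g. ~280 docs/hour at a hypothetical £0.50/hour:
print(cost_per_1000_docs(280, 0.50))  # → 1.79
```

This is why a faster card isn't automatically cheaper per document: throughput and hourly price scale at different rates across the catalogue.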

Verdict

For high-volume summarisation, the RTX 5080 is the cost leader at ~£1.51 per 1,000 documents. For absolute throughput, the RTX 5090 leads at ~280 docs/hour. The RTX 6000 Pro is over-spec’d for this workload: an 8B FP8 model doesn’t need its VRAM, and its cost per document is nearly triple the 5080’s.

Bottom line

Summarisation is one of the cheapest AI workloads to self-host. See our summarisation pipeline guide for implementation details.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
