
AI Summarisation Throughput by GPU: Documents Per Hour

How many documents per hour can each GPU summarise? Real numbers across the catalogue for typical map-reduce summarisation workloads.

Table of Contents

  1. Setup
  2. Results
  3. Verdict

Document summarisation has a different throughput profile from chat: longer inputs, smaller outputs, and batch-friendly requests. Here are the numbers.

TL;DR

For map-reduce summarisation of 50-page documents using Llama 3.1 8B FP8: the RTX 5060 Ti hits ~120 docs/hour, the RTX 5090 ~280/hour, and the RTX 6000 Pro ~340/hour. Cost per 1,000 documents ranges from ~£1.51 (RTX 5080) to ~£4.49 (RTX 6000 Pro).

Setup

  • Llama 3.1 8B FP8 via vLLM
  • 50-page input documents (~25K tokens)
  • 4K-token chunks with 200-token overlap
  • Map step: 250-token summary per chunk
  • Reduce step: 500-token final summary
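With a 200-token overlap, each 4K chunk advances 3,800 tokens, so a ~25K-token document splits into 7 map calls. A minimal sketch of that chunking, using a plain token list as a stand-in for the model tokenizer (an assumption for illustration):

```python
def chunk_tokens(tokens, chunk_size=4000, overlap=200):
    """Split a token sequence into overlapping chunks for the map step."""
    step = chunk_size - overlap  # each chunk advances 3,800 tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = list(range(25_000))      # stand-in for a ~25K-token document
print(len(chunk_tokens(doc)))  # 7 chunks, each summarised to ~250 tokens
```

The map step then summarises each chunk independently (which is what makes the workload batch-friendly), and the reduce step condenses the 7 chunk summaries into the final 500-token summary.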

Results

GPU            Docs/hour   Cost per 1,000 docs
RTX 5060 Ti    ~120        £1.95
RTX 3090       ~145        £1.71
RTX 4090       ~190        £2.04
RTX 5080       ~210        £1.51
RTX 5090       ~280        £1.78
RTX 6000 Pro   ~340        £4.49
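The cost column is just the server's hourly price spread over throughput. A back-of-envelope helper (the £/hour figure below is a hypothetical rate, not a quoted price):

```python
def cost_per_1000_docs(docs_per_hour: float, gbp_per_hour: float) -> float:
    """Hours needed to process 1,000 documents, priced at the hourly rate."""
    return round(1000 / docs_per_hour * gbp_per_hour, 2)

# e.g. ~280 docs/hour at a hypothetical £0.50/hour:
print(cost_per_1000_docs(280, 0.50))  # → 1.79
```

This is why a faster card isn't automatically cheaper per document: throughput and hourly price scale at different rates across the catalogue.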

Verdict

For high-volume summarisation, the RTX 5080 is the cost leader at ~£1.51 per 1,000 documents. For absolute throughput, the RTX 5090 leads at ~280 docs/hour. The RTX 6000 Pro is over-spec’d for this workload: an 8B FP8 model doesn’t need its VRAM, and its cost per document is nearly triple the 5080’s.

Bottom line

Summarisation is one of the cheapest AI workloads to self-host. See our summarisation pipeline guide for implementation details.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
