You will build a pipeline that takes a stack of scanned PDF invoices, extracts text using GPU-accelerated OCR, and produces structured summaries with key fields (vendor, amount, date, line items) — all running on your own server. The end result: drop 500 scanned documents into a folder and get a structured JSON summary for each within minutes. No manual data entry, no cloud OCR API fees, no document data leaving your infrastructure. Here is the complete setup on dedicated GPU servers.
Pipeline Architecture
| Stage | Tool | Input | Output | VRAM |
|---|---|---|---|---|
| 1. PDF to images | pdf2image | Scanned PDF | Page images | CPU only |
| 2. OCR extraction | PaddleOCR | Page images | Raw text + positions | ~1GB |
| 3. LLM summarisation | LLaMA 3.1 8B | Raw OCR text | Structured JSON | ~6GB |
Total VRAM: approximately 7GB. A 24GB GPU handles this with significant headroom for batch processing.
Environment Setup
```shell
# Install Python dependencies
pip install paddlepaddle-gpu paddleocr pdf2image Pillow vllm fastapi

# System dependency for PDF rendering
apt install poppler-utils -y

# Start the vLLM OpenAI-compatible server in the background.
# Note: --quantization gptq requires a GPTQ-quantized checkpoint; to serve the
# base FP16 weights instead, drop the flag (and budget ~16GB VRAM rather than ~6GB).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization gptq --port 8000 &
```
PaddleOCR runs GPU-accelerated text detection and recognition. The vLLM server handles the summarisation stage.
Stage 1-2: PDF to OCR Text
```python
from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np

# Angle classification handles rotated scans; use_gpu requires paddlepaddle-gpu
ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

def extract_text_from_pdf(pdf_path: str) -> str:
    """OCR every page of a scanned PDF and return its text, page by page."""
    images = convert_from_path(pdf_path, dpi=300)  # 300 DPI balances accuracy and speed
    all_text = []
    for page_num, image in enumerate(images):
        img_array = np.array(image)
        result = ocr.ocr(img_array, cls=True)
        page_text = []
        # result[0] is None when no text is detected on the page
        for line in result[0] or []:
            text, confidence = line[1]
            if confidence > 0.7:  # drop low-confidence recognitions
                page_text.append(text)
        all_text.append(f"--- Page {page_num + 1} ---\n" + "\n".join(page_text))
    return "\n\n".join(all_text)
```
The confidence threshold of 0.7 filters out low-quality OCR results that would confuse the LLM. For document AI workflows, adjust this threshold based on your document quality.
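PaddleOCR returns, for each page, a list of `[bounding_box, (text, confidence)]` pairs, so the threshold is easy to tune offline against real scans. A minimal sketch of the filtering step on a mock page result (the sample text and confidence values are illustrative, not real OCR output):

```python
# Mock of PaddleOCR's per-page result structure: [bounding_box, (text, confidence)]
mock_page = [
    [[[0, 0], [200, 0], [200, 20], [0, 20]], ("INVOICE #1042", 0.98)],
    [[[0, 30], [200, 30], [200, 50], [0, 50]], ("T0tal: $l,234", 0.41)],  # noisy scan line
    [[[0, 60], [200, 60], [200, 80], [0, 80]], ("Total: $1,234.00", 0.93)],
]

def filter_lines(page_result, threshold=0.7):
    """Keep only text whose recognition confidence clears the threshold."""
    return [text for _box, (text, conf) in page_result if conf > threshold]

print(filter_lines(mock_page))  # the 0.41-confidence line is dropped
```

Re-running `filter_lines` over a labelled sample of your own documents at a few thresholds is a quick way to pick the value that balances noise against dropped fields.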
Stage 3: LLM Structured Extraction
```python
import requests, json

def summarise_document(ocr_text: str, doc_type: str = "invoice") -> dict:
    prompt = f"""Extract structured data from this {doc_type} OCR text.
Return ONLY valid JSON with these fields:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD)
- total_amount: number
- currency: string
- line_items: list of {{"description": string, "amount": number}}

OCR Text:
{ocr_text}

JSON:"""
    response = requests.post("http://localhost:8000/v1/completions", json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 500,
        "temperature": 0.0
    })
    response.raise_for_status()  # surface vLLM server errors early
    raw = response.json()["choices"][0]["text"]
    return json.loads(raw.strip())
```
Setting temperature to 0 makes the model decode greedily, giving repeatable structured output for the same input. The LLM can also often repair common OCR errors (misread digits, broken words) because it sees the surrounding document context.
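One fragility in the parsing step: `json.loads` raises if the model wraps its answer in prose or a markdown code fence, which even well-prompted models occasionally do. A defensive variant (a sketch, not part of the pipeline above) extracts the outermost JSON object before parsing:

```python
import json

def parse_llm_json(raw: str) -> dict:
    """Extract and parse the first top-level JSON object in an LLM completion.

    Tolerates leading/trailing prose around the JSON. Note: the brace counting
    is naive -- braces inside string values would miscount -- but it is adequate
    for flat invoice-style output.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object in completion")
    depth = 0
    for i, ch in enumerate(raw[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(raw[start:i + 1])
    raise ValueError("unbalanced JSON object in completion")
```

Swapping this in for the bare `json.loads(raw.strip())` turns an occasional crash into a recoverable per-document error.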
Batch Processing
```python
import os, glob

def process_folder(input_dir: str, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)
    pdfs = glob.glob(os.path.join(input_dir, "*.pdf"))
    results = []
    for pdf_path in pdfs:
        ocr_text = extract_text_from_pdf(pdf_path)
        summary = summarise_document(ocr_text)
        summary["source_file"] = os.path.basename(pdf_path)
        # splitext is safer than str.replace for names like "report.pdf.pdf"
        base = os.path.splitext(os.path.basename(pdf_path))[0]
        output_path = os.path.join(output_dir, base + ".json")
        with open(output_path, "w") as f:
            json.dump(summary, f, indent=2)
        results.append(summary)
    return results

# Process all invoices
results = process_folder("/data/scanned_invoices/", "/data/summaries/")
```
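For re-runs over a growing folder, it also helps to skip PDFs that already have a summary, so a crash partway through a batch does not force reprocessing everything. A small sketch (the helper name is my own, not part of the pipeline above):

```python
import glob
import os

def pending_pdfs(input_dir: str, output_dir: str) -> list:
    """Return PDFs in input_dir that have no matching .json in output_dir yet."""
    done = {os.path.splitext(name)[0] for name in os.listdir(output_dir)
            if name.endswith(".json")}
    return [path for path in sorted(glob.glob(os.path.join(input_dir, "*.pdf")))
            if os.path.splitext(os.path.basename(path))[0] not in done]
```

Feeding `pending_pdfs(...)` into the processing loop instead of a fresh `glob` makes the batch job idempotent.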
Improving Accuracy
For production deployments:

- Pre-process images with deskewing and contrast enhancement before OCR.
- Use PaddleOCR's table recognition mode for documents with structured tables.
- Add validation rules (e.g., the total must equal the sum of line items) and flag mismatches for human review.
- Fine-tune the prompt for your specific document types.
- Store results in a database for search and analytics.

Teams handling sensitive documents should deploy on private infrastructure with encryption at rest. See open-source model options for larger models that improve extraction accuracy, explore industry use cases for sector-specific document workflows, and check more tutorials for related pipelines.
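The line-item validation rule mentioned above can be sketched as a post-processing check. Field names follow the prompt schema; the tolerance value is an assumption to absorb OCR rounding noise:

```python
def validate_summary(summary: dict, tolerance: float = 0.01) -> list:
    """Return a list of validation problems; an empty list means the summary passes."""
    problems = []
    items = summary.get("line_items") or []
    item_sum = sum(item.get("amount", 0) for item in items)
    total = summary.get("total_amount")
    if total is None:
        problems.append("missing total_amount")
    elif abs(item_sum - total) > tolerance:
        problems.append(f"line items sum to {item_sum}, but total_amount is {total}")
    if not summary.get("vendor_name"):
        problems.append("missing vendor_name")
    return problems
```

Summaries with a non-empty problem list can be routed to a review queue instead of being written out silently.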