
OCR + LLM Document Summarisation Pipeline

Build a pipeline that extracts text from scanned documents with PaddleOCR and generates structured summaries with a self-hosted LLM on dedicated GPU infrastructure.

You will build a pipeline that takes a stack of scanned PDF invoices, extracts text using GPU-accelerated OCR, and produces structured summaries with key fields (vendor, amount, date, line items) — all running on your own server. The end result: drop 500 scanned documents into a folder and get a structured JSON summary for each within minutes. No manual data entry, no cloud OCR API fees, no document data leaving your infrastructure. Here is the complete setup on dedicated GPU servers.

Pipeline Architecture

Stage                   Tool           Input          Output                 VRAM
1. PDF to images        pdf2image      Scanned PDF    Page images            CPU only
2. OCR extraction       PaddleOCR      Page images    Raw text + positions   ~1GB
3. LLM summarisation    LLaMA 3.1 8B   Raw OCR text   Structured JSON        ~6GB

Total VRAM: approximately 7GB. A 24GB GPU handles this with significant headroom for batch processing.

Environment Setup

# Install dependencies
pip install paddlepaddle-gpu paddleocr pdf2image Pillow vllm fastapi

# System dependency for PDF rendering
apt install poppler-utils -y

# Start vLLM with a GPTQ-quantised checkpoint (the base FP16
# meta-llama/Llama-3.1-8B-Instruct weights will not load with
# --quantization gptq, and would need ~16GB rather than ~6GB)
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
  --quantization gptq \
  --served-model-name meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 &

PaddleOCR runs GPU-accelerated text detection and recognition. The vLLM server handles the summarisation stage.

Stage 1-2: PDF to OCR Text

from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np

ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

def extract_text_from_pdf(pdf_path: str) -> str:
    images = convert_from_path(pdf_path, dpi=300)
    all_text = []
    for page_num, image in enumerate(images):
        img_array = np.array(image)
        result = ocr.ocr(img_array, cls=True)
        # PaddleOCR returns [None] for pages with no detected text (blank pages)
        if not result or result[0] is None:
            continue
        page_text = []
        for line in result[0]:
            text, confidence = line[1]
            if confidence > 0.7:  # skip low-confidence recognitions
                page_text.append(text)
        all_text.append(f"--- Page {page_num + 1} ---\n" + "\n".join(page_text))
    return "\n\n".join(all_text)

The confidence threshold of 0.7 filters out low-quality OCR results that would confuse the LLM. For document AI workflows, adjust this threshold based on your document quality.
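The filtering step can be isolated and tuned on its own. A minimal sketch of the same logic on hand-made sample lines (the tuples below imitate PaddleOCR's `(box, (text, confidence))` result format; the text values are invented):

```python
# Invented samples in PaddleOCR's per-line result shape: (box, (text, confidence))
sample_lines = [
    ([[0, 0], [100, 0], [100, 20], [0, 20]], ("INVOICE #1042", 0.98)),
    ([[0, 30], [100, 30], [100, 50], [0, 50]], ("Tot@l: 1Z4.50", 0.41)),  # noisy scan
    ([[0, 60], [100, 60], [100, 80], [0, 80]], ("Total: 124.50", 0.91)),
]

def filter_by_confidence(lines, threshold=0.7):
    """Keep only recognised text whose confidence clears the threshold."""
    return [text for _box, (text, conf) in lines if conf > threshold]

print(filter_by_confidence(sample_lines))
# Lowering the threshold (e.g. 0.5) keeps more text at the cost of more OCR noise.
```

With the default 0.7 threshold the garbled middle line is dropped and only the two clean readings survive.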

Stage 3: LLM Structured Extraction

import requests, json

def summarise_document(ocr_text: str, doc_type: str = "invoice") -> dict:
    prompt = f"""Extract structured data from this {doc_type} OCR text.
Return ONLY valid JSON with these fields:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD)
- total_amount: number
- currency: string
- line_items: list of {{"description": string, "amount": number}}

OCR Text:
{ocr_text}

JSON:"""

    response = requests.post("http://localhost:8000/v1/completions", json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 500,
        "temperature": 0.0
    })
    response.raise_for_status()
    raw = response.json()["choices"][0]["text"]
    return json.loads(raw.strip())

Setting temperature to 0 produces deterministic structured output. The LLM corrects common OCR errors (misread digits, broken words) because it understands the document context.
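Even at temperature 0, a model can wrap its answer in markdown fences or append trailing text, which makes a bare `json.loads` on the raw reply fragile. A defensive parse, sketched under the assumption that the reply contains exactly one top-level JSON object (`parse_llm_json` is a hypothetical helper name):

```python
import json

def parse_llm_json(raw: str) -> dict:
    """Extract the first top-level JSON object from an LLM reply.

    Tolerates markdown code fences and stray text around the object."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object found in reply: {raw!r}")
    return json.loads(raw[start:end + 1])

# A fenced reply like this would break json.loads(raw.strip()) directly:
reply = '```json\n{"vendor_name": "Acme Ltd", "total_amount": 124.5}\n```'
print(parse_llm_json(reply)["vendor_name"])  # prints Acme Ltd
```

Swapping this helper in for the final `json.loads` call keeps the pipeline running when the model occasionally decorates its output.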

Batch Processing

import os, glob

def process_folder(input_dir: str, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)
    pdfs = glob.glob(f"{input_dir}/*.pdf")
    results = []
    for pdf_path in pdfs:
        ocr_text = extract_text_from_pdf(pdf_path)
        summary = summarise_document(ocr_text)
        summary["source_file"] = os.path.basename(pdf_path)
        base = os.path.splitext(os.path.basename(pdf_path))[0]
        output_path = os.path.join(output_dir, base + ".json")
        with open(output_path, "w") as f:
            json.dump(summary, f, indent=2)
        results.append(summary)
    return results

# Process all invoices
results = process_folder("/data/scanned_invoices/", "/data/summaries/")
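One corrupt scan or malformed LLM reply should not abort a 500-document batch. A hedged sketch of a per-file guard: `extract_fn` and `summarise_fn` stand in for the `extract_text_from_pdf` and `summarise_document` functions above, and the error-record format is an assumption:

```python
import os

def process_one(pdf_path, extract_fn, summarise_fn):
    """Run one document through the pipeline without ever raising.

    Failures are captured as an error record instead of crashing the batch."""
    try:
        summary = summarise_fn(extract_fn(pdf_path))
        summary["status"] = "ok"
    except Exception as exc:  # unreadable scan, OCR failure, invalid LLM JSON
        summary = {"status": "error", "error": str(exc)}
    summary["source_file"] = os.path.basename(pdf_path)
    return summary

def broken_summariser(text):
    raise ValueError("bad JSON from LLM")

# Stub stages so the guard can be exercised without a GPU:
ok = process_one("inv1.pdf", lambda p: "text", lambda t: {"total_amount": 10})
bad = process_one("inv2.pdf", lambda p: "text", broken_summariser)
```

Error records land in the same output folder as successes, so failed documents are easy to find and re-queue after fixing the scan.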

Improving Accuracy

For production deployments:

- Pre-process images with deskewing and contrast enhancement before OCR.
- Use PaddleOCR's table recognition mode for documents with structured tables.
- Add validation rules (e.g. the total must equal the sum of line items) and flag mismatches for human review.
- Fine-tune the prompt for your specific document types.
- Store results in a database for search and analytics.

Teams handling sensitive documents should deploy on private infrastructure with encryption at rest.
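The line-item validation rule is a few lines of Python. A sketch, assuming the JSON schema from Stage 3 (field names as defined there; the 0.01 rounding tolerance and `validate_summary` helper name are assumptions):

```python
def validate_summary(summary: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of validation problems; an empty list means the summary passed."""
    problems = []
    items = summary.get("line_items") or []
    line_total = sum(item.get("amount", 0) for item in items)
    total = summary.get("total_amount")
    if total is None:
        problems.append("missing total_amount")
    elif items and abs(line_total - total) > tolerance:
        problems.append(
            f"line items sum to {line_total:.2f} but total_amount is {total:.2f}")
    if not summary.get("vendor_name"):
        problems.append("missing vendor_name")
    return problems

summary = {"vendor_name": "Acme Ltd", "total_amount": 120.0,
           "line_items": [{"description": "Widgets", "amount": 100.0},
                          {"description": "Shipping", "amount": 15.0}]}
print(validate_summary(summary))  # flags the 5.00 mismatch for human review
```

Summaries with a non-empty problem list go to a review queue instead of straight into the database; OCR misreads of digits are exactly the errors this check catches.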

Document AI GPU Servers

Dedicated GPU servers for OCR and LLM document processing pipelines. Process sensitive documents on isolated UK infrastructure.

Browse GPU Servers
