You will build a pipeline that takes a stack of scanned PDF invoices, extracts text using GPU-accelerated OCR, and produces structured summaries with key fields (vendor, amount, date, line items) — all running on your own server. The end result: drop 500 scanned documents into a folder and get a structured JSON summary for each within minutes. No manual data entry, no cloud OCR API fees, no document data leaving your infrastructure. Here is the complete setup on dedicated GPU servers.
Pipeline Architecture
| Stage | Tool | Input | Output | VRAM |
|---|---|---|---|---|
| 1. PDF to images | pdf2image | Scanned PDF | Page images | CPU only |
| 2. OCR extraction | PaddleOCR | Page images | Raw text + positions | ~1GB |
| 3. LLM summarisation | LLaMA 3.1 8B | Raw OCR text | Structured JSON | ~6GB |
Total VRAM: approximately 7GB. A 24GB GPU handles this with significant headroom for batch processing.
Environment Setup
```shell
# Install Python dependencies
pip install paddlepaddle-gpu paddleocr pdf2image Pillow vllm fastapi

# System dependency for PDF rendering
apt install poppler-utils -y

# Start the vLLM OpenAI-compatible server in the background.
# Note: --quantization gptq requires a GPTQ-quantized checkpoint; to serve the
# base FP16 weights instead, drop the flag (and budget ~16GB VRAM rather than ~6GB).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization gptq --port 8000 &
```
PaddleOCR runs GPU-accelerated text detection and recognition. The vLLM server handles the summarisation stage.
Stage 1-2: PDF to OCR Text
```python
from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np

# Angle classification handles rotated scans; use_gpu requires paddlepaddle-gpu
ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

def extract_text_from_pdf(pdf_path: str) -> str:
    """OCR every page of a scanned PDF and return its text, page by page."""
    images = convert_from_path(pdf_path, dpi=300)  # 300 DPI balances accuracy and speed
    all_text = []
    for page_num, image in enumerate(images):
        img_array = np.array(image)
        result = ocr.ocr(img_array, cls=True)
        page_text = []
        # result[0] is None when no text is detected on the page
        for line in result[0] or []:
            text, confidence = line[1]
            if confidence > 0.7:  # drop low-confidence recognitions
                page_text.append(text)
        all_text.append(f"--- Page {page_num + 1} ---\n" + "\n".join(page_text))
    return "\n\n".join(all_text)
```
The confidence threshold of 0.7 filters out low-quality OCR results that would confuse the LLM. For document AI workflows, adjust this threshold based on your document quality.
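PaddleOCR returns, for each page, a list of `[bounding_box, (text, confidence)]` pairs, so the threshold is easy to tune offline against real scans. A minimal sketch of the filtering step on a mock page result (the sample text and confidence values are illustrative, not real OCR output):

```python
# Mock of PaddleOCR's per-page result structure: [bounding_box, (text, confidence)]
mock_page = [
    [[[0, 0], [200, 0], [200, 20], [0, 20]], ("INVOICE #1042", 0.98)],
    [[[0, 30], [200, 30], [200, 50], [0, 50]], ("T0tal: $l,234", 0.41)],  # noisy scan line
    [[[0, 60], [200, 60], [200, 80], [0, 80]], ("Total: $1,234.00", 0.93)],
]

def filter_lines(page_result, threshold=0.7):
    """Keep only text whose recognition confidence clears the threshold."""
    return [text for _box, (text, conf) in page_result if conf > threshold]

print(filter_lines(mock_page))  # the 0.41-confidence line is dropped
```

Re-running `filter_lines` over a labelled sample of your own documents at a few thresholds is a quick way to pick the value that balances noise against dropped fields.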
Stage 3: LLM Structured Extraction
```python
import requests, json

def summarise_document(ocr_text: str, doc_type: str = "invoice") -> dict:
    prompt = f"""Extract structured data from this {doc_type} OCR text.
Return ONLY valid JSON with these fields:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD)
- total_amount: number
- currency: string
- line_items: list of {{"description": string, "amount": number}}

OCR Text:
{ocr_text}

JSON:"""
    response = requests.post("http://localhost:8000/v1/completions", json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 500,
        "temperature": 0.0
    })
    response.raise_for_status()  # surface vLLM server errors early
    raw = response.json()["choices"][0]["text"]
    return json.loads(raw.strip())
```
Setting temperature to 0 makes the model decode greedily, giving repeatable structured output for the same input. The LLM can also often repair common OCR errors (misread digits, broken words) because it sees the surrounding document context.
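One fragility in the parsing step: `json.loads` raises if the model wraps its answer in prose or a markdown code fence, which even well-prompted models occasionally do. A defensive variant (a sketch, not part of the pipeline above) extracts the outermost JSON object before parsing:

```python
import json

def parse_llm_json(raw: str) -> dict:
    """Extract and parse the first top-level JSON object in an LLM completion.

    Tolerates leading/trailing prose around the JSON. Note: the brace counting
    is naive -- braces inside string values would miscount -- but it is adequate
    for flat invoice-style output.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object in completion")
    depth = 0
    for i, ch in enumerate(raw[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(raw[start:i + 1])
    raise ValueError("unbalanced JSON object in completion")
```

Swapping this in for the bare `json.loads(raw.strip())` turns an occasional crash into a recoverable per-document error.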
Batch Processing
```python
import os, glob

def process_folder(input_dir: str, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)
    pdfs = glob.glob(os.path.join(input_dir, "*.pdf"))
    results = []
    for pdf_path in pdfs:
        ocr_text = extract_text_from_pdf(pdf_path)
        summary = summarise_document(ocr_text)
        summary["source_file"] = os.path.basename(pdf_path)
        # splitext is safer than str.replace for names like "report.pdf.pdf"
        base = os.path.splitext(os.path.basename(pdf_path))[0]
        output_path = os.path.join(output_dir, base + ".json")
        with open(output_path, "w") as f:
            json.dump(summary, f, indent=2)
        results.append(summary)
    return results

# Process all invoices
results = process_folder("/data/scanned_invoices/", "/data/summaries/")
```
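For re-runs over a growing folder, it also helps to skip PDFs that already have a summary, so a crash partway through a batch does not force reprocessing everything. A small sketch (the helper name is my own, not part of the pipeline above):

```python
import glob
import os

def pending_pdfs(input_dir: str, output_dir: str) -> list:
    """Return PDFs in input_dir that have no matching .json in output_dir yet."""
    done = {os.path.splitext(name)[0] for name in os.listdir(output_dir)
            if name.endswith(".json")}
    return [path for path in sorted(glob.glob(os.path.join(input_dir, "*.pdf")))
            if os.path.splitext(os.path.basename(path))[0] not in done]
```

Feeding `pending_pdfs(...)` into the processing loop instead of a fresh `glob` makes the batch job idempotent.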
Improving Accuracy
For production deployments:

- Pre-process images with deskewing and contrast enhancement before OCR.
- Use PaddleOCR's table recognition mode for documents with structured tables.
- Add validation rules (e.g., the total must equal the sum of line items) and flag mismatches for human review.
- Fine-tune the prompt for your specific document types.
- Store results in a database for search and analytics.

Teams handling sensitive documents should deploy on private infrastructure with encryption at rest. See open-source model options for larger models that improve extraction accuracy, explore industry use cases for sector-specific document workflows, and check more tutorials for related pipelines.
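The line-item validation rule mentioned above can be sketched as a post-processing check. Field names follow the prompt schema; the tolerance value is an assumption to absorb OCR rounding noise:

```python
def validate_summary(summary: dict, tolerance: float = 0.01) -> list:
    """Return a list of validation problems; an empty list means the summary passes."""
    problems = []
    items = summary.get("line_items") or []
    item_sum = sum(item.get("amount", 0) for item in items)
    total = summary.get("total_amount")
    if total is None:
        problems.append("missing total_amount")
    elif abs(item_sum - total) > tolerance:
        problems.append(f"line items sum to {item_sum}, but total_amount is {total}")
    if not summary.get("vendor_name"):
        problems.append("missing vendor_name")
    return problems
```

Summaries with a non-empty problem list can be routed to a review queue instead of being written out silently.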