Medical Report Processing with OCR and LLM

Build a medical report processing pipeline that extracts structured clinical data from scanned reports using PaddleOCR and an LLM on GDPR-compliant dedicated GPU infrastructure.

Tutorials April 16, 2026 3 min read admin

You will build a pipeline that takes scanned medical reports (lab results, discharge summaries, referral letters), extracts the text with OCR, and produces structured clinical data: patient identifiers, diagnoses, medications, test results, and recommended follow-ups. The end result: a clinic processing 200 referral letters a week gets structured data into its EHR system within minutes instead of after hours of manual data entry. All patient data stays on your own infrastructure, which is critical for Caldicott compliance and UK GDPR. Here is the pipeline on dedicated GPU infrastructure.

Pipeline Architecture

Stage	Tool	Output	Data Sensitivity
1. Document ingestion	pdf2image + PaddleOCR	Raw clinical text	Contains patient PII
2. Clinical extraction	LLaMA 3.1 8B	Structured FHIR-like JSON	Contains clinical data
3. Validation	Rule engine	Validated records	Flagged for review
4. De-identification	LLM + regex	Anonymised copy	Research-safe version

Clinical Document OCR

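from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np

# Written against the PaddleOCR 2.x API (use_gpu, cls); later major versions may differ
ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

MIN_CONFIDENCE = 0.75  # Higher threshold for clinical accuracy

def extract_clinical_text(pdf_path: str) -> str:
    """OCR every page of a PDF and return the confident lines as plain text."""
    images = convert_from_path(pdf_path, dpi=300)
    full_text = []
    for img in images:
        result = ocr.ocr(np.array(img), cls=True)
        page_lines = []
        # result[0] can be None when no text is detected on a page
        for _box, (text, confidence) in (result[0] or []):
            if confidence > MIN_CONFIDENCE:
                page_lines.append(text)
        full_text.append("\n".join(page_lines))
    return "\n\n".join(full_text)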

The OCR step keeps only lines that PaddleOCR scores above 0.75 confidence and drops the rest. The threshold is set high for medical documents because accuracy is critical: a misread digit in a lab result or medication dosage could have clinical consequences.
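
Dropping low-confidence lines keeps misreads out of the output, but it also discards text silently. One possible refinement, which is not part of the pipeline above, is to keep the rejected lines in a separate list so a reviewer can see what was filtered out. The sketch below assumes the PaddleOCR 2.x result layout of `[box, (text, confidence)]` per line; `split_by_confidence` is a hypothetical helper name.

```python
from typing import List, Tuple


def split_by_confidence(page_result, threshold: float = 0.75) -> Tuple[List[str], List[str]]:
    """Split one page of OCR lines into (accepted, needs_review) text lists."""
    accepted: List[str] = []
    needs_review: List[str] = []
    # page_result may be None when no text was detected on the page
    for _box, (text, confidence) in (page_result or []):
        if confidence > threshold:
            accepted.append(text)
        else:
            needs_review.append(text)
    return accepted, needs_review
```

Called as `split_by_confidence(result[0])` inside the page loop, the second list could be stored alongside the extracted record for human review.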

Clinical Data Extraction

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def extract_clinical_data(text: str) -> dict:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{
            "role": "system",
            "content": """Extract structured clinical data. Return JSON:
{"patient": {"name": "", "nhs_number": "", "dob": "", "address": ""},
 "document_type": "lab_result|discharge|referral|prescription",
 "date": "",
 "diagnoses": [{"code": "ICD-10 if identifiable", "description": ""}],
 "medications": [{"name": "", "dose": "", "frequency": "", "route": ""}],
 "test_results": [{"test": "", "value": "", "unit": "", "reference_range": "", "flag": "normal|high|low"}],
 "clinical_summary": "",
 "follow_up": ["actions recommended"]}
Use null for fields not present. Flag uncertain extractions with "VERIFY:" prefix."""
        }, {"role": "user", "content": text}],
        max_tokens=1500, temperature=0.0
    )
    # parse_json must turn the model's text reply into a dict (for example via json.loads)
    return parse_json(response.choices[0].message.content)

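The request goes to an OpenAI-compatible endpoint on localhost:8000, here a vLLM server hosting Llama 3.1 8B Instruct, so the report text never leaves the machine. The “VERIFY:” prefix on uncertain fields enables the validation stage to flag those items for human review.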
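
The extraction function relies on a `parse_json` helper that the tutorial does not define. A minimal sketch follows, assuming the model may wrap its JSON in a code fence or a sentence of prose. It takes the outermost `{...}` span and raises `ValueError` if nothing parseable is found; this is one plausible implementation, not necessarily the one the original pipeline used.

```python
import json


def parse_json(reply: str) -> dict:
    """Parse the JSON object contained in a model reply.

    Tolerates text before or after the object (such as a code fence)
    by slicing from the first '{' to the last '}'.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    # json.JSONDecodeError is a subclass of ValueError, so callers
    # can catch ValueError for both failure modes
    return json.loads(reply[start:end + 1])
```

In a clinical pipeline a parse failure should send the document to manual review rather than being retried silently.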

Clinical Validation Rules

# validate_dosage and validate_nhs_number are helper functions you supply
def validate_clinical_data(data: dict) -> dict:
    flags = []
    # Check medication dosages against known ranges
    # ("or []" guards against fields the model returned as null)
    for med in data.get("medications") or []:
        if med.get("dose") and not validate_dosage(med.get("name"), med["dose"]):
            flags.append(f"Unusual dosage: {med.get('name')} {med['dose']}")

    # Flag test results the extractor marked as uncertain
    for test in data.get("test_results") or []:
        if "VERIFY:" in str(test.get("value", "")):
            flags.append(f"OCR uncertain: {test.get('test')}")

    # Verify NHS number format (10 digits with check digit)
    nhs = (data.get("patient") or {}).get("nhs_number")
    if nhs and not validate_nhs_number(nhs):
        flags.append("NHS number format invalid")

    data["validation_flags"] = flags
    data["requires_review"] = len(flags) > 0
    return data
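
The NHS number check can be sketched as follows, assuming the standard Modulus 11 check-digit scheme: the first nine digits are weighted 10 down to 2, and the tenth digit must equal 11 minus the weighted sum modulo 11, with 11 mapped to 0 and 10 treated as invalid. Ignoring spaces and hyphens in the input is an assumption about how numbers appear in scanned reports.

```python
def validate_nhs_number(nhs_number: str) -> bool:
    """Return True if the value is 10 digits with a valid Modulus 11 check digit."""
    digits = "".join(ch for ch in str(nhs_number) if ch not in " -")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights 10, 9, ..., 2 apply to the first nine digits
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        # No valid NHS number produces a check value of 10
        return False
    return check == int(digits[9])
```

A value still carrying the “VERIFY:” prefix fails this check, so uncertain NHS numbers are flagged for review as well.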

De-Identification for Research

Produce an anonymised copy for research and analytics by removing patient name, NHS number, date of birth, address, and any other identifiers. Replace dates with offsets from a reference point to preserve temporal relationships. This enables clinical research on the extracted data without exposing patient identity.
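
A minimal sketch of the regex side of this step, under several assumptions: records follow the JSON shape from the extraction prompt, the `date` field is an ISO 8601 string, and identifiers only leak into `clinical_summary`. The `deidentify` function and the `day_offset` field are hypothetical names. Free-text identifiers that the extractor did not capture (relatives' names, addresses inside other fields) would still need the LLM pass listed in the pipeline table.

```python
import copy
import re
from datetime import date

# Ten digits, optionally grouped 3-3-4 with spaces or hyphens
NHS_NUMBER_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")


def deidentify(record: dict, reference: date) -> dict:
    """Return an anonymised copy; the input record is left unchanged."""
    clean = copy.deepcopy(record)
    patient = clean.pop("patient", None) or {}

    # Replace the absolute date with a day offset from the reference date
    raw_date = clean.pop("date", None)
    clean["day_offset"] = None
    if raw_date:
        try:
            clean["day_offset"] = (date.fromisoformat(raw_date) - reference).days
        except (ValueError, TypeError):
            pass  # unparseable dates are dropped rather than kept

    # Scrub extracted identifiers, then any NHS-number-shaped digits
    summary = clean.get("clinical_summary") or ""
    for value in patient.values():
        if value:
            summary = summary.replace(str(value), "[REDACTED]")
    clean["clinical_summary"] = NHS_NUMBER_RE.sub("[REDACTED]", summary)
    return clean
```

Using one reference date per patient keeps intervals between that patient's documents intact while hiding the calendar dates themselves.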

Compliance and Deployment

Medical data processing requires strict safeguards: compliance with the Caldicott principles, a documented UK GDPR lawful basis (typically legitimate interest or explicit consent, plus an Article 9 condition because health data is special category data), a completed NHS Data Security and Protection Toolkit assessment if handling NHS data, access logging for audit, and encryption at rest and in transit. Deploy exclusively on private UK infrastructure with GDPR compliance. See document AI hosting for OCR infrastructure, model options, more tutorials, and healthcare use cases. Review infrastructure security for access controls.

Healthcare AI GPU Servers

Dedicated GPU servers for clinical document processing. GDPR-compliant UK infrastructure with encryption and access controls.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
