
Medical Report Processing with OCR and LLM

Build a medical report processing pipeline that extracts structured clinical data from scanned reports using PaddleOCR and an LLM on GDPR-compliant dedicated GPU infrastructure.

You will build a pipeline that takes scanned medical reports (lab results, discharge summaries, referral letters), extracts text with OCR, and produces structured clinical data: patient identifiers, diagnoses, medications, test results, and recommended follow-ups. The result: a clinic processing 200 referral letters a week gets structured data into its EHR system within minutes instead of hours of manual data entry. All patient data stays on your own infrastructure, which is critical for Caldicott and UK GDPR compliance. The sections below walk through the pipeline on dedicated GPU infrastructure.

Pipeline Architecture

| Stage | Tool | Output | Data Sensitivity |
|---|---|---|---|
| 1. Document ingestion | pdf2image + PaddleOCR | Raw clinical text | Contains patient PII |
| 2. Clinical extraction | LLaMA 3.1 8B | Structured FHIR-like JSON | Contains clinical data |
| 3. Validation | Rule engine | Validated records | Flagged for review |
| 4. De-identification | LLM + regex | Anonymised copy | Research-safe version |

Clinical Document OCR

from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np

ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

def extract_clinical_text(pdf_path: str) -> str:
    images = convert_from_path(pdf_path, dpi=300)  # 300 DPI to resolve small print
    full_text = []
    for img in images:
        result = ocr.ocr(np.array(img), cls=True)
        page_lines = []
        # PaddleOCR returns [None] for pages with no detected text
        for line in result[0] or []:
            text, confidence = line[1]
            if confidence > 0.75:  # Higher threshold for clinical accuracy
                page_lines.append(text)
        full_text.append("\n".join(page_lines))
    return "\n\n".join(full_text)

PaddleOCR runs with a higher confidence threshold for medical documents where accuracy is critical. Misread digits in lab results or medication dosages could have clinical consequences.
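Scan quality matters as much as the confidence threshold. A light preprocessing pass before OCR can lift recognition accuracy on faxed or photocopied reports. This is a minimal sketch using Pillow (already a dependency of pdf2image); heavier steps such as deskewing and denoising are not covered in this tutorial.

```python
from PIL import Image, ImageOps

def preprocess_page(img: Image.Image) -> Image.Image:
    """Basic cleanup before OCR: grayscale plus contrast stretch.

    A minimal sketch -- deskew and denoise passes may also help on
    poor-quality scans but are beyond the scope of this tutorial.
    """
    grey = ImageOps.grayscale(img)
    return ImageOps.autocontrast(grey)
```

Call it on each page from `convert_from_path` before handing the array to `ocr.ocr`.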

Clinical Data Extraction

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def extract_clinical_data(text: str) -> dict:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{
            "role": "system",
            "content": """Extract structured clinical data. Return JSON:
{"patient": {"name": "", "nhs_number": "", "dob": "", "address": ""},
 "document_type": "lab_result|discharge|referral|prescription",
 "date": "",
 "diagnoses": [{"code": "ICD-10 if identifiable", "description": ""}],
 "medications": [{"name": "", "dose": "", "frequency": "", "route": ""}],
 "test_results": [{"test": "", "value": "", "unit": "", "reference_range": "", "flag": "normal|high|low"}],
 "clinical_summary": "",
 "follow_up": ["actions recommended"]}
Use null for fields not present. Flag uncertain extractions with "VERIFY:" prefix."""
        }, {"role": "user", "content": text}],
        max_tokens=1500, temperature=0.0
    )
    return parse_json(response.choices[0].message.content)

A local vLLM server, exposed through the OpenAI-compatible endpoint at `localhost:8000`, performs the structured extraction. The "VERIFY:" prefix on uncertain fields lets the validation stage flag items for human review.
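The `parse_json` helper used above is not defined in this tutorial. One possible implementation, assuming the model may wrap its JSON in markdown fences or surround it with prose despite the prompt:

```python
import json
import re

def parse_json(raw: str) -> dict:
    """Parse the model's JSON reply, tolerating markdown fences and
    leading or trailing prose. A minimal sketch -- one possible
    implementation of the helper used in extract_clinical_data.
    """
    # Strip ```json ... ``` fences if the model added them
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    # Fall back to the outermost {...} span
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(raw[start:end + 1])
```

A `ValueError` here should route the document straight to human review rather than silently dropping it.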

Clinical Validation Rules

def validate_clinical_data(data: dict) -> dict:
    flags = []

    # Check medication dosages against known ranges
    # (validate_dosage is assumed to look up a clinical dosage table)
    for med in data.get("medications", []):
        if med.get("dose") and not validate_dosage(med["name"], med["dose"]):
            flags.append(f"Unusual dosage: {med['name']} {med['dose']}")

    # Surface test results the model marked as uncertain
    for test in data.get("test_results", []):
        if "VERIFY:" in str(test.get("value", "")):
            flags.append(f"OCR uncertain: {test['test']}")

    # Verify NHS number format (10 digits with Modulus 11 check digit)
    nhs = data.get("patient", {}).get("nhs_number", "")
    if nhs and not validate_nhs_number(nhs):
        flags.append("NHS number format invalid")

    data["validation_flags"] = flags
    data["requires_review"] = len(flags) > 0
    return data
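The `validate_nhs_number` helper referenced above is not shown in the tutorial. NHS numbers carry a Modulus 11 check digit, so a self-contained implementation is straightforward:

```python
def validate_nhs_number(nhs: str) -> bool:
    """Modulus 11 check for a 10-digit NHS number.

    The first nine digits are weighted 10 down to 2; the check digit
    is 11 minus (weighted sum mod 11), with a result of 11 treated
    as 0 and a result of 10 marking the number as invalid.
    """
    digits = [c for c in nhs if c.isdigit()]  # tolerate spaces/hyphens
    if len(digits) != 10:
        return False
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        return False
    return check == int(digits[9])
```

For example, the synthetic number 943 476 5919 passes the check, while 943 476 5918 fails on the check digit.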

De-Identification for Research

Produce an anonymised copy for research and analytics by removing patient name, NHS number, date of birth, address, and any other identifiers. Replace dates with offsets from a reference point to preserve temporal relationships. This enables clinical research on the extracted data without exposing patient identity.
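The regex side of that hybrid approach can be sketched as follows. The patterns below are illustrative, not exhaustive: they catch NHS numbers and DD/MM/YYYY dates, while names and addresses still need the LLM pass described above.

```python
import re
from datetime import date

NHS_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")  # DD/MM/YYYY

def deidentify(text: str, reference: date) -> str:
    """Redact NHS numbers and replace dates with day offsets from a
    reference date, preserving temporal relationships for research.
    Regex-only sketch: identifiers in free text need the LLM pass.
    """
    text = NHS_RE.sub("[NHS-REDACTED]", text)

    def shift(m: re.Match) -> str:
        d = date(int(m.group(3)), int(m.group(2)), int(m.group(1)))
        return f"[DAY{(d - reference).days:+d}]"

    return DATE_RE.sub(shift, text)
```

Using the admission date as the reference keeps intervals such as "review in 14 days" meaningful in the anonymised copy.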

Compliance and Deployment

Medical data processing requires strict safeguards: compliance with the Caldicott principles, a UK GDPR lawful basis (typically explicit consent or legitimate interest), NHS Data Security and Protection Toolkit certification if handling NHS data, access logging for audit, and encryption at rest and in transit. Deploy exclusively on private, GDPR-compliant UK infrastructure. See the document AI hosting guide for OCR infrastructure, model options, and healthcare use cases, and the infrastructure security guide for access controls.
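The access-logging requirement can be met with a simple append-only structured log. A minimal sketch using Python's standard `logging` module; the field names and `audit.log` path are illustrative, and a production deployment would ship these records to tamper-evident storage.

```python
import json
import logging
from datetime import datetime, timezone

audit = logging.getLogger("clinical_audit")
handler = logging.FileHandler("audit.log")
handler.setFormatter(logging.Formatter("%(message)s"))
audit.addHandler(handler)
audit.setLevel(logging.INFO)

def log_access(user: str, document_id: str, action: str) -> None:
    """Append one structured JSON record per document access."""
    audit.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "document": document_id,
        "action": action,
    }))
```

Call `log_access` at every pipeline stage that touches patient-identifiable data, so the audit trail covers ingestion, extraction, validation, and de-identification.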

Healthcare AI GPU Servers

Dedicated GPU servers for clinical document processing. GDPR-compliant UK infrastructure with encryption and access controls.

Browse GPU Servers
