You will build a pipeline that takes scanned medical reports (lab results, discharge summaries, referral letters), extracts text with OCR, and produces structured clinical data: patient identifiers, diagnoses, medications, test results, and recommended follow-ups. The end result: clinics processing 200 referral letters weekly get structured data in their EHR system within minutes instead of hours of manual data entry. All patient data stays on your infrastructure — critical for Caldicott compliance and UK GDPR. Here is the pipeline on dedicated GPU infrastructure.
Pipeline Architecture
| Stage | Tool | Output | Data Sensitivity |
|---|---|---|---|
| 1. Document ingestion | pdf2image + PaddleOCR | Raw clinical text | Contains patient PII |
| 2. Clinical extraction | LLaMA 3.1 8B | Structured FHIR-like JSON | Contains clinical data |
| 3. Validation | Rule engine | Validated records | Flagged for review |
| 4. De-identification | LLM + regex | Anonymised copy | Research-safe version |
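The four stages in the table compose as a straight chain, which can be sketched as follows. This is a minimal illustration, not the article's implementation: `run_pipeline`, `PipelineResult`, and the stage parameter names are hypothetical, and each stage is assumed to be a plain callable so it can be swapped or stubbed in tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    record: dict           # validated record, still contains patient identifiers
    anonymised: dict       # de-identified copy intended for research use
    requires_review: bool  # True when validation raised at least one flag


def run_pipeline(
    pdf_path: str,
    ocr_stage: Callable[[str], str],
    extract_stage: Callable[[str], dict],
    validate_stage: Callable[[dict], dict],
    deidentify_stage: Callable[[dict], dict],
) -> PipelineResult:
    # Stage 1: scanned PDF -> raw clinical text
    text = ocr_stage(pdf_path)
    # Stages 2 and 3: text -> structured record -> record with validation flags
    record = validate_stage(extract_stage(text))
    # Stage 4: produce the research-safe copy from the validated record
    anonymised = deidentify_stage(record)
    return PipelineResult(record, anonymised, bool(record.get("requires_review")))
```

Passing the stages in as arguments keeps the identifiable record and the anonymised copy as separate outputs, so they can be stored under different access controls.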
Clinical Document OCR
from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np

# Arguments follow the PaddleOCR 2.x API
ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

def extract_clinical_text(pdf_path: str) -> str:
    # Render each PDF page to an image at 300 DPI
    images = convert_from_path(pdf_path, dpi=300)
    full_text = []
    for img in images:
        result = ocr.ocr(np.array(img), cls=True)
        page_lines = []
        # result[0] is None when no text is detected on the page
        for line in result[0] or []:
            if line[1][1] > 0.75:  # Higher threshold for clinical accuracy
                page_lines.append(line[1][0])
        full_text.append("\n".join(page_lines))
    return "\n\n".join(full_text)
The 0.75 confidence threshold is deliberately strict because accuracy is critical in medical documents: a misread digit in a lab result or medication dosage could have clinical consequences. Lines that score at or below the threshold are dropped from the extracted text.
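Because the extraction function discards low-confidence lines, a dosage or lab value can disappear without anyone noticing. One possible variation, sketched below, keeps the rejected lines alongside their scores so they can be routed to a reviewer. `split_by_confidence` is a hypothetical helper, and it assumes the per-page result shape used in the code above: a list of `[box, (text, score)]` entries, or `None` for an empty page.

```python
from typing import List, Optional, Tuple


def split_by_confidence(
    page_result: Optional[list], threshold: float = 0.75
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Split one page of OCR output into accepted lines and low-confidence lines."""
    accepted: List[str] = []
    low_confidence: List[Tuple[str, float]] = []
    # An empty page is reported as None, so fall back to an empty list
    for _box, (text, score) in page_result or []:
        if score > threshold:
            accepted.append(text)
        else:
            # Keep the text and its score so a human can check the scan
            low_confidence.append((text, score))
    return accepted, low_confidence
```

The accepted lines can be joined exactly as before, while the low-confidence list can be attached to the record as an extra review flag.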
Clinical Data Extraction
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def parse_json(raw: str) -> dict:
    # Minimal helper (an assumption, not shown in the original): parse the outermost
    # {...} span so code fences or stray prose around the JSON are ignored
    return json.loads(raw[raw.find("{"):raw.rfind("}") + 1])

def extract_clinical_data(text: str) -> dict:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{
            "role": "system",
            "content": """Extract structured clinical data. Return JSON:
{"patient": {"name": "", "nhs_number": "", "dob": "", "address": ""},
"document_type": "lab_result|discharge|referral|prescription",
"date": "",
"diagnoses": [{"code": "ICD-10 if identifiable", "description": ""}],
"medications": [{"name": "", "dose": "", "frequency": "", "route": ""}],
"test_results": [{"test": "", "value": "", "unit": "", "reference_range": "", "flag": "normal|high|low"}],
"clinical_summary": "",
"follow_up": ["actions recommended"]}
Use null for fields not present. Flag uncertain extractions with "VERIFY:" prefix."""
        }, {"role": "user", "content": text}],
        max_tokens=1500, temperature=0.0
    )
    return parse_json(response.choices[0].message.content)
The model, served locally by vLLM through its OpenAI-compatible endpoint, returns the structured record. The "VERIFY:" prefix on uncertain fields lets the validation stage flag those items for human review.
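Since the prompt allows the "VERIFY:" prefix on any field, it can help to scan the whole record for it instead of checking individual keys. The sketch below walks the nested JSON and returns the path of every flagged string. `find_verify_fields` is a hypothetical helper, not part of the original pipeline.

```python
from typing import Any, List


def find_verify_fields(node: Any, path: str = "") -> List[str]:
    """Return dotted paths of all string values that start with 'VERIFY:'."""
    hits: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            hits += find_verify_fields(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_verify_fields(value, f"{path}[{index}]")
    elif isinstance(node, str) and node.lstrip().startswith("VERIFY:"):
        hits.append(path)
    return hits
```

Each returned path, such as `medications[0].dose`, can be turned into one review flag so the reviewer knows exactly which value to check against the scan.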
Clinical Validation Rules
def validate_clinical_data(data: dict) -> dict:
    # validate_dosage and validate_nhs_number are helpers defined elsewhere
    flags = []
    # Check medication dosages against known ranges
    for med in data.get("medications") or []:
        if med.get("dose") and not validate_dosage(med.get("name"), med["dose"]):
            flags.append(f"Unusual dosage: {med.get('name')} {med['dose']}")
    # Flag test results that the extraction step marked as uncertain
    for test in data.get("test_results") or []:
        if "VERIFY:" in str(test.get("value", "")):
            flags.append(f"OCR uncertain: {test.get('test')}")
    # Verify NHS number format (10 digits with check digit)
    nhs = (data.get("patient") or {}).get("nhs_number") or ""
    if nhs and not validate_nhs_number(nhs):
        flags.append("NHS number format invalid")
    data["validation_flags"] = flags
    data["requires_review"] = len(flags) > 0
    return data
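The validation code calls `validate_nhs_number` without showing it. A minimal sketch of that helper follows, using the standard Modulus 11 check: the first nine digits are weighted 10 down to 2, and the tenth digit must equal 11 minus the remainder of the weighted sum divided by 11 (with 11 mapped to 0, and 10 meaning the number is invalid). Accepting spaces and hyphens in the input is an assumption made here for convenience.

```python
import re


def validate_nhs_number(nhs: str) -> bool:
    """Check a 10-digit NHS number against its Modulus 11 check digit."""
    # Allow the common "943 476 5919" spacing (an assumption of this sketch)
    digits = re.sub(r"[\s-]", "", nhs)
    if not re.fullmatch(r"\d{10}", digits):
        return False
    # Weights 10, 9, ..., 2 applied to the first nine digits
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check value of 10 never corresponds to a valid number
    return check != 10 and check == int(digits[9])
```

The check digit catches single-digit OCR misreads, which makes it a useful signal that the patient identifier was read correctly from the scan.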
De-Identification for Research
Produce an anonymised copy for research and analytics by removing patient name, NHS number, date of birth, address, and any other identifiers. Replace dates with offsets from a reference point to preserve temporal relationships. This enables clinical research on the extracted data without exposing patient identity.
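The steps above can be sketched for the structured record as follows. This is an illustration under stated assumptions, not a complete de-identification routine: `deidentify`, `to_day_offset`, and `NHS_NUMBER_RE` are hypothetical names, dates are assumed to be ISO strings (`YYYY-MM-DD`), and only the regex half of the "LLM + regex" stage is shown.

```python
import copy
import re
from datetime import date
from typing import Optional

# Matches 10-digit numbers written as 9434765919, 943 476 5919 or 943-476-5919
NHS_NUMBER_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")


def to_day_offset(iso_date: Optional[str], reference: date) -> Optional[int]:
    """Convert an ISO date string to a number of days after the reference date."""
    if not iso_date:
        return None
    return (date.fromisoformat(iso_date) - reference).days


def deidentify(record: dict, reference: date) -> dict:
    """Return an anonymised copy of the record; the input is left unchanged."""
    out = copy.deepcopy(record)
    # Dropping the patient block removes name, NHS number, DOB and address
    patient = out.pop("patient", None) or {}
    # Replace the absolute document date with an offset from the reference point
    out["day_offset"] = to_day_offset(out.pop("date", None), reference)
    # Scrub identifiers that leaked into the free-text summary
    summary = out.get("clinical_summary") or ""
    summary = NHS_NUMBER_RE.sub("[NHS_NUMBER]", summary)
    name = patient.get("name")
    if name:
        summary = re.sub(re.escape(name), "[NAME]", summary, flags=re.IGNORECASE)
    out["clinical_summary"] = summary
    return out
```

Regex scrubbing only catches identifiers in predictable formats. Other free-text fields, partial names, and addresses would still need the LLM pass or a manual check before the copy is released for research.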
Compliance and Deployment
Medical data processing requires strict safeguards: compliance with the Caldicott principles, a UK GDPR lawful basis under Article 6 together with an Article 9 condition for special category health data, completion of the NHS Data Security and Protection Toolkit assessment if handling NHS data, access logging for audit, and encryption at rest and in transit. Deploy exclusively on private UK infrastructure. See document AI hosting for OCR infrastructure, model options, more tutorials, and healthcare use cases. Review infrastructure security for access controls.