You will build a pipeline that takes PDF resumes in any format (scanned images, native PDFs, mixed layouts), extracts the text with OCR where needed, and returns structured candidate profiles covering experience, skills, education, and contact details in a consistent JSON schema. The end result: your ATS receives standardised candidate data regardless of how creatively the CV was formatted. The throughput target is 1,000 CVs in under an hour, which is a budget of about 3.6 seconds per CV, and no candidate data leaves your infrastructure. The sections below walk through the pipeline on dedicated GPU infrastructure.
Pipeline Architecture
| Stage | Tool | Input | Output |
|---|---|---|---|
| 1. PDF extraction | PyPDF2 + pdf2image | PDF file | Text or images |
| 2. OCR (if needed) | PaddleOCR | Page images | Raw text |
| 3. Structured extraction | LLaMA 3.1 8B | Raw text | Candidate JSON |
| 4. Normalisation | Python | Candidate JSON | ATS-ready record |
Smart Text Extraction
import numpy as np
from paddleocr import PaddleOCR
from pdf2image import convert_from_path
from PyPDF2 import PdfReader

# Load the OCR model once at import time, not once per document
ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

def extract_text(pdf_path: str) -> str:
    # Try native text extraction first
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text() or ""
        text += page_text + "\n"
    # Fall back to OCR if native extraction yields little text
    if len(text.strip()) < 100:
        images = convert_from_path(pdf_path, dpi=300)
        text = ""
        for img in images:
            result = ocr.ocr(np.array(img), cls=True)
            # result[0] is None when no text is detected on a page
            lines = (result[0] or []) if result else []
            for line in lines:
                # Each line is [bounding_box, (text, confidence)]
                if line[1][1] > 0.7:
                    text += line[1][0] + "\n"
    return text
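PaddleOCR handles scanned and photographed CVs where native PDF text extraction fails. OCR runs only when native extraction returns fewer than 100 characters for the whole document, and the 0.7 confidence threshold drops low-confidence lines, which are typically OCR noise from decorative elements.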
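Because the fallback above is decided for the whole document, a mixed PDF with one native page and several scanned pages could pass the length check and skip OCR entirely. One possible refinement is to make the decision per page. The sketch below is a minimal illustration under stated assumptions: `merge_page_text`, the `ocr_page` callback, and the 30-character per-page threshold are all hypothetical names and values, not part of the original pipeline.

```python
from typing import Callable, List


def merge_page_text(
    native_pages: List[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 30,
) -> str:
    """Combine per-page text, calling OCR only for pages with little native text.

    native_pages: the text returned by page.extract_text() for each page.
    ocr_page: hypothetical callback that returns OCR text for a page index.
    min_chars: assumed per-page threshold; tune it on your own documents.
    """
    parts = []
    for index, native in enumerate(native_pages):
        if len(native.strip()) >= min_chars:
            # Enough native text: keep it and skip the slower OCR pass
            parts.append(native.strip())
        else:
            # Sparse or empty page: treat it as scanned and OCR it
            parts.append(ocr_page(index).strip())
    return "\n".join(parts)


# Usage with a stand-in OCR callback
pages = ["Jane Doe, Senior Engineer with ten years of experience", ""]
text = merge_page_text(pages, lambda i: f"[ocr text for page {i}]")
```

In a real pipeline the callback would wrap the `convert_from_path` and `ocr.ocr` calls for a single page, so OCR cost is only paid for the pages that need it.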
LLM Structured Parsing
import json

from openai import OpenAI

# Points at the local OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def parse_resume(text: str) -> dict:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{
            "role": "system",
            "content": """Extract structured data from this CV. Return JSON:
{"name": "", "email": "", "phone": "", "location": "",
"summary": "2-3 sentence professional summary",
"experience": [{"title": "", "company": "", "start": "", "end": "", "description": ""}],
"education": [{"degree": "", "institution": "", "year": ""}],
"skills": ["list"],
"certifications": ["list"],
"languages": ["list"]}
If a field is not found, use null. Parse dates as YYYY-MM format."""
        }, {"role": "user", "content": text}],
        max_tokens=1500, temperature=0.0
    )
    # json.loads raises json.JSONDecodeError if the reply is not pure JSON
    return json.loads(response.choices[0].message.content)
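The vLLM server handles the extraction requests through its OpenAI-compatible API. Setting the temperature to 0 makes decoding greedy, so similar CV formats are parsed far more consistently, although the reply is still not guaranteed to be valid JSON.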
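Instruction-tuned models sometimes wrap their JSON in a sentence of explanation or a code fence, in which case a bare `json.loads` call fails. A more tolerant parser is one way to handle this. The helper below is a sketch, and `extract_json_object` is a hypothetical name rather than part of any library; it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Return the first parseable JSON object embedded in a model reply."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it
            obj, _ = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = raw.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


# Usage: a reply with chatter around the payload
reply = 'Sure! {"name": "Jane Doe", "email": null} Let me know if you need more.'
profile = extract_json_object(reply)
```

If the helper raises `ValueError`, a reasonable policy is to retry the request once and then route the CV to manual review instead of failing the whole batch.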
Batch Processing API
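from typing import List

from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/parse-resumes")
async def parse_resumes(files: List[UploadFile]):
    # extract_text and parse_resume are blocking calls, so this handler
    # holds the event loop until the whole batch has been processed
    results = []
    for file in files:
        # save_upload and calculate_completeness are not defined in this
        # tutorial; supply your own implementations
        path = save_upload(file)
        text = extract_text(path)
        parsed = parse_resume(text)
        parsed["source_file"] = file.filename
        parsed["confidence"] = calculate_completeness(parsed)
        results.append(parsed)
    return {"candidates": results, "processed": len(results)}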
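The endpoint calls `calculate_completeness` without defining it. One plausible reading is a score for how much of the schema was filled in. The version below is a sketch under that assumption: the split into required and optional fields and the double weighting of required fields are illustrative choices, not taken from the original.

```python
from typing import Any, Dict

# Assumed weighting: fields an ATS record can hardly do without count double
REQUIRED_FIELDS = ("name", "email", "experience")
OPTIONAL_FIELDS = ("phone", "location", "summary", "education", "skills",
                   "certifications", "languages")


def calculate_completeness(parsed: Dict[str, Any]) -> float:
    """Return a 0.0 to 1.0 score for how many schema fields were filled."""
    def filled(key: str) -> bool:
        # Missing keys, null, empty strings and empty lists all count as unfilled
        return parsed.get(key) not in (None, "", [], {})

    score = 2 * sum(filled(k) for k in REQUIRED_FIELDS)
    score += sum(filled(k) for k in OPTIONAL_FIELDS)
    total = 2 * len(REQUIRED_FIELDS) + len(OPTIONAL_FIELDS)
    return round(score / total, 2)
```

Note that this measures how complete a profile is, not whether the extracted values are correct, so a high score does not replace spot checks against the source CVs.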
Data Normalisation
Normalise extracted data before loading into your ATS: standardise date formats, map skill variations to canonical names (e.g., “JS”, “JavaScript”, “javascript” all become “JavaScript”), validate email formats, and flag incomplete profiles for manual review. Maintain a skills taxonomy that maps common variations.
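The skill mapping, email validation, and review flag described above could look roughly like this. Treat it as a starting sketch: the `SKILL_ALIASES` entries, the deliberately loose email pattern, and the function names are all hypothetical, and a production taxonomy would be much larger.

```python
import re
from typing import List, Optional

# Hypothetical starter taxonomy: lowercase variant -> canonical name
SKILL_ALIASES = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "py": "Python",
    "python": "Python",
}

# Loose pattern: catches obvious breakage, not full RFC 5322 validation
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalise_skills(skills: Optional[List[str]]) -> List[str]:
    """Map variants to canonical names, drop blanks and duplicates, keep order."""
    seen, result = set(), []
    for skill in skills or []:
        cleaned = skill.strip()
        if not cleaned:
            continue
        # Unknown skills pass through unchanged
        canonical = SKILL_ALIASES.get(cleaned.lower(), cleaned)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


def is_valid_email(email: Optional[str]) -> bool:
    return bool(email) and EMAIL_PATTERN.match(email.strip()) is not None


def needs_review(profile: dict) -> bool:
    """Flag profiles missing a name or a usable email for manual review."""
    return not profile.get("name") or not is_valid_email(profile.get("email"))
```

Keeping the alias table as plain data means recruiters can extend it without touching the code, for example by loading it from a reviewed CSV file.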
Compliance and Production
CV parsing involves personal data, so ensure GDPR compliance: establish a lawful basis for processing, define retention periods, and notify candidates. Implement bias testing to verify that the parser does not systematically miss information from non-standard CV formats common in specific demographic groups. Deploy on private infrastructure for data protection. See document AI hosting for OCR infrastructure, model options, GDPR compliance, more tutorials, and HR use cases.
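A simple first step towards the bias testing mentioned above is to compare how often fields are filled across CV format groups on a hand-labelled test set. The sketch below assumes each parsed record carries a hypothetical `layout` label that you assign yourself, and the 0.1 gap threshold is an arbitrary starting point. It is a screening check, not a full fairness audit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def fill_rate_by_group(
    records: Iterable[dict], group_key: str, fields: List[str]
) -> Dict[str, float]:
    """Share of `fields` that are non-empty, per value of `group_key`."""
    totals = defaultdict(lambda: [0, 0])  # group -> [filled, possible]
    for record in records:
        group = record.get(group_key, "unknown")
        totals[group][0] += sum(1 for f in fields if record.get(f))
        totals[group][1] += len(fields)
    return {g: filled / possible
            for g, (filled, possible) in totals.items() if possible}


def flag_gaps(rates: Dict[str, float], max_gap: float = 0.1) -> List[str]:
    """Groups whose fill rate trails the best group by more than max_gap."""
    if not rates:
        return []
    best = max(rates.values())
    return sorted(g for g, r in rates.items() if best - r > max_gap)


# Usage on a tiny labelled sample
records = [
    {"layout": "single_column", "name": "A", "email": "a@example.com"},
    {"layout": "two_column", "name": "B", "email": None},
    {"layout": "two_column", "name": None, "email": None},
]
rates = fill_rate_by_group(records, "layout", ["name", "email"])
flagged = flag_gaps(rates)
```

A flagged group is a prompt to inspect those CVs by hand and check whether extraction or the prompt needs adjusting for that format.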