
Resume Parser Pipeline with OCR and LLM

Build a resume parsing pipeline that extracts structured candidate data from PDF CVs using PaddleOCR and an LLM for recruitment automation on a dedicated GPU server.

You will build a pipeline that takes PDF resumes in any format (scanned images, native PDFs, mixed layouts), extracts text with OCR where needed, and returns structured candidate profiles with experience, skills, education, and contact details in a consistent JSON schema. The end result: your ATS receives standardised candidate data regardless of how creatively the CV was formatted, you can process 1,000 CVs in under an hour, and no candidate data never leaves your infrastructure. Here is how to build the pipeline on dedicated GPU infrastructure.

Pipeline Architecture

| Stage | Tool | Input | Output |
|---|---|---|---|
| 1. PDF extraction | PyPDF2 + pdf2image | PDF file | Text or images |
| 2. OCR (if needed) | PaddleOCR | Page images | Raw text |
| 3. Structured extraction | LLaMA 3.1 8B | Raw text | Candidate JSON |
| 4. Normalisation | Python | Candidate JSON | ATS-ready record |
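The four stages compose into one function per CV. A minimal sketch of the wiring, with the stage functions (implemented in the sections below) passed in as callables so each can be swapped or tested independently; `run_pipeline` and `normalise` are illustrative names, not part of any library:

```python
from typing import Callable

def run_pipeline(
    pdf_path: str,
    extract: Callable[[str], str],      # stages 1-2: native text or OCR fallback
    parse: Callable[[str], dict],       # stage 3: LLM structured extraction
    normalise: Callable[[dict], dict],  # stage 4: ATS-ready record
) -> dict:
    """Run one CV through all four stages and tag the result with its source."""
    text = extract(pdf_path)
    record = normalise(parse(text))
    record["source_file"] = pdf_path
    return record
```

Keeping the stages as plain callables also makes it easy to stub out the LLM when testing the surrounding plumbing.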

Smart Text Extraction

from PyPDF2 import PdfReader
from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np

ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

def extract_text(pdf_path: str) -> str:
    # Try native text extraction first
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text() or ""
        text += page_text + "\n"

    # Fall back to OCR if native extraction yields little text
    if len(text.strip()) < 100:
        images = convert_from_path(pdf_path, dpi=300)
        text = ""
        for img in images:
            result = ocr.ocr(np.array(img), cls=True)
            if not result or not result[0]:
                continue  # no text detected on this page
            for line in result[0]:
                # each line is [box, (text, confidence)]
                if line[1][1] > 0.7:
                    text += line[1][0] + "\n"
    return text

PaddleOCR handles scanned and photographed CVs where native PDF text extraction fails. The confidence threshold filters out OCR noise from decorative elements.
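The filtering step can be checked in isolation. PaddleOCR returns, per page, a list of `[box, (text, confidence)]` entries; the mock result below is illustrative, with boxes elided:

```python
def filter_ocr_lines(page_result, min_conf=0.7):
    """Keep only OCR lines whose confidence clears the threshold."""
    if not page_result:
        return []
    return [text for _box, (text, conf) in page_result if conf > min_conf]

# Mock page result in PaddleOCR's per-page output shape
page = [
    [None, ("Jane Doe", 0.98)],
    [None, ("Senior Engineer", 0.95)],
    [None, ("~~decorative rule~~", 0.41)],  # OCR noise, filtered out
]
```

Tune `min_conf` against a sample of your own scans: too low lets decorative noise through, too high drops faint but real text.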

LLM Structured Parsing

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def parse_resume(text: str) -> dict:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{
            "role": "system",
            "content": """Extract structured data from this CV. Return JSON:
{"name": "", "email": "", "phone": "", "location": "",
 "summary": "2-3 sentence professional summary",
 "experience": [{"title": "", "company": "", "start": "", "end": "", "description": ""}],
 "education": [{"degree": "", "institution": "", "year": ""}],
 "skills": ["list"],
 "certifications": ["list"],
 "languages": ["list"]}
If a field is not found, use null. Parse dates as YYYY-MM format."""
        }, {"role": "user", "content": text}],
        max_tokens=1500, temperature=0.0
    )
    content = response.choices[0].message.content.strip()
    # The model occasionally wraps its JSON in markdown fences; strip them first
    content = content.removeprefix("```json").removeprefix("```").removesuffix("```")
    return json.loads(content)

A local vLLM server serving Llama 3.1 8B handles the extraction requests through its OpenAI-compatible API. Temperature 0 makes parsing deterministic, so similar CV formats produce consistent output.
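Even at temperature 0, an 8B model can drop fields or mangle values, so it is worth validating each parsed record before it reaches the ATS. A minimal sketch; the required-field list is an assumption to adjust to your own schema:

```python
REQUIRED_FIELDS = {"name", "email", "experience", "education", "skills"}

def validate_candidate(parsed: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in parsed]
    if parsed.get("email") and "@" not in parsed["email"]:
        problems.append("invalid email")
    for job in parsed.get("experience") or []:
        if not isinstance(job, dict):
            problems.append("malformed experience entry")
    return problems
```

Records that fail validation can be retried once or routed to the manual-review queue described below.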

Batch Processing API

from fastapi import FastAPI, UploadFile
from typing import List
import os
import tempfile

app = FastAPI()

def save_upload(file: UploadFile) -> str:
    """Write the upload to a temporary file and return its path."""
    suffix = os.path.splitext(file.filename or "upload.pdf")[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(file.file.read())
        return tmp.name

def calculate_completeness(parsed: dict) -> float:
    """Fraction of top-level fields the parser actually filled."""
    if not parsed:
        return 0.0
    filled = sum(1 for v in parsed.values() if v not in (None, "", [], {}))
    return round(filled / len(parsed), 2)

@app.post("/parse-resumes")
async def parse_resumes(files: List[UploadFile]):
    results = []
    for file in files:
        path = save_upload(file)
        text = extract_text(path)
        parsed = parse_resume(text)
        parsed["source_file"] = file.filename
        parsed["confidence"] = calculate_completeness(parsed)
        results.append(parsed)
    return {"candidates": results, "processed": len(results)}

Data Normalisation

Normalise extracted data before loading into your ATS: standardise date formats, map skill variations to canonical names (e.g., “JS”, “JavaScript”, “javascript” all become “JavaScript”), validate email formats, and flag incomplete profiles for manual review. Maintain a skills taxonomy that maps common variations.
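A sketch of the skills-taxonomy lookup described above; the mapping here is a tiny illustrative sample, not a complete taxonomy:

```python
SKILL_TAXONOMY = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "py": "Python",
    "python": "Python",
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
}

def normalise_skills(skills: list[str]) -> list[str]:
    """Map skill variations to canonical names, deduplicating but keeping order."""
    seen, out = set(), []
    for skill in skills:
        canonical = SKILL_TAXONOMY.get(skill.strip().lower(), skill.strip())
        if canonical not in seen:
            seen.add(canonical)
            out.append(canonical)
    return out
```

Unknown skills pass through unchanged, so the taxonomy can grow incrementally as you review flagged profiles.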

Compliance and Production

CV parsing involves personal data, so ensure GDPR compliance: a lawful basis for processing, defined retention periods, and candidate notification. Implement bias testing to verify the parser does not systematically miss information from non-standard CV formats common in specific demographic groups. Deploy on private infrastructure so candidate data stays under your control; see our document AI hosting guide for OCR infrastructure and model options.
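Retention periods can be enforced with a simple scheduled sweep. A minimal sketch, assuming stored CVs live as files in one directory and that `RETENTION_DAYS` comes from your data-protection policy:

```python
import time
from pathlib import Path

RETENTION_DAYS = 180  # assumption: set from your data-protection policy

def purge_expired(storage_dir: str, retention_days: int = RETENTION_DAYS) -> int:
    """Delete stored CV files older than the retention period; return count removed."""
    cutoff = time.time() - retention_days * 86400
    removed = 0
    for path in Path(storage_dir).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```

Run it from cron or a scheduler, and log what was deleted so you can evidence compliance during an audit.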

HR AI GPU Servers

Dedicated GPU servers for resume parsing and recruitment AI. Process candidate data on isolated, GDPR-compliant UK infrastructure.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
