Resume Parser Pipeline with OCR and LLM

Build a resume parsing pipeline that extracts structured candidate data from PDF CVs using PaddleOCR and an LLM for recruitment automation on a dedicated GPU server.

Tutorials April 16, 2026 3 min read gigagpu

You will build a pipeline that takes PDF resumes in any format (scanned images, native PDFs, mixed layouts), extracts text natively or with OCR where needed, and returns structured candidate profiles covering experience, skills, education, and contact details in a consistent JSON schema. The end result: your ATS receives standardised candidate data regardless of how creatively the CV was formatted. The pipeline is designed to process 1,000 CVs in under an hour, and no candidate data leaves your infrastructure. Here is the pipeline on dedicated GPU infrastructure.

Pipeline Architecture

Stage	Tool	Input	Output
1. PDF extraction	PyPDF2 + pdf2image	PDF file	Text or images
2. OCR (if needed)	PaddleOCR	Page images	Raw text
3. Structured extraction	LLaMA 3.1 8B	Raw text	Candidate JSON
4. Normalisation	Python	Candidate JSON	ATS-ready record

Smart Text Extraction

from PyPDF2 import PdfReader
from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np

ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True)

def extract_text(pdf_path: str) -> str:
    # Try native text extraction first
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text() or ""
        text += page_text + "\n"

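    # Fall back to OCR if native extraction yields little text
    # (under 100 characters suggests a scanned or image-only PDF)
    if len(text.strip()) < 100:
        images = convert_from_path(pdf_path, dpi=300)
        text = ""
        for img in images:
            result = ocr.ocr(np.array(img), cls=True)
            # Some PaddleOCR versions return [None] for a page with no text
            if not result or not result[0]:
                continue
            for line in result[0]:
                # Each line is [bounding_box, (text, confidence)]
                if line[1][1] > 0.7:
                    text += line[1][0] + "\n"
    return text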

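PaddleOCR handles scanned and photographed CVs where native PDF text extraction fails. The 0.7 confidence threshold filters out OCR noise from decorative elements. Because the 100-character check applies to the whole document, a mostly native PDF with a few scanned pages skips OCR entirely, so those pages contribute no text.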
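For mixed layouts, one option is to make the native-versus-OCR decision per page rather than per document. The sketch below is an assumption about how that could look, not part of the original pipeline: `merge_page_text`, its `min_chars` threshold, and the `ocr_page` callback are hypothetical names, and the callback stands in for whatever renders and OCRs a single page.

```python
from typing import Callable, List


def merge_page_text(
    native_pages: List[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 50,  # hypothetical per-page threshold; tune on your own CVs
) -> str:
    """Combine per-page text, calling OCR only for pages with little native text."""
    parts = []
    for index, page_text in enumerate(native_pages):
        if len(page_text.strip()) >= min_chars:
            # Native extraction produced enough text; keep it as-is
            parts.append(page_text.strip())
        else:
            # Likely a scanned page; ask the caller-supplied OCR function
            parts.append(ocr_page(index).strip())
    return "\n".join(parts)
```

In this pipeline, `native_pages` could be built with `[p.extract_text() or "" for p in reader.pages]`, and `ocr_page` could render one page with `convert_from_path(pdf_path, dpi=300, first_page=i + 1, last_page=i + 1)` before passing it to PaddleOCR.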

LLM Structured Parsing

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def parse_resume(text: str) -> dict:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{
            "role": "system",
            "content": """Extract structured data from this CV. Return JSON:
{"name": "", "email": "", "phone": "", "location": "",
 "summary": "2-3 sentence professional summary",
 "experience": [{"title": "", "company": "", "start": "", "end": "", "description": ""}],
 "education": [{"degree": "", "institution": "", "year": ""}],
 "skills": ["list"],
 "certifications": ["list"],
 "languages": ["list"]}
If a field is not found, use null. Parse dates as YYYY-MM format."""
        }, {"role": "user", "content": text}],
        max_tokens=1500, temperature=0.0
    )
    return json.loads(response.choices[0].message.content)

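The vLLM server processes extraction requests through its OpenAI-compatible endpoint at localhost:8000. Temperature 0 makes decoding greedy, which keeps parsing consistent across similar CV formats. Note that json.loads raises an exception if the model wraps the JSON in extra text or the reply is cut off at max_tokens, so handle that case before going to production.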
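Model replies are not guaranteed to be bare JSON, so a tolerant loader can help. The following is a minimal sketch, assuming the reply contains exactly one JSON object possibly surrounded by prose or code fences. `load_candidate_json` and `EXPECTED_KEYS` are hypothetical names; the key list simply mirrors the schema in the system prompt above.

```python
import json
from typing import Any, Dict

# Top-level keys from the schema in the system prompt
EXPECTED_KEYS = (
    "name", "email", "phone", "location", "summary",
    "experience", "education", "skills", "certifications", "languages",
)


def load_candidate_json(raw: str) -> Dict[str, Any]:
    """Parse a model reply, tolerating text around the JSON object."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError subclasses ValueError, so callers can catch one type
    data = json.loads(raw[start:end + 1])
    # Keep only the expected keys; anything the model omitted becomes None
    return {key: data.get(key) for key in EXPECTED_KEYS}
```

With this helper, the last line of `parse_resume` could become `return load_candidate_json(response.choices[0].message.content)`, and the caller could catch `ValueError` to route unparseable CVs to manual review.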

Batch Processing API

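from typing import List

from fastapi import FastAPI, UploadFile

app = FastAPI()

# save_upload() and calculate_completeness() are application-specific helpers
# that must be defined elsewhere in your project.

@app.post("/parse-resumes")
def parse_resumes(files: List[UploadFile]):
    # Declared with def rather than async def so FastAPI runs the blocking
    # OCR and LLM calls in its threadpool instead of on the event loop.
    results = []
    for file in files:
        path = save_upload(file)
        text = extract_text(path)
        parsed = parse_resume(text)
        parsed["source_file"] = file.filename
        parsed["confidence"] = calculate_completeness(parsed)
        results.append(parsed)
    return {"candidates": results, "processed": len(results)}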
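The endpoint relies on `save_upload` and `calculate_completeness`, which the tutorial does not show. A minimal sketch of both follows, under stated assumptions: the upload object exposes `.filename` and a binary `.file` stream (as FastAPI's `UploadFile` does), and "completeness" means the share of a hypothetical `REQUIRED_FIELDS` list that came back non-empty. Neither is necessarily how the original helpers were written.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Any, Dict

# Hypothetical choice of fields that count towards the completeness score
REQUIRED_FIELDS = ("name", "email", "phone", "experience", "education", "skills")


def save_upload(file: Any) -> str:
    """Copy an uploaded file to a temporary path and return that path."""
    suffix = Path(file.filename or "").suffix or ".pdf"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(file.file, tmp)
        return tmp.name


def calculate_completeness(parsed: Dict[str, Any]) -> float:
    """Return the fraction of required fields that are present and non-empty."""
    filled = sum(1 for field in REQUIRED_FIELDS if parsed.get(field))
    return round(filled / len(REQUIRED_FIELDS), 2)
```

The temporary file is not deleted automatically, so the endpoint should remove it once `extract_text` has finished, which also avoids leaving candidate data on disk longer than needed.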

Data Normalisation

Normalise extracted data before loading into your ATS: standardise date formats, map skill variations to canonical names (e.g., “JS”, “JavaScript”, “javascript” all become “JavaScript”), validate email formats, and flag incomplete profiles for manual review. Maintain a skills taxonomy that maps common variations.
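The normalisation steps above can be sketched as a few small functions. This is an illustration under assumptions rather than a complete implementation: `SKILL_ALIASES` is a tiny hypothetical taxonomy, the email pattern is a deliberately loose shape check, and `normalise_date` only handles values that begin with a four-digit year followed by a month.

```python
import re
from typing import List, Optional

# Hypothetical starter taxonomy: lower-cased variant -> canonical name
SKILL_ALIASES = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "py": "Python",
    "python": "Python",
}

# Loose shape check (something@something.tld), not full RFC validation
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalise_skills(skills: Optional[List[str]]) -> List[str]:
    """Map skill variants to canonical names and drop duplicates, keeping order."""
    canonical_skills: List[str] = []
    for skill in skills or []:
        key = skill.strip().lower()
        if not key:
            continue
        canonical = SKILL_ALIASES.get(key, skill.strip())
        if canonical not in canonical_skills:
            canonical_skills.append(canonical)
    return canonical_skills


def is_valid_email(email: Optional[str]) -> bool:
    """Return True when the value looks like an email address."""
    return bool(email) and EMAIL_RE.match(email.strip()) is not None


def normalise_date(value: Optional[str]) -> Optional[str]:
    """Return YYYY-MM when the value starts with a year and a valid month."""
    if not value:
        return None
    match = re.match(r"^(\d{4})[-/.](\d{1,2})", value.strip())
    if not match:
        return None
    month = int(match.group(2))
    if not 1 <= month <= 12:
        return None
    return f"{match.group(1)}-{month:02d}"
```

Values that fail these checks, such as an end date of "Present" or a malformed email, are good candidates for the manual review flag rather than silent correction.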

Compliance and Production

CV parsing involves personal data, so ensure GDPR compliance: establish a lawful basis for processing, define retention periods, and notify candidates. Implement bias testing to verify that the parser does not systematically miss information from non-standard CV formats common in specific demographic groups. Deploy on private infrastructure for data protection. See document AI hosting for OCR infrastructure, model options, GDPR compliance, more tutorials, and HR use cases.
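One coarse way to start on bias testing is to compare completeness scores across CV format groups on a labelled test set. The sketch below assumes you have already tagged each test CV with a format label and computed a completeness score for it. `completeness_gaps`, the labels, and the 0.1 gap are all hypothetical, and a check like this is a first screen rather than a full bias audit.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def completeness_gaps(
    scores: Iterable[Tuple[str, float]],
    max_gap: float = 0.1,  # hypothetical tolerance below the overall mean
) -> Dict[str, float]:
    """Return groups whose mean completeness is more than max_gap below the overall mean."""
    by_group: Dict[str, List[float]] = defaultdict(list)
    for group, score in scores:
        by_group[group].append(score)
    all_scores = [s for values in by_group.values() for s in values]
    if not all_scores:
        return {}
    overall = mean(all_scores)
    return {
        group: round(mean(values), 3)
        for group, values in by_group.items()
        if overall - mean(values) > max_gap
    }
```

Any group this flags deserves a manual look at sample outputs to find out which fields the parser is missing and why.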

HR AI GPU Servers

Dedicated GPU servers for resume parsing and recruitment AI. Process candidate data on isolated, GDPR-compliant UK infrastructure.
