
Migrate from HF Endpoints: Named Entity Recognition

Move NER pipelines from Hugging Face Inference Endpoints to dedicated GPUs for batch-optimised entity extraction, custom entity types, and fixed-cost processing at any volume.

Named Entity Recognition at Scale Is an Infrastructure Problem, Not an API Problem

A legal tech startup processes 40,000 contracts per month, extracting company names, dates, monetary values, and jurisdiction references from every document. Their Hugging Face Inference Endpoint, running a fine-tuned RoBERTa NER model, handles the work, but the economics are grim. Each contract averages 8,000 tokens, so extracting a single contract takes 4-6 chunked API round trips, and the endpoint tops out at roughly 60 requests per second. Multiply that across 40,000 contracts and the monthly endpoint cost sits at $2,800 — for a model that fits comfortably on a single consumer-grade GPU. On a dedicated RTX 6000 Pro, the same pipeline processes the entire monthly volume in under 48 hours of compute time, at a fixed cost regardless of how many entities it extracts.

This guide covers the complete migration of NER workloads from HF Inference Endpoints to self-hosted GPU infrastructure.

HF Endpoints vs. Dedicated for NER

| NER Capability | HF Inference Endpoints | Dedicated GPU |
| --- | --- | --- |
| Model options | HF Hub token classification models | Any model, including custom NER architectures |
| Entity aggregation | Basic (simple/first/average/max) | Custom aggregation with domain logic |
| Long document handling | 512-token limit per request | Sliding window with overlap stitching |
| Batch throughput | ~100 docs/sec (single-request) | ~2,000+ docs/sec (batched, GPU-optimised) |
| Custom entity types | Must deploy new model version | Hot-swap models, ensemble multiple NER heads |
| Post-processing | Limited to API response format | Custom entity linking, normalisation, dedup |

Building a Production NER Pipeline

Step 1: Deploy your NER model. On your GigaGPU dedicated server, load the same model you were running on HF Endpoints — or upgrade to a better one now that you have full GPU access:

pip install transformers torch accelerate fastapi uvicorn

from transformers import pipeline

# Use the same model from your HF Endpoint
ner = pipeline("ner",
    model="dslim/bert-large-NER",
    device=0,
    aggregation_strategy="max"
)

# Or upgrade to a more capable model
ner = pipeline("ner",
    model="Jean-Baptiste/camembert-ner-with-dates",
    device=0,
    aggregation_strategy="max"
)
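
A quick sanity check confirms the pipeline returns grouped entities with character offsets, comparable to the JSON the endpoint returned. The sample sentence here is purely illustrative:

entities = ner("Angela Merkel visited the Apple campus in Cupertino.")
for ent in entities:
    # Each entity carries a label, surface form, confidence score,
    # and character offsets into the input text
    print(ent["entity_group"], ent["word"],
          round(float(ent["score"]), 3), ent["start"], ent["end"])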

Step 2: Handle long documents properly. HF Endpoints silently truncate inputs beyond 512 tokens. On dedicated hardware, implement a sliding window approach that processes entire documents without losing entities at chunk boundaries:

from transformers import AutoTokenizer

# Must match the pipeline's model; a fast tokenizer is required
# for return_offsets_mapping
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-large-NER")

def extract_entities_long(text: str, chunk_size=450,
                          overlap=50) -> list:
    # Map every token to its character span in the original text
    offsets = tokenizer(text, add_special_tokens=False,
                        return_offsets_mapping=True)["offset_mapping"]
    all_entities = []
    for start in range(0, len(offsets), chunk_size - overlap):
        window = offsets[start:start + chunk_size]
        if not window:
            break
        chunk_start = window[0][0]
        chunk_text = text[chunk_start:window[-1][1]]
        entities = ner(chunk_text)
        for ent in entities:
            # Shift character offsets back into the full document
            ent["start"] += chunk_start
            ent["end"] += chunk_start
        all_entities.extend(entities)
    return deduplicate_overlapping(all_entities)
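
The deduplicate_overlapping helper is not shown above; a minimal sketch follows. It assumes that when two entities from adjacent chunks cover overlapping character spans, keeping one of them is acceptable:

def deduplicate_overlapping(entities: list) -> list:
    # Greedy left-to-right pass: sort by position, preferring higher
    # scores when two spans start at the same offset
    entities = sorted(entities,
                      key=lambda e: (e["start"], -e["score"]))
    kept = []
    for ent in entities:
        # Skip any entity overlapping a span we already kept; these
        # are the duplicates produced by the chunk overlap region
        if kept and ent["start"] < kept[-1]["end"]:
            continue
        kept.append(ent)
    return kept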

Step 3: Build a batch processing API. Replace HF’s one-at-a-time API with a batch endpoint that processes dozens of documents per GPU forward pass:

from fastapi import FastAPI
from typing import List

app = FastAPI()

@app.post("/ner/batch")
def batch_ner(documents: List[str]):
    # Plain `def` (not `async def`) so FastAPI runs this blocking
    # GPU call in a worker thread instead of stalling the event loop
    results = ner(documents, batch_size=32)
    return {"results": [
        [{"entity": e["entity_group"], "word": e["word"],
          # Cast numpy float32 to a JSON-serialisable Python float
          "score": round(float(e["score"]), 4),
          "start": e["start"], "end": e["end"]}
         for e in doc_entities]
        for doc_entities in results
    ]}
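
Run the service with uvicorn (assuming the code above is saved as app.py), then POST a JSON array of documents. A minimal smoke test using the requests library, with an illustrative sample document:

uvicorn app:app --host 0.0.0.0 --port 8000

import requests

# Hypothetical sample input; the endpoint accepts any JSON array of strings
docs = ["Acme Corp signed the agreement on 12 March 2024 in London."]
resp = requests.post("http://localhost:8000/ner/batch", json=docs)
print(resp.json()["results"][0])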

Step 4: Add entity post-processing. This is where self-hosting truly shines — add custom logic that HF Endpoints cannot support, such as entity linking against your own knowledge base and domain-specific normalisation:

def enrich_entities(entities: list, knowledge_base: dict):
    enriched = []
    for ent in entities:
        canonical = knowledge_base.get(ent["word"].lower())
        enriched.append({
            **ent,
            "canonical_form": canonical or ent["word"],
            "in_knowledge_base": canonical is not None
        })
    return enriched
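
For example, with a hypothetical knowledge base mapping lowercased surface forms to canonical company names:

# Illustrative knowledge base; in production this comes from your own data
kb = {
    "acme corp": "Acme Corporation",
    "acme": "Acme Corporation",
}

entities = ner("Acme Corp signed a lease with Initech.")
for ent in enrich_entities(entities, kb):
    print(ent["word"], ent["canonical_form"], ent["in_knowledge_base"])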

Throughput and Cost

NER models are relatively small (typically 330M-1B parameters), which means a single RTX 6000 Pro 96 GB can process enormous volumes. On HF Endpoints the GPU spends most of its time idle between API calls; on dedicated hardware with proper batching, utilisation stays above 80%.
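
To keep the GPU saturated rather than handling one document per call, stream documents through the pipeline as a generator so transformers can assemble batches internally. A minimal sketch, assuming docs is any iterable of document strings and process() is a hypothetical downstream handler:

def doc_stream(docs):
    # Yielding one document at a time lets the pipeline
    # build GPU batches internally
    for doc in docs:
        yield doc

# Results are returned lazily, in input order
for entities in ner(doc_stream(docs), batch_size=64):
    process(entities)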

| Monthly Documents | HF Endpoints | Dedicated GPU | Savings |
| --- | --- | --- | --- |
| 10,000 | ~$85 | ~$1,800 | HF cheaper |
| 100,000 | ~$850 | ~$1,800 | HF cheaper |
| 500,000 | ~$4,250 | ~$1,800 | 58% savings |
| 2,000,000 | ~$17,000 | ~$1,800 | 89% savings |

At scale, dedicated hardware is dramatically cheaper because NER models are small enough to leave massive GPU headroom for parallel processing. The GPU vs API cost comparison tool models these economics precisely.

Entity Extraction as a Core Competency

If your product depends on NER — whether that’s contract analysis, medical record parsing, or financial document processing — the extraction pipeline should be infrastructure you own, not an API you rent. Self-hosted NER on dedicated GPUs gives you the freedom to fine-tune models on domain data, chain NER with downstream LLM processing via vLLM, and keep sensitive documents on private infrastructure.

Browse open-source model hosting for deploying specialised NER models, check the LLM cost calculator for estimates, or explore the tutorials section and cost analysis guides for more migration patterns.

Extract Entities From Millions of Documents at Fixed Cost

Self-hosted NER on GigaGPU dedicated GPUs processes thousands of documents per second. No per-request fees, no truncation limits, full post-processing control.

Browse GPU Servers

Filed under: Tutorials
