# Named Entity Recognition at Scale Is an Infrastructure Problem, Not an API Problem
A legal tech startup processes 40,000 contracts per month, extracting company names, dates, monetary values, and jurisdiction references from every document. Their Hugging Face Inference Endpoint, running a fine-tuned RoBERTa NER model, handles the work, but the economics are grim. Each contract averages 8,000 tokens, so it must be split across 4-6 chunked API round trips, and the endpoint sustains only about 60 requests per second. Multiplied across 40,000 contracts, the monthly endpoint bill sits at $2,800 for a model that fits comfortably on a single consumer-grade GPU. On a dedicated RTX 6000 Pro, the same pipeline processes the entire monthly volume in under 48 hours of compute time, at a fixed cost regardless of how many entities it extracts.
This guide covers the complete migration of NER workloads from HF Inference Endpoints to self-hosted GPU infrastructure.
## HF Endpoints vs. Dedicated for NER
| NER Capability | HF Inference Endpoints | Dedicated GPU |
|---|---|---|
| Model options | HF Hub token classification models | Any model, including custom NER architectures |
| Entity aggregation | Basic (simple/first/average/max) | Custom aggregation with domain logic |
| Long document handling | 512 token limit per request | Sliding window with overlap stitching |
| Batch throughput | ~100 docs/sec (single-request) | ~2,000+ docs/sec (batched, GPU-optimised) |
| Custom entity types | Must deploy new model version | Hot-swap models, ensemble multiple NER heads |
| Post-processing | Limited to API response format | Custom entity linking, normalisation, dedup |
## Building a Production NER Pipeline
Step 1: Deploy your NER model. On your GigaGPU dedicated server, load the same model you were running on HF Endpoints — or upgrade to a better one now that you have full GPU access:
```bash
pip install transformers torch accelerate fastapi uvicorn
```
```python
from transformers import pipeline

# Use the same model from your HF Endpoint
ner = pipeline(
    "ner",
    model="dslim/bert-large-NER",
    device=0,
    aggregation_strategy="max",
)

# Or upgrade to a more capable model
ner = pipeline(
    "ner",
    model="Jean-Baptiste/camembert-ner-with-dates",
    device=0,
    aggregation_strategy="max",
)
```
Step 2: Handle long documents properly. HF Endpoints silently truncate inputs beyond 512 tokens. On dedicated hardware, implement a sliding window approach that processes entire documents without losing entities at chunk boundaries:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-large-NER")

def extract_entities_long(text: str, chunk_size: int = 450,
                          overlap: int = 50) -> list:
    # Tokenise with an offset mapping so entities found in a chunk can
    # be mapped back to character positions in the full document
    # (requires a fast tokenizer)
    offsets = tokenizer(text, add_special_tokens=False,
                        return_offsets_mapping=True)["offset_mapping"]
    all_entities = []
    for start in range(0, len(offsets), chunk_size - overlap):
        window = offsets[start:start + chunk_size]
        char_start = window[0][0]
        chunk_text = text[char_start:window[-1][1]]
        entities = ner(chunk_text)
        for ent in entities:
            # Pipeline offsets are relative to the chunk; shift them
            # back to document-level character offsets
            ent["start"] += char_start
            ent["end"] += char_start
        all_entities.extend(entities)
    return deduplicate_overlapping(all_entities)
```
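The `deduplicate_overlapping` call is left undefined above. A minimal sketch: sort entities by position and, wherever two spans from adjacent chunks overlap, keep the higher-scoring one. The function name comes from the snippet above; the merge rule is one reasonable choice, not the only one:

```python
def deduplicate_overlapping(entities: list) -> list:
    # Overlapping chunks can report the same span twice; sort by start
    # offset and resolve each collision in favour of the higher score
    entities = sorted(entities, key=lambda e: (e["start"], -e["score"]))
    kept = []
    for ent in entities:
        if kept and ent["start"] < kept[-1]["end"]:
            # Overlaps the previously kept entity
            if ent["score"] > kept[-1]["score"]:
                kept[-1] = ent
            continue
        kept.append(ent)
    return kept
```

More elaborate variants might also require matching entity types before merging, or expand the kept span to cover both candidates.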
Step 3: Build a batch processing API. Replace HF’s one-at-a-time API with a batch endpoint that processes dozens of documents per GPU forward pass:
```python
from typing import List

from fastapi import FastAPI

app = FastAPI()

@app.post("/ner/batch")
def batch_ner(documents: List[str]):
    # Plain `def` (not `async def`) so FastAPI runs the blocking GPU
    # call in a worker thread instead of stalling the event loop
    results = ner(documents, batch_size=32)
    return {"results": [
        [{"entity": e["entity_group"], "word": e["word"],
          # Cast to float: numpy float32 is not JSON-serialisable
          "score": round(float(e["score"]), 4),
          "start": e["start"], "end": e["end"]}
         for e in doc_entities]
        for doc_entities in results
    ]}
```
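A usage sketch for the batch endpoint, assuming the code above is saved as `app.py` and served on port 8000 (both are assumptions; adjust for your deployment):

```shell
# Start the service
uvicorn app:app --host 0.0.0.0 --port 8000

# From a client, send a batch of documents in a single request
curl -s -X POST http://localhost:8000/ner/batch \
  -H "Content-Type: application/json" \
  -d '["Acme Corp signed the agreement on 12 March 2024.",
       "Jurisdiction: State of New York."]'
```

The response contains one entity list per input document, in the same order as the request body.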
Step 4: Add entity post-processing. This is where self-hosting truly shines — add custom logic that HF Endpoints cannot support, such as entity linking against your own knowledge base and domain-specific normalisation:
```python
def enrich_entities(entities: list, knowledge_base: dict) -> list:
    # Attach the canonical name for each surface form we recognise,
    # and flag entities that are missing from the knowledge base
    enriched = []
    for ent in entities:
        canonical = knowledge_base.get(ent["word"].lower())
        enriched.append({
            **ent,
            "canonical_form": canonical or ent["word"],
            "in_knowledge_base": canonical is not None,
        })
    return enriched
```
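A quick usage sketch with a toy knowledge base (the function is repeated so the snippet runs standalone; the company names and entity values are purely illustrative):

```python
def enrich_entities(entities: list, knowledge_base: dict) -> list:
    # Same logic as above: look up each surface form, lowercased
    enriched = []
    for ent in entities:
        canonical = knowledge_base.get(ent["word"].lower())
        enriched.append({**ent,
                         "canonical_form": canonical or ent["word"],
                         "in_knowledge_base": canonical is not None})
    return enriched

# Toy knowledge base mapping surface forms to canonical names
kb = {"ibm": "International Business Machines Corporation",
      "acme corp": "Acme Corporation"}

entities = [{"entity_group": "ORG", "word": "IBM", "start": 0, "end": 3},
            {"entity_group": "ORG", "word": "Foo Ltd", "start": 20, "end": 27}]

for ent in enrich_entities(entities, kb):
    print(ent["word"], "->", ent["canonical_form"], ent["in_knowledge_base"])
# → IBM -> International Business Machines Corporation True
# → Foo Ltd -> Foo Ltd False
```

In production the dictionary lookup would typically give way to fuzzy or alias-aware matching, but the shape of the enrichment step stays the same.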
## Throughput and Cost
NER models are relatively small (typically 330M-1B parameters), so a single RTX 6000 Pro 96 GB can process enormous volumes. On HF Endpoints the GPU sits mostly idle between API calls; on dedicated hardware with proper batching, utilisation stays above 80%.
| Monthly Documents | HF Endpoints | Dedicated GPU | Savings |
|---|---|---|---|
| 10,000 | ~$85 | ~$1,800 | HF cheaper |
| 100,000 | ~$850 | ~$1,800 | HF cheaper |
| 500,000 | ~$4,250 | ~$1,800 | 58% savings |
| 2,000,000 | ~$17,000 | ~$1,800 | 89% savings |
At scale, dedicated hardware is dramatically cheaper because NER models are small enough to leave massive GPU headroom for parallel processing. The GPU vs API cost comparison tool models these economics precisely.
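The break-even point falls out of simple arithmetic: the table implies HF Endpoints bill roughly $0.0085 per document (the rate is linear across all four rows), against a flat ~$1,800/month for the dedicated server:

```python
HF_COST_PER_DOC = 85 / 10_000   # ~$0.0085/doc, implied by the table
DEDICATED_MONTHLY = 1_800       # flat monthly cost of the dedicated GPU

# Break-even volume: where per-request billing equals the fixed cost
break_even_docs = DEDICATED_MONTHLY / HF_COST_PER_DOC
print(f"Break-even at ~{break_even_docs:,.0f} documents/month")
# → Break-even at ~211,765 documents/month
```

Below roughly 212,000 documents a month the endpoint is cheaper; above it, every additional document widens the gap in the dedicated server's favour.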
## Entity Extraction as a Core Competency
If your product depends on NER — whether that’s contract analysis, medical record parsing, or financial document processing — the extraction pipeline should be infrastructure you own, not an API you rent. Self-hosted NER on dedicated GPUs gives you the freedom to fine-tune models on domain data, chain NER with downstream LLM processing via vLLM, and keep sensitive documents on private infrastructure.
Browse open-source model hosting for deploying specialised NER models, check the LLM cost calculator for estimates, or explore the tutorials section and cost analysis guides for more migration patterns.
## Extract Entities From Millions of Documents at Fixed Cost
Self-hosted NER on GigaGPU dedicated GPUs processes thousands of documents per second. No per-request fees, no truncation limits, full post-processing control.
Browse GPU Servers

Filed under: Tutorials