Tutorials

Migrate from OpenAI to Self-Hosted: Embeddings Pipeline Guide

Move your OpenAI embeddings pipeline to a self-hosted GPU and slash per-vector costs to near zero while processing millions of documents without rate limits.

The Embedding Tax Nobody Talks About

You probably didn’t notice it at first. At $0.13 per million tokens, OpenAI’s text-embedding-3-large seemed almost free. Then your RAG pipeline grew. You started re-embedding documents weekly for freshness. You added a second collection for customer tickets. A third for product descriptions. Suddenly you’re pushing 200 million tokens through the embeddings endpoint every month, and the bill reads $26. Still manageable. But now your team wants to embed your entire 10-million-document archive, and the maths changes fast: that’s a one-time cost of over $500 just to build the initial index, plus ongoing costs every time you update it.
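The back-of-the-envelope maths is easy to check. A quick sketch, assuming an average of ~500 tokens per document (an illustrative figure, not a measurement):

```python
# Rough cost model for the 10M-document archive scenario above.
PRICE_PER_1M_TOKENS = 0.13       # USD, OpenAI embeddings pricing quoted above
DOCS = 10_000_000
AVG_TOKENS_PER_DOC = 500         # assumption; measure your own corpus

total_tokens = DOCS * AVG_TOKENS_PER_DOC                      # 5 billion tokens
one_time_cost = total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS

print(f"Initial index: {total_tokens / 1e9:.0f}B tokens -> ${one_time_cost:.0f}")
```

Swap in your own document count and average length; the per-update cost follows the same formula with the number of changed tokens.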

More critically, rate limits on the embeddings endpoint cap you at 10,000 requests per minute on the highest tier. When you’re batch-processing a backlog, that bottleneck turns a four-hour job into a multi-day crawl. Self-hosting your embeddings model on a dedicated GPU removes both the cost ceiling and the throughput wall. Here’s how to make the switch.

Prerequisites and Model Selection

Embedding models are far smaller than chat models, so your hardware requirements are modest. A single NVIDIA A30, or even a consumer card like an RTX 5090, can serve a production-grade embeddings pipeline handling millions of documents per day.

| OpenAI Model | Open-Source Equivalent | Dimensions | GPU Memory Needed |
| --- | --- | --- | --- |
| text-embedding-3-small | BGE-base-en-v1.5 | 768 | ~1 GB |
| text-embedding-3-large | BGE-large-en-v1.5 / E5-large-v2 | 1024 | ~2 GB |
| text-embedding-ada-002 | GTE-large / Instructor-XL | 1024 / 768 | ~2-4 GB |

Before migrating, export a sample of 1,000 documents with their existing OpenAI embeddings. You’ll use these as a benchmark to validate that your self-hosted model produces comparable retrieval quality. Capture your current recall@10 and MRR metrics from your vector database queries.
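If you don’t already log these metrics, they are straightforward to compute from ranked retrieval results. A minimal sketch (the function names and data shapes are illustrative, not from any particular library):

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(queries):
    """Mean Reciprocal Rank over (ranked_ids, relevant_ids) pairs."""
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Run the same 1,000 benchmark queries against the OpenAI index and the self-hosted one, and compare the two sets of numbers rather than chasing an absolute score.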

For most English-language RAG workloads, BGE-large-en-v1.5 is the drop-in choice — it consistently matches or beats text-embedding-ada-002 on the MTEB leaderboard while being fully open source under an MIT licence.

Step-by-Step Migration

Step 1: Provision a GPU server. Grab a GigaGPU dedicated server with at least 24 GB of VRAM. Even a single A30 is overkill for embedding-only workloads — you’ll have headroom for serving a chat model alongside it.

Step 2: Deploy the embedding model. The fastest route is using the TEI (Text Embeddings Inference) server from Hugging Face, which is optimised for throughput:

docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5 \
  --max-batch-tokens 16384

Step 3: Adapt your client code. TEI exposes an /embed endpoint, so the simplest route is a lightweight HTTP adapter. Note that LangChain’s HuggingFaceEmbeddings class does not talk to TEI: it loads the model in-process on your GPU, which is a reasonable alternative if you’d rather skip the server. (For a TEI-backed client, point LangChain’s HuggingFaceEndpointEmbeddings at the server URL instead.) The in-process version looks like this:

from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda"}
)
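If you’d rather not pull in LangChain at all, a small adapter over TEI’s /embed endpoint is only a few lines. A hedged sketch using the standard library (the URL and batch size are assumptions; check your TEI server’s client batch limits):

```python
import json
import urllib.request

TEI_URL = "http://localhost:8080/embed"  # matches the docker run above

def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed(texts, batch_size=32):
    """POST texts to TEI's /embed endpoint, batching client-side."""
    vectors = []
    for batch in chunked(texts, batch_size):
        req = urllib.request.Request(
            TEI_URL,
            data=json.dumps({"inputs": batch}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            vectors.extend(json.loads(resp.read()))  # list of float lists
    return vectors
```

Drop `embed` in wherever your code currently calls the OpenAI embeddings client, and the rest of the pipeline is unchanged.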

Step 4: Re-embed your document corpus. Batch-process your entire collection using the self-hosted model. On an RTX 6000 Pro, expect throughput of 3,000-5,000 documents per second for typical 512-token passages — your 10-million-document archive completes in under an hour.
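The time estimate works out like this (throughput figure taken from the middle of the range above; treat it as indicative, not guaranteed):

```python
DOCS = 10_000_000
DOCS_PER_SECOND = 4_000  # mid-point of the 3,000-5,000 range quoted above

seconds = DOCS / DOCS_PER_SECOND
print(f"~{seconds / 60:.0f} minutes")
```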

Step 5: Validate retrieval quality. Run your benchmark queries against the new embeddings and compare recall@10. Adjust chunking strategy if needed — self-hosted models let you experiment freely without per-token costs.

Handling API Compatibility

If you need exact OpenAI API compatibility (because your vector database or orchestration layer talks to the /v1/embeddings endpoint), you can use vLLM or a lightweight proxy like LiteLLM to translate requests. This means zero changes to your existing codebase — swap the base URL and you’re done.

One important difference: OpenAI normalises embedding vectors automatically. Most open-source models also return normalised vectors, but verify this with your chosen model. If not, add an L2 normalisation step before inserting into your vector store.
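Checking and fixing normalisation takes only a few lines; a standard-library sketch:

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (no-op for near-zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm < eps else [x / norm for x in vec]

def is_normalized(vec, tol=1e-3):
    """True if the vector already has (approximately) unit L2 norm."""
    return abs(math.sqrt(sum(x * x for x in vec)) - 1.0) < tol
```

A quick `is_normalized` check on a handful of vectors from your chosen model tells you whether the extra step is needed at all.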

For teams running their entire AI stack on-prem, consider hosting both your embedding model and your chat model on the same private AI server — a single RTX 6000 Pro 96 GB can comfortably serve both simultaneously.

Cost and Throughput Comparison

| Metric | OpenAI Embeddings | Self-Hosted (BGE-large on A30) |
| --- | --- | --- |
| Cost per 1M tokens | $0.13 | ~$0.002 (amortised server cost) |
| Monthly cost at 500M tokens | $65 | ~$800 server / unlimited tokens |
| Monthly cost at 5B tokens | $650 | ~$800 server / unlimited tokens |
| Max throughput | 10,000 RPM | 50,000+ embeddings/sec |
| Batch re-indexing (10M docs) | ~16 hours (rate limited) | ~45 minutes |
| Data privacy | Sent to OpenAI | Never leaves your server |

The crossover point is clear: if you’re processing more than a few billion tokens per month, self-hosting pays for itself almost immediately. For smaller workloads, the real value is throughput — being able to re-index your entire corpus in minutes rather than waiting all day. Compare the numbers for your specific use case with the GPU vs API cost comparison tool.
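The crossover itself is one division (server price taken from the table above; substitute your own figure):

```python
SERVER_PER_MONTH = 800.0        # USD, from the comparison table
PRICE_PER_1M_TOKENS = 0.13      # OpenAI embeddings pricing quoted above

breakeven_tokens = SERVER_PER_MONTH / PRICE_PER_1M_TOKENS * 1_000_000
print(f"Breakeven: ~{breakeven_tokens / 1e9:.1f}B tokens/month")
```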

Next Steps

Once your embeddings pipeline is running on dedicated hardware, you’ll likely want to self-host your LLM as well — after all, the embedding model is the cheapest component of a RAG stack. Check out our complete self-hosting guide for the chat model side, or read the breakeven analysis to see when your entire pipeline should move off API pricing. If you’re coming from OpenAI specifically, the OpenAI API alternative page lays out the full comparison, and our tutorials section has more migration walkthroughs including chatbot API migration.

Unlimited Embeddings, Fixed Monthly Price

Stop paying per token for embeddings. A GigaGPU dedicated GPU processes billions of tokens per month at a flat rate — with zero rate limits.

Browse GPU Servers




We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
