The Embedding Tax Nobody Talks About
You probably didn’t notice it at first. At $0.13 per million tokens, OpenAI’s text-embedding-3-large seemed almost free. Then your RAG pipeline grew. You started re-embedding documents weekly for freshness. You added a second collection for customer tickets. A third for product descriptions. Suddenly you’re pushing 200 million tokens through the embeddings endpoint every month, and the bill reads $26 — still manageable. But now your team wants to embed your entire 10-million-document archive, and the maths changes fast: at roughly 400 tokens per document, that’s a one-time cost of over $500 just to build the initial index, plus ongoing costs every time you update it.
More critically, rate limits on the embeddings endpoint cap you at 10,000 requests per minute on the highest tier. When you’re batch-processing a backlog, that bottleneck turns a four-hour job into a multi-day crawl. Self-hosting your embeddings model on a dedicated GPU removes both the cost ceiling and the throughput wall. Here’s how to make the switch.
Prerequisites and Model Selection
Embedding models are far smaller than chat models, so your hardware requirements are modest. A single NVIDIA A30 or even an RTX 5090 can serve a production-grade embeddings pipeline for millions of documents per day.
| OpenAI Model | Open-Source Equivalent | Dimensions (open-source model) | GPU Memory Needed |
|---|---|---|---|
| text-embedding-3-small | BGE-base-en-v1.5 | 768 | ~1 GB |
| text-embedding-3-large | BGE-large-en-v1.5 / E5-large-v2 | 1024 | ~2 GB |
| text-embedding-ada-002 | GTE-large / Instructor-XL | 1024 / 768 | ~2-4 GB |
Before migrating, export a sample of 1,000 documents with their existing OpenAI embeddings. You’ll use these as a benchmark to validate that your self-hosted model produces comparable retrieval quality. Capture your current recall@10 and MRR metrics from your vector database queries.
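If you don’t already log these metrics, they are only a few lines of Python. In this sketch, `retrieved` and `relevant` are stand-ins for your own ranked query results and ground-truth document IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: the single relevant document is ranked second
retrieved = ["doc_7", "doc_3", "doc_9"]
relevant = {"doc_3"}
print(recall_at_k(retrieved, relevant))  # 1.0
print(mrr(retrieved, relevant))          # 0.5
```

Average both metrics over your 1,000 benchmark queries before and after the migration; a drop of more than a couple of points is a signal to revisit model choice or chunking.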
For most English-language RAG workloads, BGE-large-en-v1.5 is the drop-in choice — it consistently matches or beats text-embedding-ada-002 on the MTEB leaderboard while being fully open source under the MIT licence.
Step-by-Step Migration
Step 1: Provision a GPU server. Grab a GigaGPU dedicated server with at least 24 GB of VRAM. Even a single A30 is overkill for embedding-only workloads — you’ll have headroom for serving a chat model alongside it.
Step 2: Deploy the embedding model. The fastest route is Hugging Face’s Text Embeddings Inference (TEI) server, which is optimised for throughput:
```shell
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5 \
  --max-batch-tokens 16384
```
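Once the container is up, you can smoke-test it from Python. This sketch assumes the default host and port from the command above; TEI’s /embed endpoint takes a JSON body of the form `{"inputs": [...]}` and returns one vector per input string:

```python
import requests

TEI_URL = "http://localhost:8080/embed"  # host/port from the docker command above

def build_payload(texts: list[str]) -> dict:
    """TEI's /embed endpoint accepts {"inputs": [...]} and returns one vector per text."""
    return {"inputs": texts}

def embed(texts: list[str]) -> list[list[float]]:
    resp = requests.post(TEI_URL, json=build_payload(texts), timeout=30)
    resp.raise_for_status()
    return resp.json()

# With the server running:
# embed(["hello world"])  -> a list containing one 1024-dimensional vector (BGE-large)
```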
Step 3: Adapt your client code. TEI exposes an /embed endpoint. If your codebase uses the OpenAI Python client, you can either write a lightweight adapter or point LangChain at the server via its HuggingFaceEndpointEmbeddings class, which speaks TEI’s protocol natively:

```python
from langchain_huggingface import HuggingFaceEndpointEmbeddings

# Point LangChain at the TEI container started in Step 2
embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:8080")
```

(Note that LangChain’s HuggingFaceEmbeddings class, by contrast, loads the model in-process via sentence-transformers rather than calling a TEI server — an option if you’d rather skip the separate container.)
Step 4: Re-embed your document corpus. Batch-process your entire collection using the self-hosted model. On an RTX 6000 Pro, expect throughput of 3,000-5,000 documents per second for typical 512-token passages — your 10-million-document archive completes in under an hour.
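A minimal batch loop against the TEI /embed endpoint might look like the following; the URL and batch size are assumptions to tune for your hardware, and TEI performs its own token-level batching internally:

```python
import itertools
import requests

TEI_URL = "http://localhost:8080/embed"  # assumed TEI deployment from Step 2

def batched(items, size):
    """Yield successive fixed-size batches from any iterable."""
    it = iter(items)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def embed_corpus(texts: list[str], batch_size: int = 32) -> list[list[float]]:
    """Embed a corpus batch by batch; returns one vector per input text."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        resp = requests.post(TEI_URL, json={"inputs": chunk}, timeout=60)
        resp.raise_for_status()
        vectors.extend(resp.json())
    return vectors
```

For a multi-million-document backlog, run several of these loops concurrently (threads or asyncio) so the GPU stays saturated while requests are in flight.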
Step 5: Validate retrieval quality. Run your benchmark queries against the new embeddings and compare recall@10. Adjust chunking strategy if needed — self-hosted models let you experiment freely without per-token costs.
Handling API Compatibility
If you need exact OpenAI API compatibility (because your vector database or orchestration layer talks to the /v1/embeddings endpoint), you can use vLLM or a lightweight proxy like LiteLLM to translate requests. This means zero changes to your existing codebase — swap the base URL and you’re done.
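As a sketch of what that looks like on the wire, here is the OpenAI-format request such a proxy must accept, built with plain requests. The port 4000 and the model name are placeholders for your deployment; if you use the official OpenAI SDK instead, changing `base_url` in the client constructor is the equivalent one-line change:

```python
import requests

PROXY_URL = "http://localhost:4000/v1/embeddings"  # placeholder proxy address

def openai_style_payload(texts: list[str], model: str) -> dict:
    """Request body in the OpenAI /v1/embeddings wire format."""
    return {"model": model, "input": texts}

def embed_via_proxy(texts: list[str]) -> list[list[float]]:
    payload = openai_style_payload(texts, "BAAI/bge-large-en-v1.5")
    resp = requests.post(PROXY_URL, json=payload, timeout=30)
    resp.raise_for_status()
    # OpenAI-format responses carry vectors under data[i]["embedding"]
    return [item["embedding"] for item in resp.json()["data"]]
```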
One important difference: OpenAI normalises embedding vectors automatically. Most open-source models also return normalised vectors, but verify this with your chosen model. If not, add an L2 normalisation step before inserting into your vector store.
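If your model does turn out to return unnormalised vectors, the fix is a few lines of NumPy; after normalisation, the dot product of two vectors equals their cosine similarity, which is what most vector stores assume:

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against division by zero)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Quick check: both rows end up with norm 1.0
vecs = np.array([[3.0, 4.0], [1.0, 0.0]])
print(np.linalg.norm(l2_normalize(vecs), axis=1))  # [1. 1.]
```

The same check, run on a handful of vectors from your chosen model, tells you immediately whether the normalisation step is needed at all.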
For teams running their entire AI stack on-prem, consider hosting both your embedding model and your chat model on the same private AI server — a single RTX 6000 Pro 96 GB can comfortably serve both simultaneously.
Cost and Throughput Comparison
| Metric | OpenAI Embeddings | Self-Hosted (BGE-large on A30) |
|---|---|---|
| Cost per 1M tokens | $0.13 | ~$0.002 (amortised server cost) |
| Monthly cost at 500M tokens | $65 | ~$800 server / unlimited tokens |
| Monthly cost at 5B tokens | $650 | ~$800 server / unlimited tokens |
| Max throughput | 10,000 RPM | 50,000+ embeddings/sec |
| Batch re-indexing (10M docs) | ~16 hours (rate limited) | ~45 minutes |
| Data privacy | Sent to OpenAI | Never leaves your server |
The crossover point is clear: if you’re processing more than a few billion tokens per month, self-hosting pays for itself almost immediately. For smaller workloads, the real value is throughput — being able to re-index your entire corpus in minutes rather than waiting all day. Compare the numbers for your specific use case with the GPU vs API cost comparison tool.
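The crossover arithmetic can be checked in two lines using the figures from the table above (the $800/month server cost is this article’s estimate; substitute your own quote):

```python
# Back-of-the-envelope breakeven from the comparison table above
API_PRICE_PER_M = 0.13       # $ per 1M tokens (OpenAI column)
SERVER_COST_PER_MONTH = 800  # $ flat, self-hosted GPU server (estimate)

breakeven_tokens_m = SERVER_COST_PER_MONTH / API_PRICE_PER_M
print(f"Breakeven: ~{breakeven_tokens_m / 1000:.1f}B tokens/month")  # ~6.2B
```

Anything above roughly six billion tokens a month and the flat-rate server is cheaper outright; below that, you are paying for throughput and data privacy rather than raw token savings.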
Next Steps
Once your embeddings pipeline is running on dedicated hardware, you’ll likely want to self-host your LLM as well — after all, the embedding model is the cheapest component of a RAG stack. Check out our complete self-hosting guide for the chat model side, or read the breakeven analysis to see when your entire pipeline should move off API pricing. If you’re coming from OpenAI specifically, the OpenAI API alternative page lays out the full comparison, and our tutorials section has more migration walkthroughs including chatbot API migration.
Unlimited Embeddings, Fixed Monthly Price
Stop paying per token for embeddings. A GigaGPU dedicated GPU processes billions of tokens per month at a flat rate — with zero rate limits.
Browse GPU Servers

Filed under: Tutorials