Embedding 10 million documents (averaging 500 tokens each) through OpenAI’s text-embedding-3-large costs roughly $1,300. Running the same job with BGE-large-en-v1.5 on a dedicated RTX 5090 costs approximately $14 in GPU time. For RAG systems, semantic search engines, and recommendation pipelines that re-embed regularly, that difference compounds into tens of thousands of dollars annually.
Embedding Cost Drivers
Three factors determine embedding cost: the number of documents, average token length per document, and how frequently you re-embed. Initial corpus embedding is a one-time cost, but production systems re-embed on every document update, run nightly re-indexing jobs, and process real-time ingestion streams. A system ingesting 50,000 new documents daily racks up $195 per month at API rates on embeddings alone, before any retrieval or generation costs. Normalising everything to cost per million documents makes this overhead easy to compare across models.
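The arithmetic behind that $195/month figure is worth making explicit. A minimal sketch, assuming the $130-per-million-documents rate from the table below (the function name and signature are illustrative, not from any library):

```python
def monthly_embedding_cost(docs_per_day: float,
                           price_per_million_docs: float,
                           days: int = 30) -> float:
    """Monthly API embedding spend for a steady ingestion stream."""
    docs_per_month = docs_per_day * days
    return docs_per_month / 1_000_000 * price_per_million_docs

# 50,000 new docs/day at $130 per 1M docs (text-embedding-3-large,
# 500-token average, rate from the table below)
print(monthly_embedding_cost(50_000, 130.0))  # 195.0
```

Swap in the per-million-document rate of any model from the table to compare providers on the same ingestion profile.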
Embedding Cost per Million Documents
| Model | Deployment | Dimensions | Cost per 1M Docs | Throughput (docs/sec) |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | API | 3,072 | $130.00 | ~500 |
| OpenAI text-embedding-3-small | API | 1,536 | $20.00 | ~800 |
| Cohere embed-english-v3 | API | 1,024 | $100.00 | ~400 |
| BGE-large-en-v1.5 | RTX 5090 | 1,024 | $1.40 | 1,200 |
| BGE-large-en-v1.5 | RTX 6000 Pro 96 GB | 1,024 | $2.10 | 2,800 |
| E5-large-v2 | RTX 5090 | 1,024 | $1.25 | 1,350 |
| BGE-small-en-v1.5 | RTX 5090 | 384 | $0.45 | 3,800 |
| GTE-large | RTX 5090 | 1,024 | $1.50 | 1,100 |
Self-hosted costs at GigaGPU monthly rates. Document length averaged at 500 tokens.
Break-Even Analysis
At the smallest scale, a one-time embedding of 100,000 documents, the API is cheaper because you avoid the monthly GPU commitment. But the break-even threshold arrives quickly. At the $180/month RTX 5090 rental rate used in the scenarios below, a dedicated GPU undercuts OpenAI’s small embedding model ($20 per 1M documents) once monthly volume, including re-indexing and new ingestion, passes 9 million documents. Against the large model ($130 per 1M documents), break-even arrives at roughly 1.4 million documents per month.
The total cost of ownership analysis confirms that sustained embedding workloads favour dedicated infrastructure within the first billing cycle.
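The break-even point is simply the fixed monthly GPU cost divided by the API's per-document rate. A sketch using the rates quoted above (the function name is illustrative; the $180/month rental figure comes from the scenarios table):

```python
def break_even_docs_per_month(gpu_monthly_cost: float,
                              api_price_per_million_docs: float) -> float:
    """Docs/month above which a fixed-price GPU beats per-doc API billing."""
    return gpu_monthly_cost / api_price_per_million_docs * 1_000_000

# $180/month RTX 5090 rental vs the API rates in the table above
print(break_even_docs_per_month(180, 130.0))  # vs text-embedding-3-large, ~1.4M docs
print(break_even_docs_per_month(180, 20.0))   # vs text-embedding-3-small, 9M docs
```

Below the threshold, pay per document; above it, every additional document on the GPU is effectively free.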
Optimising Embedding Throughput
Maximising documents per second directly reduces amortised cost. Batch size is the primary lever: embedding models process batches of 32 to 128 documents simultaneously on a GPU. Using ONNX Runtime instead of raw PyTorch boosts throughput by 30-50% on the same hardware. Quantised embedding models (INT8) deliver nearly identical retrieval quality at roughly twice the throughput. The cheapest GPU for embeddings is often an RTX 5090 because embedding models rarely exceed 2 GB of VRAM, leaving headroom for very large batch sizes.
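Batching is the part of this you control in application code. A minimal sketch of grouping a document stream into fixed-size batches before handing each batch to an embedding call (the `batched` helper and the `model.encode` call in the comment are illustrative assumptions, not a specific library's API):

```python
from typing import Iterable, Iterator, List

def batched(docs: Iterable[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Group a document stream into fixed-size batches for GPU embedding."""
    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Hypothetical driver loop; model.encode stands in for any batch-capable
# embedding call (e.g. a Sentence Transformers model's encode method):
# for chunk in batched(corpus, batch_size=128):
#     vectors = model.encode(chunk)
```

Larger batches keep the GPU saturated; tune the batch size upward until throughput plateaus or VRAM is exhausted.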
Real-World Embedding Scenarios
| Use Case | Docs/Month | API Cost/Month | Self-Hosted/Month | Annual Savings |
|---|---|---|---|---|
| SaaS knowledge base | 500K | $65 | $180 (GPU rental) | -$1,380 |
| E-commerce search | 5M | $650 | $180 | $5,640 |
| Legal document search | 20M | $2,600 | $180 | $29,040 |
| Enterprise RAG platform | 100M | $13,000 | $540 (3x RTX 5090) | $149,520 |
API costs based on OpenAI text-embedding-3-large. At low volumes the self-hosted GPU only pays off when shared with other workloads, which is why the first row shows negative savings.
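The savings column above is the monthly gap annualised. A one-line sketch (function name illustrative), reproducing two rows from the table:

```python
def annual_savings(api_monthly: float, self_hosted_monthly: float) -> float:
    """Yearly saving from moving a workload off per-document API billing."""
    return (api_monthly - self_hosted_monthly) * 12

print(annual_savings(2_600, 180))  # legal document search row: 29040
print(annual_savings(65, 180))     # SaaS knowledge base row: -1380 (API wins)
```

A negative result means the workload is still below break-even and the API remains the cheaper option.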
Deploy Embeddings on GigaGPU
Run your embedding pipeline on GigaGPU dedicated GPU hosting with zero per-token charges and unlimited throughput. Deploy BGE, E5, or any Sentence Transformers model alongside your LLM inference stack on the same server for a complete RAG pipeline.
Estimate your embedding spend with the LLM cost calculator, compare architectures with the GPU vs API comparison, or explore open-source hosting for turnkey deployments. Data-sensitive workloads benefit from private AI hosting with UK-based isolation. Find more cost analyses on the cost blog.