What You’ll Build
In 30 minutes, you will have a production embedding API that accepts text inputs and returns dense vector representations for semantic search, clustering, and retrieval-augmented generation. With a model like BGE-large or E5-large on a dedicated GPU server, the API generates around 10,000 embeddings per second, powering search across millions of documents at zero per-request cost.
Cloud embedding APIs charge $0.02-$0.13 per million tokens. Building a semantic search index over 10 million documents means significant upfront embedding costs, and every document update triggers additional charges. Self-hosted embeddings on GPU hardware make it economical to re-index frequently, experiment with different models, and embed at scales that would be prohibitively expensive through cloud APIs.
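To make that cost gap concrete, here is a rough back-of-envelope calculation; the 500 tokens-per-document figure is an assumption for illustration:

```python
docs = 10_000_000
tokens_per_doc = 500                  # assumption: average document length
price_low, price_high = 0.02, 0.13    # $ per million tokens (range above)

total_tokens = docs * tokens_per_doc              # 5 billion tokens
cost_low = total_tokens / 1e6 * price_low         # ~$100 per full index pass
cost_high = total_tokens / 1e6 * price_high       # ~$650 per full index pass
```

And that is per pass: every re-index or model swap repeats the charge on a cloud API, while the self-hosted cost stays fixed at the hardware.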
Architecture Overview
The API serves embedding generation through a FastAPI service backed by a sentence-transformers model on GPU. Requests accept single texts or batches of up to 1,000 texts per call. The model generates fixed-dimension dense vectors (768 or 1024 dimensions depending on model choice) normalised for cosine similarity search. An optional integration layer pushes generated embeddings directly into a vector database like Qdrant, Milvus, or pgvector.
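Normalisation is what makes plain dot products usable for search: on unit-length vectors, dot product and cosine similarity coincide, which is why the API returns normalised embeddings. A minimal numpy check:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (L2 norm of 1)."""
    return v / np.linalg.norm(v)

a, b = np.array([3.0, 4.0]), np.array([1.0, 2.0])

# Cosine similarity the long way, and as a dot product of unit vectors.
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_normalized = float(np.dot(normalize(a), normalize(b)))
# The two agree: with normalised embeddings, the vector database only
# needs a dot product at query time.
```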
The API format mirrors the OpenAI embeddings endpoint, so existing RAG pipelines and search integrations work by changing the base URL. Pair with a vLLM inference server to build complete retrieval-augmented generation — embed documents, search for relevant context, and generate answers on the same GPU.
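Because the request and response shapes mirror OpenAI's, a plain HTTP client is enough to switch over. A hypothetical client sketch, assuming the server from this guide runs at `localhost:8000` and serves the `bge-large` model (both placeholders for your deployment):

```python
def build_embedding_request(texts: list[str], model: str = "bge-large") -> dict:
    """OpenAI-style request body: the same shape the official SDK sends."""
    return {"model": model, "input": texts}

def fetch_embeddings(base_url: str, texts: list[str],
                     model: str = "bge-large") -> list[list[float]]:
    """POST to the self-hosted endpoint; returns one vector per input text."""
    import requests  # third-party; any HTTP client works the same way
    resp = requests.post(f"{base_url}/v1/embeddings",
                         json=build_embedding_request(texts, model),
                         timeout=30)
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

# Usage:
# vectors = fetch_embeddings("http://localhost:8000", ["hello world"])
```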
GPU Requirements
| Model | Recommended GPU | GPU VRAM | Throughput |
|---|---|---|---|
| BGE-large (335M) | RTX 5090 | 32 GB | ~10k texts/sec |
| E5-large-v2 (335M) | RTX 5090 | 32 GB | ~8k texts/sec |
| BGE-M3 (568M) | RTX 6000 Pro | 96 GB | ~6k texts/sec |
Embedding models are compact: most fit within 2-4 GB of VRAM, leaving room to co-host an LLM for complete RAG pipelines on a single card. Batch size is the primary throughput lever, since larger batches saturate GPU compute more efficiently. See our self-hosted LLM guide for RAG architecture patterns.
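On the client side, large corpora need to be chunked so each request stays under the per-call cap while the server keeps GPU batches full. A minimal sketch (the 1,000-text cap matches the limit described above):

```python
def batched(items: list, batch_size: int = 1000):
    """Split a corpus into request-sized chunks: the API caps one call at
    1,000 texts, and the server batches again internally for the GPU."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage: issue one API call per chunk when indexing a large corpus.
# for chunk in batched(corpus, 1000):
#     send chunk to /v1/embeddings/index
```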
Step-by-Step Build
Deploy the embedding model on your GPU server and build the API with batch support and OpenAI-compatible formatting.
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

class EmbeddingRequest(BaseModel):
    input: list[str]          # OpenAI-style field names for drop-in clients
    model: str = "bge-large"

@app.post("/v1/embeddings")
def create_embeddings(req: EmbeddingRequest):
    # Plain `def` (not `async def`) lets FastAPI run the blocking encode()
    # in a worker thread instead of stalling the event loop.
    if len(req.input) > 1000:
        raise HTTPException(status_code=400,
                            detail="Maximum 1,000 texts per request")
    embeddings = model.encode(
        req.input,
        normalize_embeddings=True,  # unit vectors: dot product == cosine
        batch_size=256,
        show_progress_bar=False,
    )
    data = [
        {"object": "embedding", "index": i, "embedding": emb.tolist()}
        for i, emb in enumerate(embeddings)
    ]
    # Whitespace word count approximates token usage; swap in the model's
    # tokenizer for exact figures.
    token_count = sum(len(t.split()) for t in req.input)
    return {
        "object": "list",
        "data": data,
        "model": req.model,
        "usage": {"prompt_tokens": token_count, "total_tokens": token_count},
    }

class IndexRequest(BaseModel):
    texts: list[str]
    collection: str
    ids: list[str] | None = None

@app.post("/v1/embeddings/index")
def embed_and_index(req: IndexRequest):
    embeddings = model.encode(req.texts, normalize_embeddings=True,
                              batch_size=256)
    # Push to the vector database (implement for your store, e.g. Qdrant,
    # Milvus, or pgvector)
    upsert_to_vector_db(req.collection, req.ids, embeddings, req.texts)
    return {"indexed": len(req.texts), "collection": req.collection}
```
Add an indexing endpoint that generates embeddings and pushes them directly to your vector store in one call. For RAG pipelines, pair with an AI chatbot that retrieves relevant context from the vector store before generating answers. See production setup for high-throughput batch indexing patterns.
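The `upsert_to_vector_db` helper is left to your vector store. A hedged sketch for Qdrant, assuming a local instance on port 6333 and the `qdrant-client` package; collection setup is omitted:

```python
import uuid

def upsert_to_vector_db(collection: str, ids, embeddings, texts) -> int:
    """Hypothetical helper: push normalised vectors into a Qdrant collection.

    Qdrant point ids must be unsigned integers or UUID strings, so missing
    ids are filled with fresh UUIDs.
    """
    from qdrant_client import QdrantClient
    from qdrant_client.models import PointStruct

    client = QdrantClient(url="http://localhost:6333")
    points = [
        PointStruct(
            id=ids[i] if ids else str(uuid.uuid4()),
            vector=emb.tolist(),
            payload={"text": texts[i]},  # keep the source text for retrieval
        )
        for i, emb in enumerate(embeddings)
    ]
    client.upsert(collection_name=collection, points=points)
    return len(points)
```

The same shape translates directly to Milvus or pgvector; only the client calls change.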
Search Quality and Optimisation
Different embedding models excel at different tasks. BGE-large leads on retrieval benchmarks for English. BGE-M3 handles multilingual search across 100+ languages. E5 models perform well on symmetric search where queries and documents have similar lengths. Test multiple models against your specific search queries to find the best match for your domain.
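Comparing models is easiest with a fixed metric over your own queries. A minimal recall@k sketch for normalised vectors; the labelled query-document pairs are assumed to come from your search logs:

```python
import numpy as np

def recall_at_k(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                relevant: list[int], k: int = 5) -> float:
    """Fraction of queries whose relevant document appears in the top-k
    results. Vectors are assumed L2-normalised, so the similarity matrix
    is a plain dot product."""
    sims = query_vecs @ doc_vecs.T            # (n_queries, n_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of k best docs per query
    hits = sum(relevant[i] in topk[i] for i in range(len(relevant)))
    return hits / len(relevant)
```

Run the same labelled queries through each candidate model and compare the scores before committing to one.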
For domain-specific search, fine-tune the embedding model on your data using contrastive learning with query-document pairs from your search logs. A fine-tuned model typically improves retrieval accuracy by 10-20% on domain-specific queries compared to the general-purpose base model.
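A sketch of that fine-tuning loop using sentence-transformers' `MultipleNegativesRankingLoss`, which treats the other documents in each batch as negatives so no explicit negative mining is needed; model names, paths, and hyperparameters here are illustrative:

```python
def finetune_embedder(pairs: list[tuple[str, str]],
                      base_model: str = "BAAI/bge-large-en-v1.5",
                      output_path: str = "bge-large-finetuned",
                      epochs: int = 1) -> str:
    """Contrastive fine-tuning on (query, relevant_document) pairs."""
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[query, doc]) for query, doc in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=32)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)],
              epochs=epochs, warmup_steps=100)
    model.save(output_path)
    return output_path
```

Point the API's `SentenceTransformer(...)` load at the saved path to serve the fine-tuned model.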
Deploy Your Embedding API
A self-hosted embedding API makes semantic search economical at any scale — embed millions of documents, re-index freely, and experiment with models without per-request fees. Power search, RAG, recommendations, and clustering on your own infrastructure. Launch on GigaGPU dedicated GPU hosting and start embedding. Browse more API use cases and tutorials in our library.