
Build Embedding API for Search on GPU

Build a production embedding API for semantic search on a dedicated GPU server. Generate dense vector embeddings for text, images, and documents with batch processing and real-time indexing — no per-request fees or data leaving your infrastructure.

What You’ll Build

In 30 minutes, you will have a production embedding API that accepts text inputs and returns dense vector representations for semantic search, clustering, and retrieval-augmented generation. Running models like BGE-large or E5-large on a dedicated GPU server, your API can generate around 10,000 embeddings per second, powering search across millions of documents at zero per-request cost.

Cloud embedding APIs charge $0.02-$0.13 per million tokens. Building a semantic search index over 10 million documents means significant upfront embedding costs, and every document update triggers additional charges. Self-hosted embeddings on GPU hardware make it economical to re-index frequently, experiment with different models, and embed at scales that would be prohibitively expensive through cloud APIs.
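A quick back-of-envelope check makes the scale concrete. The per-document token count below is an illustrative assumption (around 500 tokens, roughly a page of text), not a measured figure:

```python
# Rough cost of one full index pass through a cloud embedding API.
# Assumptions (illustrative): 10M documents, ~500 tokens per document.
docs = 10_000_000
tokens_per_doc = 500
total_tokens = docs * tokens_per_doc          # 5 billion tokens

for price_per_million in (0.02, 0.13):        # $/1M tokens, range cited above
    cost = total_tokens / 1_000_000 * price_per_million
    print(f"${cost:,.0f}")                    # $100 and $650 respectively
```

Every re-index repeats that spend; on self-hosted hardware the marginal cost of a re-index is electricity and time.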

Architecture Overview

The API serves embedding generation through a FastAPI service backed by a sentence-transformers model on GPU. Requests accept single texts or batches of up to 1,000 texts per call. The model generates fixed-dimension dense vectors (768 or 1024 dimensions depending on model choice) normalised for cosine similarity search. An optional integration layer pushes generated embeddings directly into a vector database like Qdrant, Milvus, or pgvector.
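The normalisation step matters: once vectors are scaled to unit length, cosine similarity reduces to a plain dot product, which is what vector databases score fastest. A minimal illustration with hypothetical toy vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit vectors, the dot product is the cosine similarity (range -1..1)
print(round(dot(a, b), 4))  # → 0.96
```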

The API format mirrors the OpenAI embeddings endpoint, so existing RAG pipelines and search integrations work by changing the base URL. Pair with a vLLM inference server to build complete retrieval-augmented generation — embed documents, search for relevant context, and generate answers on the same GPU.
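Because the formats match, the official OpenAI Python client can point at the self-hosted server unchanged. A sketch, assuming the service from the build step below is listening on localhost port 8000 (the address and the dummy API key are placeholders for your own deployment):

```python
from openai import OpenAI

# Point the standard OpenAI client at the self-hosted endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.embeddings.create(
    model="bge-large",
    input=["what is semantic search?"],
)
vector = response.data[0].embedding  # 1024-dim list of floats for BGE-large
```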

GPU Requirements

| Model | Recommended GPU | VRAM | Throughput |
|---|---|---|---|
| BGE-large (335M) | RTX 5090 | 24 GB | ~10k texts/sec |
| E5-large-v2 (335M) | RTX 5090 | 24 GB | ~8k texts/sec |
| BGE-M3 (568M) | RTX 6000 Pro | 40 GB | ~6k texts/sec |

Embedding models are compact: most fit within 2-4 GB of VRAM, leaving room to co-host an LLM for complete RAG pipelines on a single card. Batch size is the primary throughput lever: larger batches saturate GPU compute more efficiently. See our self-hosted LLM guide for RAG architecture patterns.
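To make the batching lever concrete, a corpus can be chunked into fixed-size batches before each encode call. A minimal sketch (the helper name and batch size are illustrative):

```python
def batched(texts, batch_size=256):
    """Yield successive fixed-size batches from a list of texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

corpus = [f"doc {i}" for i in range(1000)]
batches = list(batched(corpus, 256))
print(len(batches), len(batches[-1]))  # → 4 232
```

In practice sentence-transformers handles this internally via its batch_size argument; explicit chunking is mainly useful when streaming a large corpus from disk or a database cursor.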

Step-by-Step Build

Deploy the embedding model on your GPU server and build the API with batch support and OpenAI-compatible formatting.

from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

class EmbeddingRequest(BaseModel):
    input: list[str]
    model: str = "bge-large"

@app.post("/v1/embeddings")
def create_embeddings(request: EmbeddingRequest):
    # Plain def (not async) lets FastAPI run the blocking encode in a threadpool
    embeddings = model.encode(
        request.input,
        normalize_embeddings=True,  # unit-length vectors for cosine search
        batch_size=256,
        show_progress_bar=False,
    )

    data = [
        {"object": "embedding", "index": i,
         "embedding": emb.tolist()}
        for i, emb in enumerate(embeddings)
    ]

    # Approximate token counts with a whitespace split
    token_count = sum(len(t.split()) for t in request.input)
    return {
        "object": "list",
        "data": data,
        "model": request.model,
        "usage": {"prompt_tokens": token_count,
                  "total_tokens": token_count},
    }

class IndexRequest(BaseModel):
    texts: list[str]
    collection: str
    ids: list[str] | None = None

@app.post("/v1/embeddings/index")
def embed_and_index(request: IndexRequest):
    embeddings = model.encode(request.texts, normalize_embeddings=True,
                              batch_size=256)
    # Push to your vector database (Qdrant, Milvus, pgvector)
    upsert_to_vector_db(request.collection, request.ids,
                        embeddings, request.texts)
    return {"indexed": len(request.texts), "collection": request.collection}

Add an indexing endpoint that generates embeddings and pushes them directly to your vector store in one call. For RAG pipelines, pair with an AI chatbot that retrieves relevant context from the vector store before generating answers. See production setup for high-throughput batch indexing patterns.
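The retrieval half of a RAG pipeline is then a nearest-neighbour lookup: embed the query, score it against the stored document vectors, keep the top k. Vector databases accelerate this with approximate indexes; a minimal exact-search sketch over an in-memory index (toy 2-D unit vectors for illustration) shows the core operation:

```python
import heapq

def top_k(query_vec, index, k=3):
    """Return the k best matches by dot product (cosine, for unit vectors)."""
    scored = ((sum(q * d for q, d in zip(query_vec, vec)), doc_id)
              for doc_id, vec in index.items())
    return heapq.nlargest(k, scored)

# Toy index: doc id -> pre-normalised embedding
index = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.0, 1.0],
    "doc-c": [0.6, 0.8],
}
print(top_k([1.0, 0.0], index, k=2))  # → [(1.0, 'doc-a'), (0.6, 'doc-c')]
```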

Search Quality and Optimisation

Different embedding models excel at different tasks. BGE-large leads on retrieval benchmarks for English. BGE-M3 handles multilingual search across 100+ languages. E5 models perform well on symmetric search where queries and documents have similar lengths. Test multiple models against your specific search queries to find the best match for your domain.

For domain-specific search, fine-tune the embedding model on your data using contrastive learning with query-document pairs from your search logs. A fine-tuned model typically improves retrieval accuracy by 10-20% on domain-specific queries compared to the general-purpose base model.
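A hedged sketch of that fine-tuning loop, using sentence-transformers' MultipleNegativesRankingLoss (which treats the other documents in each batch as negatives); the query-document pairs shown are placeholders for pairs mined from your own search logs:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

# (query, clicked document) pairs from search logs -- placeholder examples
train_examples = [
    InputExample(texts=["gpu server pricing", "Our dedicated GPU plans start at..."]),
    InputExample(texts=["reset api key", "To rotate your API key, open settings..."]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
# In-batch negatives: every other document in the batch is a negative
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("bge-large-finetuned")
```

Evaluate the fine-tuned model against a held-out set of query-document pairs before swapping it into production; gains vary by domain.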

Deploy Your Embedding API

A self-hosted embedding API makes semantic search economical at any scale — embed millions of documents, re-index freely, and experiment with models without per-request fees. Power search, RAG, recommendations, and clustering on your own infrastructure. Launch on GigaGPU dedicated GPU hosting and start embedding. Browse more API use cases and tutorials in our library.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
