What You’ll Build
In 30 minutes, you will have a production embedding API that accepts text inputs and returns dense vector representations for semantic search, clustering, and retrieval-augmented generation. With a model like BGE-large or E5-large on a dedicated GPU server, the API generates around 10,000 embeddings per second, powering search across millions of documents at zero per-request cost.
Cloud embedding APIs charge $0.02-$0.13 per million tokens. Building a semantic search index over 10 million documents means significant upfront embedding costs, and every document update triggers additional charges. Self-hosted embeddings on GPU hardware make it economical to re-index frequently, experiment with different models, and embed at scales that would be prohibitively expensive through cloud APIs.
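To make that cost gap concrete, here is a rough back-of-envelope calculation; the 500 tokens-per-document figure is an assumption for illustration:

```python
docs = 10_000_000
tokens_per_doc = 500                  # assumption: average document length
price_low, price_high = 0.02, 0.13    # $ per million tokens (range above)

total_tokens = docs * tokens_per_doc              # 5 billion tokens
cost_low = total_tokens / 1e6 * price_low         # ~$100 per full index pass
cost_high = total_tokens / 1e6 * price_high       # ~$650 per full index pass
```

And that is per pass: every re-index or model swap repeats the charge on a cloud API, while the self-hosted cost stays fixed at the hardware.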
Architecture Overview
The API serves embedding generation through a FastAPI service backed by a sentence-transformers model on GPU. Requests accept single texts or batches of up to 1,000 texts per call. The model generates fixed-dimension dense vectors (768 or 1024 dimensions depending on model choice) normalised for cosine similarity search. An optional integration layer pushes generated embeddings directly into a vector database like Qdrant, Milvus, or pgvector.
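Normalisation is what makes plain dot products usable for search: on unit-length vectors, dot product and cosine similarity coincide, which is why the API returns normalised embeddings. A minimal numpy check:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (L2 norm of 1)."""
    return v / np.linalg.norm(v)

a, b = np.array([3.0, 4.0]), np.array([1.0, 2.0])

# Cosine similarity the long way, and as a dot product of unit vectors.
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_normalized = float(np.dot(normalize(a), normalize(b)))
# The two agree: with normalised embeddings, the vector database only
# needs a dot product at query time.
```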
The API format mirrors the OpenAI embeddings endpoint, so existing RAG pipelines and search integrations work by changing the base URL. Pair with a vLLM inference server to build complete retrieval-augmented generation — embed documents, search for relevant context, and generate answers on the same GPU.
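Because the request and response shapes mirror OpenAI's, a plain HTTP client is enough to switch over. A hypothetical client sketch, assuming the server from this guide runs at `localhost:8000` and serves the `bge-large` model (both placeholders for your deployment):

```python
def build_embedding_request(texts: list[str], model: str = "bge-large") -> dict:
    """OpenAI-style request body: the same shape the official SDK sends."""
    return {"model": model, "input": texts}

def fetch_embeddings(base_url: str, texts: list[str],
                     model: str = "bge-large") -> list[list[float]]:
    """POST to the self-hosted endpoint; returns one vector per input text."""
    import requests  # third-party; any HTTP client works the same way
    resp = requests.post(f"{base_url}/v1/embeddings",
                         json=build_embedding_request(texts, model),
                         timeout=30)
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

# Usage:
# vectors = fetch_embeddings("http://localhost:8000", ["hello world"])
```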
GPU Requirements
| Model | Recommended GPU | GPU VRAM | Throughput |
|---|---|---|---|
| BGE-large (335M) | RTX 5090 | 32 GB | ~10k texts/sec |
| E5-large-v2 (335M) | RTX 5090 | 32 GB | ~8k texts/sec |
| BGE-M3 (568M) | RTX 6000 Pro | 96 GB | ~6k texts/sec |
Embedding models are compact: most fit within 2-4 GB of VRAM, leaving room to co-host an LLM for complete RAG pipelines on a single card. Batch size is the primary throughput lever, since larger batches saturate GPU compute more efficiently. See our self-hosted LLM guide for RAG architecture patterns.
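On the client side, large corpora need to be chunked so each request stays under the per-call cap while the server keeps GPU batches full. A minimal sketch (the 1,000-text cap matches the limit described above):

```python
def batched(items: list, batch_size: int = 1000):
    """Split a corpus into request-sized chunks: the API caps one call at
    1,000 texts, and the server batches again internally for the GPU."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage: issue one API call per chunk when indexing a large corpus.
# for chunk in batched(corpus, 1000):
#     send chunk to /v1/embeddings/index
```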
Step-by-Step Build
Deploy the embedding model on your GPU server and build the API with batch support and OpenAI-compatible formatting.
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

class EmbeddingRequest(BaseModel):
    input: list[str]          # OpenAI-style field names for drop-in clients
    model: str = "bge-large"

@app.post("/v1/embeddings")
def create_embeddings(req: EmbeddingRequest):
    # Plain `def` (not `async def`) lets FastAPI run the blocking encode()
    # in a worker thread instead of stalling the event loop.
    if len(req.input) > 1000:
        raise HTTPException(status_code=400,
                            detail="Maximum 1,000 texts per request")
    embeddings = model.encode(
        req.input,
        normalize_embeddings=True,  # unit vectors: dot product == cosine
        batch_size=256,
        show_progress_bar=False,
    )
    data = [
        {"object": "embedding", "index": i, "embedding": emb.tolist()}
        for i, emb in enumerate(embeddings)
    ]
    # Whitespace word count approximates token usage; swap in the model's
    # tokenizer for exact figures.
    token_count = sum(len(t.split()) for t in req.input)
    return {
        "object": "list",
        "data": data,
        "model": req.model,
        "usage": {"prompt_tokens": token_count, "total_tokens": token_count},
    }

class IndexRequest(BaseModel):
    texts: list[str]
    collection: str
    ids: list[str] | None = None

@app.post("/v1/embeddings/index")
def embed_and_index(req: IndexRequest):
    embeddings = model.encode(req.texts, normalize_embeddings=True,
                              batch_size=256)
    # Push to the vector database (implement for your store, e.g. Qdrant,
    # Milvus, or pgvector)
    upsert_to_vector_db(req.collection, req.ids, embeddings, req.texts)
    return {"indexed": len(req.texts), "collection": req.collection}
```
Add an indexing endpoint that generates embeddings and pushes them directly to your vector store in one call. For RAG pipelines, pair with an AI chatbot that retrieves relevant context from the vector store before generating answers. See production setup for high-throughput batch indexing patterns.
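The `upsert_to_vector_db` helper is left to your vector store. A hedged sketch for Qdrant, assuming a local instance on port 6333 and the `qdrant-client` package; collection setup is omitted:

```python
import uuid

def upsert_to_vector_db(collection: str, ids, embeddings, texts) -> int:
    """Hypothetical helper: push normalised vectors into a Qdrant collection.

    Qdrant point ids must be unsigned integers or UUID strings, so missing
    ids are filled with fresh UUIDs.
    """
    from qdrant_client import QdrantClient
    from qdrant_client.models import PointStruct

    client = QdrantClient(url="http://localhost:6333")
    points = [
        PointStruct(
            id=ids[i] if ids else str(uuid.uuid4()),
            vector=emb.tolist(),
            payload={"text": texts[i]},  # keep the source text for retrieval
        )
        for i, emb in enumerate(embeddings)
    ]
    client.upsert(collection_name=collection, points=points)
    return len(points)
```

The same shape translates directly to Milvus or pgvector; only the client calls change.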
Search Quality and Optimisation
Different embedding models excel at different tasks. BGE-large leads on retrieval benchmarks for English. BGE-M3 handles multilingual search across 100+ languages. E5 models perform well on symmetric search where queries and documents have similar lengths. Test multiple models against your specific search queries to find the best match for your domain.
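Comparing models is easiest with a fixed metric over your own queries. A minimal recall@k sketch for normalised vectors; the labelled query-document pairs are assumed to come from your search logs:

```python
import numpy as np

def recall_at_k(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                relevant: list[int], k: int = 5) -> float:
    """Fraction of queries whose relevant document appears in the top-k
    results. Vectors are assumed L2-normalised, so the similarity matrix
    is a plain dot product."""
    sims = query_vecs @ doc_vecs.T            # (n_queries, n_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of k best docs per query
    hits = sum(relevant[i] in topk[i] for i in range(len(relevant)))
    return hits / len(relevant)
```

Run the same labelled queries through each candidate model and compare the scores before committing to one.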
For domain-specific search, fine-tune the embedding model on your data using contrastive learning with query-document pairs from your search logs. A fine-tuned model typically improves retrieval accuracy by 10-20% on domain-specific queries compared to the general-purpose base model.
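A sketch of that fine-tuning loop using sentence-transformers' `MultipleNegativesRankingLoss`, which treats the other documents in each batch as negatives so no explicit negative mining is needed; model names, paths, and hyperparameters here are illustrative:

```python
def finetune_embedder(pairs: list[tuple[str, str]],
                      base_model: str = "BAAI/bge-large-en-v1.5",
                      output_path: str = "bge-large-finetuned",
                      epochs: int = 1) -> str:
    """Contrastive fine-tuning on (query, relevant_document) pairs."""
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[query, doc]) for query, doc in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=32)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)],
              epochs=epochs, warmup_steps=100)
    model.save(output_path)
    return output_path
```

Point the API's `SentenceTransformer(...)` load at the saved path to serve the fine-tuned model.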
Deploy Your Embedding API
A self-hosted embedding API makes semantic search economical at any scale — embed millions of documents, re-index freely, and experiment with models without per-request fees. Power search, RAG, recommendations, and clustering on your own infrastructure. Launch on GigaGPU dedicated GPU hosting and start embedding. Browse more API use cases and tutorials in our library.