Embeddings are the infrastructure layer of modern AI: every retrieval, clustering, deduplication, semantic search, and memory feature burns through them. Hosting an OpenAI-compatible embedding endpoint on the RTX 5060 Ti 16GB via UK dedicated GPU hosting delivers 10,200 BGE-base texts per second on a single Blackwell card – enough to saturate the ingestion side of every RAG system we typically see.
Contents
- Model choice
- Throughput table
- OpenAI-compatible interface
- Cost vs ada-002 and competitors
- Operational notes
Model choice
| Model | Dim | VRAM (FP16) | MTEB avg | Notes |
|---|---|---|---|---|
| BGE-base-en-v1.5 | 768 | 0.9 GB | 63.6 | Sweet spot throughput/quality |
| BGE-M3 | 1024 | 2.3 GB | – | Multilingual, dense + sparse + ColBERT |
| Nomic Embed v1.5 | 768 (Matryoshka) | 1.1 GB | 62.4 | Truncatable to 64/128/256 dims |
| mxbai-embed-large | 1024 | 1.4 GB | 64.7 | Top English MTEB |
| Qwen3-Embedding-0.6B | 1024 | 1.3 GB | 65.9 | Strong multilingual |
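Nomic Embed v1.5's Matryoshka property means the first N components of a vector are themselves a usable lower-dimensional embedding once re-normalised. A minimal sketch of that truncation, using a placeholder vector rather than a real model output:

```python
import math

def truncate_matryoshka(vec, dims=256):
    """Keep the first `dims` components of a Matryoshka embedding and
    re-normalise to unit length so cosine similarity still works."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# 768-dim placeholder standing in for a real Nomic Embed v1.5 vector
full = [1.0 if i % 7 == 0 else 0.1 for i in range(768)]
short = truncate_matryoshka(full, 256)  # 256 dims, unit length
```

Dropping from 768 to 256 dims cuts vector-store size by two-thirds at a modest recall cost, which is the main reason to pick Nomic over BGE-base at equal MTEB scores.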
Throughput table
| Model | Texts/sec | Daily (16 h) | p99 latency |
|---|---|---|---|
| BGE-base-en-v1.5 | 10,200 | 587M | 14 ms |
| BGE-M3 dense | 3,400 | 195M | 28 ms |
| Nomic Embed v1.5 | 8,800 | 506M | 16 ms |
| mxbai-embed-large | 4,600 | 264M | 22 ms |
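The "Daily (16 h)" column is just the sustained rate over a 16-hour indexing window, which you can sanity-check directly:

```python
def daily_texts(texts_per_sec: int, hours: int = 16) -> int:
    """Total texts embedded over an `hours`-long indexing window."""
    return texts_per_sec * hours * 3600

bge_base_daily = daily_texts(10_200)  # 587,520,000, i.e. ~587M as in the table
```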
See embedding throughput benchmark for the tuning profile behind these numbers.
OpenAI-compatible interface
Either TEI or Infinity can expose a /v1/embeddings endpoint compatible with the OpenAI Python SDK. Point your existing code at the dedicated server by changing base_url and model; no other client logic changes.
```python
from openai import OpenAI

client = OpenAI(base_url="https://embed.example.com/v1", api_key="...")
resp = client.embeddings.create(
    model="bge-base-en-v1.5",
    input=["text one", "text two"],
)
vec = resp.data[0].embedding
```
Cost vs ada-002 and competitors
| Provider | Price | 1B tokens/mo | 10B tokens/mo |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02/M tok | £16 | £158 |
| OpenAI text-embedding-3-large | $0.13/M tok | £103 | £1,030 |
| Voyage voyage-3 | $0.06/M tok | £47 | £474 |
| Cohere embed-v3 | $0.10/M tok | £79 | £790 |
| Self-hosted 5060 Ti | Fixed | Fixed monthly | Fixed monthly |
OpenAI’s text-embedding-3-small is genuinely cheap – below a few billion tokens/month, its API is hard to beat on pure price. Self-hosting wins decisively at 10B+ tokens/month and for anything requiring UK data residency, custom model choice, multilingual coverage (BGE-M3) or rate-limit-free batch indexing.
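The break-even point can be estimated from the per-token prices above. A rough sketch, where the £250/month server price is a hypothetical placeholder (the actual hosting price is not quoted in this section) and the exchange rate matches the ≈0.79 $/£ used in the table:

```python
USD_TO_GBP = 0.79  # approximate rate consistent with the cost table

def breakeven_m_tokens(server_gbp_per_month: float, api_usd_per_m_tok: float) -> float:
    """Monthly volume (in millions of tokens) where a fixed-price server
    matches a per-token API bill."""
    gbp_per_m_tok = api_usd_per_m_tok * USD_TO_GBP
    return server_gbp_per_month / gbp_per_m_tok

# Hypothetical £250/month box vs text-embedding-3-small at $0.02/M tok
m_tokens = breakeven_m_tokens(250, 0.02)  # ~15,800M, i.e. ~15.8B tokens/month
```

Against text-embedding-3-small the crossover lands in the mid-teens of billions of tokens per month, consistent with the "self-hosting wins at 10B+" claim; against 3-large or Cohere it arrives an order of magnitude sooner.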
Operational notes
Batch incoming requests to 64-128 texts, enable dynamic padding in TEI, pin the process to the GPU via CUDA_VISIBLE_DEVICES, and monitor tei_request_duration_seconds in Prometheus. Co-locate a BGE-reranker-base (3,200 pairs/sec) on the same card for a complete retrieval tier – see our reranker server.
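The batching advice above is easy to apply client-side before the texts ever hit /v1/embeddings. A minimal chunker (the helper name and the 128 default are illustrative; the 64–128 range comes from the note above):

```python
def batches(texts, size=128):
    """Yield fixed-size chunks so each embeddings call carries a full batch."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

docs = [f"doc {i}" for i in range(300)]
sizes = [len(b) for b in batches(docs)]  # [128, 128, 44]
```

Full batches keep the GPU's dynamic-padding buckets dense, which is where most of the 10,200 texts/sec figure comes from.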
Embedding API on Blackwell 16GB
10,200 texts/sec OpenAI-compatible. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: reranker throughput, reranker API, SaaS RAG, vLLM setup.