A modern search backend combines lexical BM25, dense vector retrieval and a neural reranker, optionally topped with an LLM-generated answer. The whole stack fits on the RTX 5060 Ti 16GB at our UK dedicated GPU hosting: TEI (Text Embeddings Inference) for the embedder, Qdrant for the HNSW index, BGE reranker for precision and Llama 3.1 8B FP8 for answers. Blackwell’s 4608 CUDA cores, 16 GB of GDDR7 and native FP8 deliver around 10,000 BGE-base embeddings per second and 112 t/s Llama generation on one card.
Stack
| Component | Role | Host |
|---|---|---|
| TEI | BGE-M3 or BGE-base embeddings as a service | GPU (5060 Ti) |
| Qdrant or Weaviate | HNSW vector index + metadata filter | CPU + SSD |
| OpenSearch or Meilisearch | BM25 lexical channel | CPU + SSD |
| BGE reranker v2 | Cross-encoder rerank of top-50 candidates | GPU (5060 Ti) |
| Llama 3.1 8B FP8 (optional) | Synthesised answer with citations | GPU (5060 Ti) |
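The table above can be wired up as a single-node deployment. The sketch below is illustrative only: image tags, port mappings and volume paths are assumptions, not a tested configuration (TEI serves both the embedder and, in a second instance, the cross-encoder reranker).

```yaml
# Hypothetical docker-compose sketch for the GPU-hosted components.
services:
  tei-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    command: --model-id BAAI/bge-base-en-v1.5
    ports: ["8080:80"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  tei-rerank:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    command: --model-id BAAI/bge-reranker-v2-m3
    ports: ["8081:80"]
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes: ["./qdrant_data:/qdrant/storage"]
```

The BM25 channel (OpenSearch or Meilisearch) and the optional LLM server run as further services on the same host; only the TEI instances and the LLM need the GPU.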
Indexing
| Embedder | Throughput (5060 Ti) | 100M docs = |
|---|---|---|
| BGE-base (768-dim) | ~10,000 texts/s batched | ~2.8 h of GPU wall time |
| BGE-M3 (1024-dim, multilingual) | ~5,000 texts/s batched | ~5.6 h |
| BGE-small (384-dim) | ~20,000 texts/s batched | ~1.4 h |
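The wall-time column in the table is just corpus size divided by batched throughput. A minimal sketch of that arithmetic (function name is ours; it ignores I/O, which in practice overlaps with GPU work):

```python
def index_wall_time_hours(num_docs: int, texts_per_second: float) -> float:
    """GPU wall time to embed a corpus at a given batched throughput."""
    return num_docs / texts_per_second / 3600

# 100M docs through BGE-base at ~10,000 texts/s
print(round(index_wall_time_hours(100_000_000, 10_000), 1))  # -> 2.8
# BGE-M3 at ~5,000 texts/s
print(round(index_wall_time_hours(100_000_000, 5_000), 1))   # -> 5.6
```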
Storage dominates at scale: 100M 1024-dim vectors are ~400 GB raw at float32, or ~200 GB at float16. Qdrant's scalar int8 quantisation shrinks the index 4x versus float32 at roughly 2 percent recall loss, enough to fit on commodity NVMe.
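The storage figures follow directly from vectors x dimensions x bytes per component (raw vector data only, excluding HNSW graph and metadata overhead):

```python
def vector_storage_gb(num_vectors: int, dims: int, bytes_per_component: int) -> float:
    """Raw vector storage in GB, before index and payload overhead."""
    return num_vectors * dims * bytes_per_component / 1e9

print(vector_storage_gb(100_000_000, 1024, 4))  # float32 -> 409.6
print(vector_storage_gb(100_000_000, 1024, 2))  # float16 -> 204.8
print(vector_storage_gb(100_000_000, 1024, 1))  # int8    -> 102.4
```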
Query latency
| Stage | Latency |
|---|---|
| Query embed (BGE-base) | 5 ms |
| BM25 top-50 | 10-30 ms |
| HNSW top-50 | 5-20 ms |
| Reciprocal rank fusion | <1 ms |
| BGE rerank top-50 -> top-10 | 30-60 ms |
| Total (search only) | ~60-120 ms |
| + LLM answer (300 tokens) | +2-3 s |
Hybrid retrieval
BM25 nails exact-phrase and jargon queries; dense vectors handle paraphrase. Combine them with Reciprocal Rank Fusion (k=60) before reranking. On typical mixed workloads RRF gives 10-15 percent better nDCG@10 than either channel alone. See our hybrid search guide.
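Reciprocal Rank Fusion needs no score normalisation across channels, which is why it works well for fusing BM25 and HNSW results. A minimal sketch, with hypothetical doc IDs:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over channels of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d1", "d2", "d3"]   # lexical channel
dense_top = ["d3", "d1", "d4"]  # vector channel
print(rrf([bm25_top, dense_top])[:2])  # -> ['d1', 'd3']
```

d1 ranks high in both channels, so it wins despite d3 topping the dense list; the fused top-50 then goes to the BGE reranker.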
AI answers
For “search + answer” UIs, pass the top-5 reranked chunks to Llama 3.1 8B FP8 with a citation-forcing prompt. This adds ~2-3 s of latency, but Blackwell’s 720 t/s aggregate throughput lets you run dozens of concurrent answer streams on one card.
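A citation-forcing prompt numbers each retrieved chunk and instructs the model to tag claims with those numbers. The wording below is our illustration, not the exact prompt from the article:

```python
def build_answer_prompt(query: str, chunks: list[str]) -> str:
    """Number each chunk and require bracketed [n] citation markers."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using ONLY the sources below. Cite every claim with its "
        "source number in brackets, e.g. [2]. If the sources do not "
        f"contain the answer, say so.\n\nSources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_answer_prompt("What index does Qdrant use?", ["Qdrant builds an HNSW graph."])
```

The resulting string goes to the LLM server as a normal completion request; bracketed markers in the output can then be mapped back to chunk IDs for clickable citations.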
Full search stack on one card
TEI, Qdrant, reranker, LLM on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: embedding throughput, RAG stack install, document Q&A, SaaS RAG, classification.