Your Search Budget Doubled When Users Discovered the AI-Powered Results
Product launched the AI-enhanced search on a Tuesday. By Thursday, search queries were up 300% — users finally found what they were looking for on the first try, so they searched more often. Great for engagement metrics. Terrible for the budget. The Azure OpenAI embeddings and reranking pipeline behind the search upgrade cost $0.002 per query. At 50,000 daily queries, the original budget was $3,000 per month. At 200,000 daily queries, it hit $12,000. The growth team celebrated the engagement numbers. Finance asked why the AI budget quadrupled in a week. Nobody had modelled what happens when a search feature actually works well enough that people use it more.
Self-hosted search enhancement on a dedicated GPU inverts this problem: the marginal cost of each additional query is zero. Your monthly bill is fixed whether users send 50,000 or 5 million queries. Here’s how to migrate your search pipeline off Azure OpenAI.
Anatomy of an Azure OpenAI Search Pipeline
| Stage | Azure Component | Self-Hosted Alternative | Purpose |
|---|---|---|---|
| Query embedding | Azure OpenAI ada-002 | BGE / GTE / E5 models | Convert query to vector |
| Vector search | Azure AI Search | Keep or migrate to Qdrant/Weaviate | Find candidate results |
| Reranking | Azure OpenAI (GPT-3.5/4) | Cross-encoder or LLM reranker | Score result relevance |
| Answer generation | Azure OpenAI GPT-4 | Llama 3.1 70B via vLLM | Generate search answer |
| Query understanding | Azure OpenAI GPT-3.5 | Llama 3.1 8B | Parse intent, expand query |
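The table above can be read as a single call path. A minimal sketch, with hypothetical stubs standing in for the real embedding, vector search, reranking, and generation services:

```python
from typing import Callable, List

def search_pipeline(
    query: str,
    embed: Callable[[str], List[float]],                # query embedding stage
    vector_search: Callable[[List[float]], List[str]],  # candidate retrieval
    rerank: Callable[[str, List[str]], List[str]],      # relevance scoring
    generate: Callable[[str, List[str]], str],          # answer generation
) -> str:
    """Wire the four pipeline stages from the table into one call path."""
    vec = embed(query)
    candidates = vector_search(vec)
    ranked = rerank(query, candidates)
    return generate(query, ranked[:5])  # answer from the top-ranked results

# Stubs to show the data flow end to end; each lambda stands in for a service.
answer = search_pipeline(
    "reset password",
    embed=lambda q: [0.1, 0.2],
    vector_search=lambda v: ["doc-a", "doc-b", "doc-c"],
    rerank=lambda q, docs: sorted(docs),
    generate=lambda q, docs: f"Answer based on {docs[0]}",
)
```

Each stage is swappable, which is exactly what the migration below exploits: replace the Azure-backed implementations one at a time while the call path stays the same.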
Step-by-Step Migration
Step 1: Identify the expensive stages. In most search pipelines, embedding and reranking run on every query. Embeddings are high-volume but cheap per call; reranking every result set through a GPT model is both frequent and expensive per call, so the reranking stage typically dominates your Azure bill.
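A quick per-stage costing makes the imbalance visible. The unit prices below are assumptions for the sketch, not quoted Azure rates:

```python
# Illustrative per-stage costing for a month of search traffic.
# Unit prices are assumed for the sketch, not quoted Azure rates.
daily_queries = 200_000
days = 30

stages = {
    # stage: (calls per query, cost per call in USD)
    "embedding": (1, 0.00001),   # every query, cheap per call
    "reranking": (1, 0.0015),    # every result set, GPT-priced
    "generation": (0.3, 0.001),  # only queries that request an AI answer
}

monthly = {
    name: calls * cost * daily_queries * days
    for name, (calls, cost) in stages.items()
}
total = sum(monthly.values())
# In this model reranking is roughly 80% of total spend --
# migrate it first for the biggest saving.
```

Run the same arithmetic against your own invoice line items before deciding which stage to move first.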
Step 2: Provision your GPU. An RTX 6000 Pro 96 GB from GigaGPU can run your embedding model, reranking model, and answer generation model simultaneously. Search enhancement is one of the best multi-model use cases for a single powerful GPU.
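A back-of-envelope VRAM budget shows why all three models fit. The sizes below are rough weight-only approximations, and the 70B model is assumed to be quantised to 8-bit:

```python
# Rough VRAM budget for co-locating three models on one 96 GB card.
# Sizes are weight-only approximations, before KV cache and activations.
gpu_vram_gb = 96

models_gb = {
    "bge-large-en-v1.5 (embedding, FP16)": 1.3,
    "bge-reranker-v2-m3 (cross-encoder, FP16)": 1.2,
    "llama-3.1-70b (8-bit quantised)": 70.0,
}

weights = sum(models_gb.values())
headroom = gpu_vram_gb - weights  # left for KV cache, batching, activations
```

Roughly 20 GB of headroom remains for KV cache and batch processing, which is workable for search-length prompts; an unquantised 70B model would not fit alongside the other two.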
Step 3: Deploy the embedding model. Replace Azure OpenAI embeddings with a self-hosted model. BGE-large-en-v1.5 matches ada-002 quality on most benchmarks. Deploy via TEI (Text Embeddings Inference) for maximum throughput:
```bash
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5
```
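Once the container is up, TEI exposes an `/embed` endpoint that takes a list of texts and returns one vector per text. A minimal stdlib client, assuming the container above is reachable on localhost:8080:

```python
import json
import urllib.request

TEI_URL = "http://localhost:8080/embed"  # the container started above

def embed(texts):
    """POST texts to the TEI /embed endpoint; returns one vector per text."""
    payload = json.dumps({"inputs": texts}).encode()
    req = urllib.request.Request(
        TEI_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)
```

Note that BGE vectors live in a different embedding space from ada-002: similarity scores are only meaningful between vectors from the same model, which is why Step 5 re-indexes the whole corpus.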
Step 4: Deploy the reranker and LLM. Use vLLM for the answer generation model. For reranking, a cross-encoder model (like BGE-reranker-v2-m3) is far more efficient than using a full LLM — it scores relevance directly without generating text.
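The mechanics of cross-encoder reranking are simple: score each (query, document) pair directly, then sort. The sketch below uses a toy term-overlap scorer standing in for the real cross-encoder model:

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    docs: List[str],
    score: Callable[[str, str], float],  # scores one (query, doc) pair
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair and keep the top_k by relevance.

    A cross-encoder emits the score directly from the pair -- no text is
    generated, which is why it is far cheaper than an LLM reranker.
    """
    scored = [(doc, score(query, doc)) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy scorer standing in for the cross-encoder: query-term overlap.
def overlap_score(query: str, doc: str) -> float:
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

ranked = rerank(
    "reset account password",
    ["billing FAQ", "how to reset your account password", "contact us"],
    score=overlap_score,
    top_k=2,
)
```

In production, `score` would be a single forward pass of the cross-encoder per pair, batched across the candidate set.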
Step 5: Re-index your content. If you’re changing embedding models (from ada-002 to BGE), you need to re-embed your entire document corpus. On dedicated hardware, this is fast — a million documents in under an hour on an RTX 6000 Pro. Your vector database (whether you keep Azure AI Search or switch to Qdrant) just needs the new vectors.
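Re-embedding a corpus is throughput-bound, so the key is keeping the GPU saturated with fixed-size batches. A sketch, with the docs-per-second figure an assumed round number for the arithmetic:

```python
from typing import Iterable, Iterator, List

def batches(docs: Iterable[str], size: int = 64) -> Iterator[List[str]]:
    """Yield fixed-size batches so the GPU stays saturated during re-indexing."""
    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Throughput estimate: at an assumed ~300 docs/sec on a single GPU,
# one million documents re-embed in under an hour.
docs_per_sec = 300
hours_for_1m = 1_000_000 / docs_per_sec / 3600
```

Each yielded batch goes to the embedding endpoint, and the resulting vectors are upserted into the vector database under the same document IDs.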
Step 6: A/B test search quality. Run parallel search pipelines for one week. Compare click-through rates, search refinement rates (fewer refinements = better results), and time-to-click. Open-source models typically match or exceed Azure OpenAI for search-specific tasks.
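Refinement rate is the easiest of these metrics to compute from raw logs: a refinement is any follow-up query in the same session. A sketch with hypothetical session data:

```python
# Compare search refinement rate between the two A/B test arms.
# A refinement is a follow-up query in the same session: fewer is better.
def refinement_rate(sessions):
    """sessions: per-session query counts. Returns share of follow-up queries."""
    total_queries = sum(sessions)
    refinements = sum(count - 1 for count in sessions)
    return refinements / total_queries if total_queries else 0.0

azure_arm = [3, 2, 4, 1, 2]       # hypothetical per-session query counts
selfhosted_arm = [1, 2, 1, 1, 2]  # hypothetical per-session query counts

improved = refinement_rate(selfhosted_arm) < refinement_rate(azure_arm)
```

Compute the same statistic over a full week of real sessions per arm before declaring a winner; day-level noise in search traffic is large.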
Search-Specific Optimisations
Search queries demand low latency — users expect results in under 500ms. Self-hosting gives you architectural advantages Azure can’t match:
- Co-located models: Your embedding, reranking, and generation models all run on the same machine, so there are no network hops between services. End-to-end pipeline latency can drop from roughly 800ms to under 200ms.
- Hybrid search: Combine vector search with keyword search locally. On Azure, this requires coordinating between multiple services.
- Custom reranking: Fine-tune your reranker on click-through data from your specific search logs. This is the single highest-impact improvement for search quality.
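The hybrid-search point can be made concrete with reciprocal rank fusion (RRF), a standard way to merge a vector ranking with a keyword ranking; the document IDs below are illustrative:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Merge several ranked lists with reciprocal rank fusion.

    Each list is ordered best-first; a document's fused score is the sum of
    1 / (k + rank) over the lists it appears in. k=60 is the common default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc-7", "doc-2", "doc-9"]   # nearest-neighbour order
keyword_hits = ["doc-2", "doc-4", "doc-7"]  # BM25 order

fused = rrf_fuse([vector_hits, keyword_hits])
```

Documents that appear in both lists float to the top, which is the behaviour you want: keyword search catches exact terms the embedding model glosses over, and vice versa.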
Cost Comparison
| Daily Queries | Azure OpenAI Pipeline | GigaGPU Self-Hosted | Monthly Savings |
|---|---|---|---|
| 50K | ~$3,000/month | ~$1,800/month | $1,200 |
| 200K | ~$12,000/month | ~$1,800/month | $10,200 |
| 500K | ~$30,000/month | ~$1,800/month | $28,200 |
| 1M+ | ~$60,000+/month | ~$3,600/month (2x RTX 6000 Pro) | $56,400+ |
Model the exact crossover at GPU vs API cost comparison.
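The crossover volume follows directly from the numbers in the table above:

```python
# Daily query volume where a flat-rate GPU beats per-query API pricing.
api_cost_per_query = 0.002  # the blended pipeline rate from the opening example
gpu_flat_monthly = 1800.0   # single RTX 6000 Pro rental, per the table
days_per_month = 30

crossover_daily = gpu_flat_monthly / (api_cost_per_query * days_per_month)
# Above this daily volume, the fixed-price GPU is the cheaper option.
```

At these rates the crossover sits at 30,000 queries per day, which is why even the 50K row in the table already favours self-hosting.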
Build Search That Gets Better With Scale
The irony of per-query pricing for search is that success costs more. Self-hosting flips this: as your search gets better and usage grows, your per-query cost drops towards zero. That’s an economic model that aligns with your product goals rather than fighting them.
Related guides include the Azure copilot migration and OpenAI embeddings migration. The breakeven analysis quantifies the economics, while open-source LLM hosting covers model options. Use the LLM cost calculator for budgeting, and browse more guides in our tutorials section.
Search Enhancement at a Fixed Price
Embedding, reranking, and AI answers — all from a single dedicated GPU at a flat monthly rate. GigaGPU lets your search scale without your costs scaling alongside it.
Browse GPU Servers