Each RAG Query Hits Together.ai Twice — and You’re Paying Both Times
A knowledge management platform processes 80,000 RAG queries daily. Each query makes two Together.ai API calls: one to the embedding endpoint for query vectorisation, and one to the LLM endpoint for answer generation. At their token volumes — roughly 50 million embedding tokens and 120 million generation tokens per month — the combined bill reached $4,100 monthly. But cost wasn’t the only pain. The two sequential API calls added 800-1200ms of network latency per query on top of model inference time. For users expecting instant answers from their internal knowledge base, the 2-3 second total response time felt sluggish. And during Together.ai’s occasional rate-limiting periods, query latency spiked to 8-10 seconds, triggering a wave of support tickets.
A RAG pipeline is inherently a multi-model workflow: embedding, retrieval, reranking, and generation all happen in sequence. When these components run on the same dedicated GPU server, inter-step latency drops from hundreds of milliseconds to microseconds, and the per-query cost drops to effectively zero.
Together.ai RAG vs. Self-Hosted RAG
| RAG Component | Together.ai Approach | Dedicated GPU Approach |
|---|---|---|
| Query embedding | API call (~200ms network + inference) | Local inference (~5ms) |
| Document retrieval | Your vector DB (same either way) | Your vector DB (co-located) |
| Reranking | API call or skip (~200ms) | Local inference (~10ms) |
| Answer generation | API call (~600-1000ms) | Local vLLM (~200-400ms TTFT) |
| Total pipeline latency | 1.5-3.0 seconds | 0.4-0.8 seconds |
| Per-query cost (est.) | $0.0017 | $0 marginal |
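The per-query estimate in the table can be sanity-checked against the figures from the opening example (80,000 queries/day, ~$4,100/month):

```python
# Back-of-envelope check for the Together.ai per-query cost in the table.
queries_per_day = 80_000
monthly_bill = 4_100  # USD, from the opening example

cost_per_query = monthly_bill / (queries_per_day * 30)
print(f"${cost_per_query:.4f} per query")  # ≈ $0.0017
```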
Architecture for Self-Hosted RAG
The self-hosted RAG pipeline consolidates three Together.ai API calls into local model inference on a single server. Here’s the architecture:
Step 1: Deploy your models. On a GigaGPU dedicated server with an RTX 6000 Pro 96 GB, you can run all RAG components simultaneously:
```python
# Component 1: Embedding model (runs on ~2GB VRAM)
# Using sentence-transformers for query/document embedding
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda:0")

# Component 2: Reranker (runs on ~1GB VRAM)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda:0")

# Component 3: LLM via vLLM (uses remaining VRAM)
# Launch separately:
#   python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-70B-Instruct \
#     --gpu-memory-utilization 0.6 --port 8001
```
Step 2: Co-locate your vector database. Run your vector store (Qdrant, Milvus, Weaviate, or pgvector) on the same server or on a companion server in the same data centre. Keeping the vector DB co-located with the inference stack eliminates the network round-trip for retrieval:
```shell
# Qdrant running locally
docker run -d -p 6333:6333 -v /data/qdrant:/qdrant/storage qdrant/qdrant
```

```python
# Full RAG query — all local
import requests, httpx

def rag_query(question: str) -> str:
    # Step 1: Embed query (local, ~5ms)
    query_vec = embedder.encode(question).tolist()

    # Step 2: Retrieve documents (local Qdrant, ~10ms);
    # with_payload=True is needed so the hits carry the document text
    hits = requests.post(
        "http://localhost:6333/collections/docs/points/search",
        json={"vector": query_vec, "limit": 20, "with_payload": True},
    ).json()

    # Step 3: Rerank and keep the five best, highest score first (local, ~10ms)
    pairs = [(question, hit["payload"]["text"]) for hit in hits["result"]]
    scores = reranker.predict(pairs)
    top_docs = [pairs[i][1] for i in scores.argsort()[::-1][:5]]

    # Step 4: Generate answer (local vLLM, ~200-400ms TTFT)
    context = "\n\n".join(top_docs)
    response = httpx.post(
        "http://localhost:8001/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "messages": [{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            }],
            "max_tokens": 512,
        },
        timeout=60.0,
    )
    return response.json()["choices"][0]["message"]["content"]

# Usage: answer = rag_query("What is our PTO policy?")
```
Step 3: Migrate your application layer. Update your RAG orchestration code (LangChain, LlamaIndex, or custom) to point at local endpoints instead of Together.ai’s API. If you’re using vLLM’s OpenAI-compatible endpoint, the LLM portion requires only a base URL change.
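Because the request body is identical for both providers, the swap reduces to changing the URL (and dropping the auth header). A minimal sketch of that change, with the Together.ai and local endpoints shown side by side (the helper name `chat_request` is illustrative, not part of any library):

```python
# The JSON body is the same for Together.ai and vLLM, because vLLM exposes
# an OpenAI-compatible API; only the URL and auth differ.
TOGETHER_URL = "https://api.together.xyz/v1/chat/completions"
LOCAL_URL = "http://localhost:8001/v1/chat/completions"

def chat_request(prompt: str, local: bool = True) -> dict:
    """Build the URL and JSON body for a chat completion call."""
    return {
        "url": LOCAL_URL if local else TOGETHER_URL,
        "json": {
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 512,
        },
    }

# Sending is then a single call either way:
# httpx.post(**chat_request("What is our PTO policy?"))
```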
Step 4: Rebuild your index. If switching embedding models (e.g., from Together.ai’s hosted model to a locally-run BGE variant), re-embed your document corpus using the local embedding model. This is a one-time operation — run it as a batch job on the same GPU.
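The re-embedding job is a plain batch loop. A minimal sketch, assuming the `embedder` from Step 1, a Qdrant collection named `docs`, and documents shaped as `{"id", "text"}` dicts (the helper names and the batch size of 256 are illustrative choices, not fixed requirements):

```python
import requests

QDRANT = "http://localhost:6333"

def build_points(embedder, batch):
    """Embed one batch of {"id", "text"} docs into Qdrant point dicts."""
    vectors = embedder.encode([d["text"] for d in batch])
    return [
        {"id": d["id"],
         "vector": list(map(float, v)),
         "payload": {"text": d["text"]}}
        for d, v in zip(batch, vectors)
    ]

def reembed_corpus(embedder, documents, batch_size=256):
    """One-time batch job: re-embed the corpus and upsert into `docs`."""
    for start in range(0, len(documents), batch_size):
        points = build_points(embedder, documents[start:start + batch_size])
        requests.put(f"{QDRANT}/collections/docs/points",
                     json={"points": points})
```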
Latency Breakdown
The performance improvement is dramatic because RAG pipelines are latency-sensitive and involve multiple sequential steps. Each step that previously required a network round-trip to Together.ai now runs locally:
- Embedding latency: 200ms (API) drops to 5ms (local). Savings: 195ms per query.
- Reranking latency: 200ms (API) drops to 10ms (local). Savings: 190ms per query.
- Generation TTFT: 600-1000ms (API, shared infra) drops to 200-400ms (dedicated GPU). Savings: 400-600ms.
- Total improvement: 60-75% faster end-to-end query processing.
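The headline number falls straight out of the per-step figures above; taking the midpoint of each generation range:

```python
# Per-step latency in ms, API vs local, from the breakdown above.
api = {"embed": 200, "rerank": 200, "generate_ttft": 800}   # midpoint of 600-1000
local = {"embed": 5, "rerank": 10, "generate_ttft": 300}    # midpoint of 200-400

api_total = sum(api.values())      # 1200 ms
local_total = sum(local.values())  # 315 ms
improvement = 1 - local_total / api_total
print(f"{improvement:.0%} faster")  # ≈ 74%, within the 60-75% range
```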
For users of your knowledge base, this means answers appear in under a second instead of 2-3 seconds. That difference transforms the perceived quality of the product. Host your full open-source model stack on dedicated hardware for the fastest possible RAG experience.
Cost Comparison
| Monthly Query Volume | Together.ai Monthly | GigaGPU Monthly | Latency Improvement |
|---|---|---|---|
| 10,000 queries/day | ~$510 | ~$1,800 | 60-75% faster |
| 50,000 queries/day | ~$2,550 | ~$1,800 | 60-75% faster |
| 80,000 queries/day | ~$4,100 | ~$1,800 | 60-75% faster |
| 200,000 queries/day | ~$10,200 | ~$3,600 (2x RTX 6000 Pro) | 60-75% faster |
The breakeven is approximately 35,000 queries per day. Above that, dedicated hardware is both cheaper and faster. The GPU vs API cost comparison tool models your exact query patterns.
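The breakeven figure follows directly from the per-query estimate and the server price in the tables above:

```python
cost_per_query = 0.0017  # Together.ai estimate from the comparison table
server_monthly = 1_800   # dedicated server, USD/month

breakeven_daily = server_monthly / cost_per_query / 30
print(round(breakeven_daily))  # ≈ 35,294 queries/day
```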
RAG That Feels Instant
When every RAG component runs on the same hardware, the pipeline stops feeling like a chain of API calls and starts feeling like a single operation. Users get faster answers, you pay less per query, and your data never leaves your infrastructure — critical for knowledge bases containing private or sensitive information.
More resources: the Together.ai alternative comparison, the LLM cost calculator, and our tutorials section. The cost analysis section has deeper economic comparisons across providers.
Sub-Second RAG on Your Own Hardware
Consolidate your entire RAG pipeline — embedding, reranking, and generation — on a single GigaGPU dedicated server. Faster queries, zero per-token costs, complete data privacy.
Browse GPU Servers

Filed under: Tutorials