
Migrate from Together.ai to Dedicated GPU: RAG Pipeline

Self-host your RAG pipeline on dedicated GPUs instead of Together.ai for lower latency, zero per-query token costs, and full control over embedding and generation models.

Each RAG Query Hits Together.ai Twice — and You’re Paying Both Times

A knowledge management platform processes 80,000 RAG queries daily. Each query makes two Together.ai API calls: one to the embedding endpoint for query vectorisation, and one to the LLM endpoint for answer generation. At their token volumes — roughly 50 million embedding tokens and 120 million generation tokens per month — the combined bill reached $4,100 monthly. But cost wasn’t the only pain. The two sequential API calls added 800-1200ms of network latency per query on top of model inference time. For users expecting instant answers from their internal knowledge base, the 2-3 second total response time felt sluggish. And during Together.ai’s occasional rate-limiting periods, query latency spiked to 8-10 seconds, triggering a wave of support tickets.

A RAG pipeline is inherently a multi-model workflow: embedding, retrieval, reranking, and generation all happen in sequence. When these components run on the same dedicated GPU server, inter-step latency drops from hundreds of milliseconds to microseconds, and the per-query cost drops to effectively zero.

Together.ai RAG vs. Self-Hosted RAG

| RAG Component | Together.ai Approach | Dedicated GPU Approach |
| --- | --- | --- |
| Query embedding | API call (~200ms network + inference) | Local inference (~5ms) |
| Document retrieval | Your vector DB (same either way) | Your vector DB (co-located) |
| Reranking | API call or skip (~200ms) | Local inference (~10ms) |
| Answer generation | API call (~600-1000ms) | Local vLLM (~200-400ms TTFT) |
| Total pipeline latency | 1.5-3.0 seconds | 0.4-0.8 seconds |
| Per-query cost (est.) | $0.0017 | $0 marginal |

Architecture for Self-Hosted RAG

The self-hosted RAG pipeline consolidates three Together.ai API calls into local model inference on a single server. Here’s the architecture:

Step 1: Deploy your models. On a GigaGPU dedicated server with an RTX 6000 Pro 96 GB, you can run all RAG components simultaneously:

# Component 1: Embedding model (runs on ~2GB VRAM)
# Using sentence-transformers for query/document embedding
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda:0")

# Component 2: Reranker (runs on ~1GB VRAM)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda:0")

# Component 3: LLM via vLLM (uses remaining VRAM)
# Note: 70B weights in FP16 (~140 GB) won't fit in 0.6 x 96 GB; point --model
# at a quantised checkpoint (e.g. an AWQ or GPTQ 4-bit build) for this budget
# Launch separately:
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-70B-Instruct \
#   --gpu-memory-utilization 0.6 --port 8001
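It's worth sanity-checking the VRAM arithmetic before committing to this layout. A back-of-envelope sketch, using the per-component estimates from the comments above; the 4-bit weight size (~0.55 bytes/param including overhead) is an assumption for illustration, not a measurement:

```python
# Rough VRAM budget for co-locating all three RAG components on one 96 GB GPU.
GPU_VRAM_GB = 96
embedder_gb = 2      # BGE-large embedding model (estimate from above)
reranker_gb = 1      # BGE reranker cross-encoder (estimate from above)
vllm_fraction = 0.6  # matches --gpu-memory-utilization 0.6

vllm_budget_gb = GPU_VRAM_GB * vllm_fraction      # weights + KV cache
llm_weights_gb = 70e9 * 0.55 / 1e9                # assumed 4-bit quantised 70B
kv_cache_gb = vllm_budget_gb - llm_weights_gb     # headroom left for KV cache

# All three components must fit alongside each other
assert embedder_gb + reranker_gb + vllm_budget_gb <= GPU_VRAM_GB
print(f"vLLM budget: {vllm_budget_gb:.1f} GB, "
      f"KV cache headroom: {kv_cache_gb:.1f} GB")
```

The KV-cache headroom is what determines how many concurrent requests vLLM can batch, so leaving roughly 15-20 GB free is deliberate, not waste.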

Step 2: Co-locate your vector database. Run your vector store (Qdrant, Milvus, Weaviate, or pgvector) on the same server or on a companion server in the same data centre. Keeping the vector DB co-located with the inference stack eliminates the network round-trip for retrieval:

# Qdrant running locally
docker run -d -p 6333:6333 -v /data/qdrant:/qdrant/storage qdrant/qdrant

# Full RAG query — all local
import requests, httpx

def rag_query(question: str) -> str:
    # Step 1: Embed query (local, ~5ms)
    query_vec = embedder.encode(question).tolist()

    # Step 2: Retrieve candidate documents (local Qdrant, ~10ms)
    # with_payload is required, or Qdrant omits the document text
    hits = requests.post(
        "http://localhost:6333/collections/docs/points/search",
        json={"vector": query_vec, "limit": 20, "with_payload": True}).json()

    # Step 3: Rerank and keep the 5 best, highest score first (local, ~10ms)
    pairs = [(question, hit["payload"]["text"]) for hit in hits["result"]]
    scores = reranker.predict(pairs)
    top_docs = [pairs[i][1] for i in scores.argsort()[::-1][:5]]

    # Step 4: Generate answer (local vLLM, ~200-400ms TTFT)
    context = "\n\n".join(top_docs)
    response = httpx.post(
        "http://localhost:8001/v1/chat/completions",
        json={"model": "meta-llama/Llama-3.1-70B-Instruct",
              "messages": [{"role": "user",
                  "content": f"Context:\n{context}\n\nQuestion: {question}"}],
              "max_tokens": 512},
        timeout=60.0)  # httpx's 5s default is too short for generation
    return response.json()["choices"][0]["message"]["content"]
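To confirm you're actually getting local-inference latencies, instrument each stage before and after migrating. A minimal, self-contained timing sketch; the `time.sleep` calls are stand-ins, so swap in the real pipeline steps:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, timings: dict):
    """Record wall-clock milliseconds for one pipeline stage."""
    t0 = time.perf_counter()
    yield
    timings[name] = (time.perf_counter() - t0) * 1000

timings = {}
with stage("embed", timings):
    time.sleep(0.005)   # stand-in for embedder.encode(question)
with stage("retrieve", timings):
    time.sleep(0.010)   # stand-in for the Qdrant search call
with stage("rerank", timings):
    time.sleep(0.010)   # stand-in for reranker.predict(pairs)

print({k: f"{v:.1f}ms" for k, v in timings.items()})
```

Logging these per-stage numbers in production is how you catch regressions, such as the vector DB quietly moving to a different host.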

Step 3: Migrate your application layer. Update your RAG orchestration code (LangChain, LlamaIndex, or custom) to point at local endpoints instead of Together.ai’s API. If you’re using vLLM’s OpenAI-compatible endpoint, the LLM portion requires only a base URL change.
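With an OpenAI-compatible client, the switch amounts to swapping connection details. A sketch comparing the two configurations; the Together.ai base URL is their OpenAI-compatible endpoint, and the local URL and model name match the vLLM launch command in Step 1:

```python
# Two client configurations: only the connection details differ.
together_cfg = {
    "base_url": "https://api.together.xyz/v1",
    "api_key": "<TOGETHER_API_KEY>",
    "model": "meta-llama/Llama-3.1-70B-Instruct",
}
local_cfg = {
    "base_url": "http://localhost:8001/v1",
    "api_key": "not-needed",  # vLLM ignores the key unless --api-key is set
    "model": "meta-llama/Llama-3.1-70B-Instruct",
}

# e.g. client = openai.OpenAI(base_url=local_cfg["base_url"],
#                             api_key=local_cfg["api_key"])
changed = {k for k in together_cfg if together_cfg[k] != local_cfg[k]}
print(sorted(changed))  # ['api_key', 'base_url']
```

The model name stays identical, which is what keeps prompt templates and evaluation baselines comparable across the migration.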

Step 4: Rebuild your index. If switching embedding models (e.g., from Together.ai’s hosted model to a locally-run BGE variant), re-embed your document corpus using the local embedding model. This is a one-time operation — run it as a batch job on the same GPU.
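A batching sketch for that one-time job. Here `embed_batch` and `upsert` are placeholders for `embedder.encode(...)` and a POST to the local Qdrant collection; the collection layout and batch size are illustrative assumptions:

```python
def batches(items, size):
    """Yield successive fixed-size chunks of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def reembed_corpus(docs, embed_batch, upsert, batch_size=256):
    """Re-embed every document and upsert vectors in batches.

    docs: list of {"id": ..., "text": ...}
    embed_batch: callable(list[str]) -> list[vector]
    upsert: callable(list[point dict]) -> None
    """
    total = 0
    for chunk in batches(docs, batch_size):
        vectors = embed_batch([d["text"] for d in chunk])
        upsert([{"id": d["id"], "vector": v, "payload": {"text": d["text"]}}
                for d, v in zip(chunk, vectors)])
        total += len(chunk)
    return total
```

In production, `embed_batch` would be `lambda texts: embedder.encode(texts).tolist()` and `upsert` a PUT to `http://localhost:6333/collections/docs/points`; batching keeps the GPU saturated instead of embedding one document at a time.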

Latency Breakdown

The performance improvement is dramatic because RAG pipelines are latency-sensitive and involve multiple sequential steps. Each step that previously required a network round-trip to Together.ai now runs locally:

  • Embedding latency: 200ms (API) drops to 5ms (local). Savings: 195ms per query.
  • Reranking latency: 200ms (API) drops to 10ms (local). Savings: 190ms per query.
  • Generation TTFT: 600-1000ms (API, shared infra) drops to 200-400ms (dedicated GPU). Savings: 400-600ms.
  • Total improvement: 60-75% faster end-to-end query processing.
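The bullet figures can be cross-checked with quick arithmetic. Taking midpoints of the ranges above (the midpoints themselves are an assumption for the estimate):

```python
# Per-query latency, midpoint estimates in milliseconds
api   = {"embed": 200, "rerank": 200, "gen_ttft": 800}  # Together.ai path
local = {"embed": 5,   "rerank": 10,  "gen_ttft": 300}  # dedicated GPU path

saved = sum(api.values()) - sum(local.values())
speedup = 1 - sum(local.values()) / sum(api.values())
print(saved)             # 885 ms shaved off every query
print(f"{speedup:.0%}")  # ~74% faster on these midpoints
```

At 80,000 queries/day, that 885ms saving adds up to roughly 20 hours of aggregate waiting time eliminated daily.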

For users of your knowledge base, this means answers appear in under a second instead of 2-3 seconds. That difference transforms the perceived quality of the product. Host your full open-source model stack on dedicated hardware for the fastest possible RAG experience.

Cost Comparison

| Query Volume | Together.ai Monthly | GigaGPU Monthly | Latency Improvement |
| --- | --- | --- | --- |
| 10,000 queries/day | ~$510 | ~$1,800 | 60-75% faster |
| 50,000 queries/day | ~$2,550 | ~$1,800 | 60-75% faster |
| 80,000 queries/day | ~$4,100 | ~$1,800 | 60-75% faster |
| 200,000 queries/day | ~$10,200 | ~$3,600 (2x RTX 6000 Pro) | 60-75% faster |

The breakeven is approximately 35,000 queries per day. Above that, dedicated hardware is both cheaper and faster. The GPU vs API cost comparison tool models your exact query patterns.
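The breakeven figure follows directly from the table's numbers. A quick check, assuming a 30-day month and the $0.0017/query estimate from earlier:

```python
COST_PER_QUERY = 0.0017   # Together.ai, per the per-query cost table above
SERVER_MONTHLY = 1800     # flat GigaGPU server cost

# Queries/day at which 30 days of API billing equals the server cost
breakeven_qpd = SERVER_MONTHLY / (COST_PER_QUERY * 30)
print(round(breakeven_qpd))  # ≈ 35294 queries/day
```

Above that volume the flat server cost wins on price alone; the latency advantage holds at any volume.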

RAG That Feels Instant

When every RAG component runs on the same hardware, the pipeline stops feeling like a chain of API calls and starts feeling like a single operation. Users get faster answers, you pay less per query, and your data never leaves your infrastructure — critical for knowledge bases containing private or sensitive information.

More resources: the Together.ai alternative comparison, the LLM cost calculator, and our tutorials section. The cost analysis section has deeper economic comparisons across providers.

Sub-Second RAG on Your Own Hardware

Consolidate your entire RAG pipeline — embedding, reranking, and generation — on a single GigaGPU dedicated server. Faster queries, zero per-token costs, complete data privacy.

Browse GPU Servers

Filed under: Tutorials


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
