Each RAG Query Hits Together.ai Twice — and You’re Paying Both Times
A knowledge management platform processes 80,000 RAG queries daily. Each query makes two Together.ai API calls: one to the embedding endpoint for query vectorisation, and one to the LLM endpoint for answer generation. At their token volumes — roughly 50 million embedding tokens and 120 million generation tokens per month — the combined bill reached $4,100 monthly. But cost wasn’t the only pain. The two sequential API calls added 800-1200ms of network latency per query on top of model inference time. For users expecting instant answers from their internal knowledge base, the 2-3 second total response time felt sluggish. And during Together.ai’s occasional rate-limiting periods, query latency spiked to 8-10 seconds, triggering a wave of support tickets.
A RAG pipeline is inherently a multi-model workflow: embedding, retrieval, reranking, and generation all happen in sequence. When these components run on the same dedicated GPU server, inter-step latency drops from hundreds of milliseconds to microseconds, and the per-query cost drops to effectively zero.
Together.ai RAG vs. Self-Hosted RAG
| RAG Component | Together.ai Approach | Dedicated GPU Approach |
|---|---|---|
| Query embedding | API call (~200ms network + inference) | Local inference (~5ms) |
| Document retrieval | Your vector DB (same either way) | Your vector DB (co-located) |
| Reranking | API call or skip (~200ms) | Local inference (~10ms) |
| Answer generation | API call (~600-1000ms) | Local vLLM (~200-400ms TTFT) |
| Total pipeline latency | 1.5-3.0 seconds | 0.4-0.8 seconds |
| Per-query cost (est.) | $0.0017 | $0 marginal |
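The per-query estimate in the table can be sanity-checked against the figures from the opening example (80,000 queries/day, ~$4,100/month):

```python
# Back-of-envelope check for the Together.ai per-query cost in the table.
queries_per_day = 80_000
monthly_bill = 4_100  # USD, from the opening example

cost_per_query = monthly_bill / (queries_per_day * 30)
print(f"${cost_per_query:.4f} per query")  # ≈ $0.0017
```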
Architecture for Self-Hosted RAG
The self-hosted RAG pipeline consolidates three Together.ai API calls into local model inference on a single server. Here’s the architecture:
Step 1: Deploy your models. On a GigaGPU dedicated server with an RTX 6000 Pro 96 GB, you can run all RAG components simultaneously:
```python
# Component 1: Embedding model (runs on ~2GB VRAM)
# Using sentence-transformers for query/document embedding
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda:0")

# Component 2: Reranker (runs on ~1GB VRAM)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda:0")

# Component 3: LLM via vLLM (uses remaining VRAM)
# Launch separately:
#   python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-70B-Instruct \
#     --gpu-memory-utilization 0.6 --port 8001
```
Step 2: Co-locate your vector database. Run your vector store (Qdrant, Milvus, Weaviate, or pgvector) on the same server or on a companion server in the same data centre. Keeping the vector DB co-located with the inference stack eliminates the network round-trip for retrieval:
```shell
# Qdrant running locally
docker run -d -p 6333:6333 -v /data/qdrant:/qdrant/storage qdrant/qdrant
```

```python
# Full RAG query — all local
import requests, httpx

def rag_query(question: str) -> str:
    # Step 1: Embed query (local, ~5ms)
    query_vec = embedder.encode(question).tolist()

    # Step 2: Retrieve documents (local Qdrant, ~10ms);
    # with_payload=True is needed so the hits carry the document text
    hits = requests.post(
        "http://localhost:6333/collections/docs/points/search",
        json={"vector": query_vec, "limit": 20, "with_payload": True},
    ).json()

    # Step 3: Rerank and keep the five best, highest score first (local, ~10ms)
    pairs = [(question, hit["payload"]["text"]) for hit in hits["result"]]
    scores = reranker.predict(pairs)
    top_docs = [pairs[i][1] for i in scores.argsort()[::-1][:5]]

    # Step 4: Generate answer (local vLLM, ~200-400ms TTFT)
    context = "\n\n".join(top_docs)
    response = httpx.post(
        "http://localhost:8001/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "messages": [{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            }],
            "max_tokens": 512,
        },
        timeout=60.0,
    )
    return response.json()["choices"][0]["message"]["content"]

# Usage: answer = rag_query("What is our PTO policy?")
```
Step 3: Migrate your application layer. Update your RAG orchestration code (LangChain, LlamaIndex, or custom) to point at local endpoints instead of Together.ai’s API. If you’re using vLLM’s OpenAI-compatible endpoint, the LLM portion requires only a base URL change.
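Because the request body is identical for both providers, the swap reduces to changing the URL (and dropping the auth header). A minimal sketch of that change, with the Together.ai and local endpoints shown side by side (the helper name `chat_request` is illustrative, not part of any library):

```python
# The JSON body is the same for Together.ai and vLLM, because vLLM exposes
# an OpenAI-compatible API; only the URL and auth differ.
TOGETHER_URL = "https://api.together.xyz/v1/chat/completions"
LOCAL_URL = "http://localhost:8001/v1/chat/completions"

def chat_request(prompt: str, local: bool = True) -> dict:
    """Build the URL and JSON body for a chat completion call."""
    return {
        "url": LOCAL_URL if local else TOGETHER_URL,
        "json": {
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 512,
        },
    }

# Sending is then a single call either way:
# httpx.post(**chat_request("What is our PTO policy?"))
```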
Step 4: Rebuild your index. If switching embedding models (e.g., from Together.ai’s hosted model to a locally-run BGE variant), re-embed your document corpus using the local embedding model. This is a one-time operation — run it as a batch job on the same GPU.
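The re-embedding job is a plain batch loop. A minimal sketch, assuming the `embedder` from Step 1, a Qdrant collection named `docs`, and documents shaped as `{"id", "text"}` dicts (the helper names and the batch size of 256 are illustrative choices, not fixed requirements):

```python
import requests

QDRANT = "http://localhost:6333"

def build_points(embedder, batch):
    """Embed one batch of {"id", "text"} docs into Qdrant point dicts."""
    vectors = embedder.encode([d["text"] for d in batch])
    return [
        {"id": d["id"],
         "vector": list(map(float, v)),
         "payload": {"text": d["text"]}}
        for d, v in zip(batch, vectors)
    ]

def reembed_corpus(embedder, documents, batch_size=256):
    """One-time batch job: re-embed the corpus and upsert into `docs`."""
    for start in range(0, len(documents), batch_size):
        points = build_points(embedder, documents[start:start + batch_size])
        requests.put(f"{QDRANT}/collections/docs/points",
                     json={"points": points})
```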
Latency Breakdown
The performance improvement is dramatic because RAG pipelines are latency-sensitive and involve multiple sequential steps. Each step that previously required a network round-trip to Together.ai now runs locally:
- Embedding latency: 200ms (API) drops to 5ms (local). Savings: 195ms per query.
- Reranking latency: 200ms (API) drops to 10ms (local). Savings: 190ms per query.
- Generation TTFT: 600-1000ms (API, shared infra) drops to 200-400ms (dedicated GPU). Savings: 400-600ms.
- Total improvement: 60-75% faster end-to-end query processing.
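The headline number falls straight out of the per-step figures above; taking the midpoint of each generation range:

```python
# Per-step latency in ms, API vs local, from the breakdown above.
api = {"embed": 200, "rerank": 200, "generate_ttft": 800}   # midpoint of 600-1000
local = {"embed": 5, "rerank": 10, "generate_ttft": 300}    # midpoint of 200-400

api_total = sum(api.values())      # 1200 ms
local_total = sum(local.values())  # 315 ms
improvement = 1 - local_total / api_total
print(f"{improvement:.0%} faster")  # ≈ 74%, within the 60-75% range
```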
For users of your knowledge base, this means answers appear in under a second instead of 2-3 seconds. That difference transforms the perceived quality of the product. Host your full open-source model stack on dedicated hardware for the fastest possible RAG experience.
Cost Comparison
| Monthly Query Volume | Together.ai Monthly | GigaGPU Monthly | Latency Improvement |
|---|---|---|---|
| 10,000 queries/day | ~$510 | ~$1,800 | 60-75% faster |
| 50,000 queries/day | ~$2,550 | ~$1,800 | 60-75% faster |
| 80,000 queries/day | ~$4,100 | ~$1,800 | 60-75% faster |
| 200,000 queries/day | ~$10,200 | ~$3,600 (2x RTX 6000 Pro) | 60-75% faster |
The breakeven is approximately 35,000 queries per day. Above that, dedicated hardware is both cheaper and faster. The GPU vs API cost comparison tool models your exact query patterns.
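The breakeven figure follows directly from the per-query estimate and the server price in the tables above:

```python
cost_per_query = 0.0017  # Together.ai estimate from the comparison table
server_monthly = 1_800   # dedicated server, USD/month

breakeven_daily = server_monthly / cost_per_query / 30
print(round(breakeven_daily))  # ≈ 35,294 queries/day
```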
RAG That Feels Instant
When every RAG component runs on the same hardware, the pipeline stops feeling like a chain of API calls and starts feeling like a single operation. Users get faster answers, you pay less per query, and your data never leaves your infrastructure — critical for knowledge bases containing private or sensitive information.
More resources: the Together.ai alternative comparison, the LLM cost calculator, and our tutorials section. The cost analysis section has deeper economic comparisons across providers.
Sub-Second RAG on Your Own Hardware
Consolidate your entire RAG pipeline — embedding, reranking, and generation — on a single GigaGPU dedicated server. Faster queries, zero per-token costs, complete data privacy.
Browse GPU Servers

Filed under: Tutorials