Tutorials

Migrate from Together.ai to Dedicated GPU: API Serving

Replace Together.ai's managed API with self-hosted LLM inference on dedicated GPUs for lower per-token costs, zero rate limits, and complete control over model configuration.

Together.ai Made It Easy to Start — and Expensive to Scale

Together.ai’s developer experience is genuinely excellent. Sign up, pick a model, make an API call: you’re running inference in under five minutes. For prototyping, that speed is invaluable. But one SaaS company that built its product on Together.ai’s Llama 3.1 70B endpoint learned that the scaling maths doesn’t work in its favour. At $0.88 per million input tokens and $0.88 per million output tokens, its 15 million daily tokens in each direction cost $26.40 per day, or $792 per month. Manageable. Then the product gained traction, daily volume climbed to 120 million tokens each way, and the monthly bill hit $6,336. At that point, less than four months of Together.ai spend would cover a full year on a dedicated RTX 6000 Pro 96 GB with unlimited tokens.

Migrating from Together.ai to a dedicated GPU is one of the highest-ROI infrastructure moves an AI startup can make. The API compatibility between Together.ai and self-hosted vLLM means the code changes are trivial — the savings are not.

Together.ai vs. Self-Hosted: What Changes

| Capability | Together.ai | Dedicated GPU + vLLM |
|---|---|---|
| Model access | Pre-hosted catalogue | Any open-source model |
| API format | OpenAI-compatible | OpenAI-compatible (vLLM) |
| Per-token cost | $0.88/M tokens (Llama 70B) | $0 after server cost |
| Rate limits | Tier-based throttling | None (limited only by GPU) |
| Model customisation | Limited to available models | Any model, any quantisation, any config |
| Data privacy | Data processed on Together’s infra | Data never leaves your server |
| Latency | Variable (shared infrastructure) | Deterministic (dedicated hardware) |

Step-by-Step Migration

Step 1: Audit your Together.ai usage. Pull your usage dashboard data: which models you’re calling, token volumes per model, peak request rates, and average latency. This determines your GPU requirements. For Llama 3.1 70B serving 120M tokens/day, a single RTX 6000 Pro 96 GB handles the load comfortably, provided you run a quantised build (the full-precision 70B weights alone exceed 96 GB).
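The audit numbers convert directly into a throughput target for the GPU. A minimal sketch of the arithmetic, using this article’s 120M/day figure (the 3x peak-hour multiplier is an assumption, not a Together.ai statistic):

```python
# Rough sizing arithmetic: convert a daily token budget into the
# sustained and peak throughput the self-hosted server must deliver.
daily_tokens = 120_000_000   # tokens/day, taken from the usage dashboard
peak_to_avg = 3              # assumed peak-hour multiplier over the daily average

avg_tok_per_sec = daily_tokens / 86_400          # seconds in a day
peak_tok_per_sec = avg_tok_per_sec * peak_to_avg

print(f"average: {avg_tok_per_sec:,.0f} tok/s, peak: {peak_tok_per_sec:,.0f} tok/s")
```

An average of roughly 1,400 tok/s is well within what a single 96 GB card can sustain for a quantised 70B model under vLLM’s continuous batching, which is why one server suffices here.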

Step 2: Deploy vLLM on dedicated hardware. Provision a GigaGPU dedicated server and install vLLM. The critical advantage: vLLM serves an OpenAI-compatible API endpoint, which Together.ai also uses. Your client code barely changes:

# Install and launch vLLM
pip install vllm

# Llama 3.1 70B is ~140 GB of weights in FP16, so a single 96 GB card
# needs a quantised build; FP8 roughly halves the weight footprint.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --port 8000 \
  --tensor-parallel-size 1
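Before touching client code, it’s worth smoke-testing the new endpoint. vLLM exposes the standard OpenAI routes, so /v1/models should list the model you just launched. A stdlib-only check (the localhost URL is a placeholder for your server’s address):

```python
import json
import urllib.request

# Hypothetical address; replace with your GigaGPU server's host and port.
BASE = "http://localhost:8000"

try:
    # vLLM's OpenAI-compatible server answers GET /v1/models with the
    # list of models it is currently serving.
    with urllib.request.urlopen(f"{BASE}/v1/models", timeout=5) as resp:
        print(json.load(resp)["data"][0]["id"])
except OSError as err:
    print(f"endpoint not reachable yet: {err}")
```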

Step 3: Update your client configuration. The migration is a two-line change. Replace Together.ai’s base URL and API key with your self-hosted endpoint:

from openai import OpenAI

# Before: Together.ai
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-key"
)

# After: Self-hosted on GigaGPU
client = OpenAI(
    base_url="http://your-gigagpu-server:8000/v1",
    api_key="not-needed"  # or your custom auth token
)
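Because both endpoints speak the same wire format, the request body itself is identical. A stdlib sketch that builds the same chat-completion request either service would accept (the server hostname is a placeholder):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-format chat completion request; the body is the
    same whether base_url points at Together.ai or self-hosted vLLM."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # vLLM ignores the key by default
        },
    )

req = chat_request(
    "http://your-gigagpu-server:8000",   # hypothetical host
    "meta-llama/Llama-3.1-70B-Instruct",
    "Summarise vLLM in one sentence.",
)
# urllib.request.urlopen(req) would send it; the response JSON has the
# same choices/usage shape as Together.ai's.
```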

Step 4: Run parallel traffic. Route 10% of production requests to your self-hosted endpoint. Compare response quality (BLEU/ROUGE scores if applicable, or human evaluation), latency percentiles, and error rates. Most teams find response quality is identical since they’re running the same model weights.
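The 10% split can be as simple as a weighted random choice at the point where the client is constructed. A sketch (hostnames are placeholders; in production you would log which backend served each request alongside latency and error metrics):

```python
import random

TOGETHER_URL = "https://api.together.xyz/v1"
SELF_HOSTED_URL = "http://your-gigagpu-server:8000/v1"  # hypothetical host
CANARY_FRACTION = 0.10  # share of traffic routed to the self-hosted box

def pick_backend(rng: random.Random) -> str:
    """Route ~10% of requests to the self-hosted endpoint."""
    return SELF_HOSTED_URL if rng.random() < CANARY_FRACTION else TOGETHER_URL

# Over many requests the split converges on the configured fraction:
rng = random.Random(42)
picks = [pick_backend(rng) for _ in range(10_000)]
share = picks.count(SELF_HOSTED_URL) / len(picks)
print(f"self-hosted share: {share:.1%}")
```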

Step 5: Full cutover. Shift 100% of traffic to self-hosted. Keep your Together.ai account active for 30 days as a fallback.

Handling Together.ai-Specific Features

Together.ai offers some features beyond raw inference that need replacement on self-hosted infrastructure:

  • JSON mode: vLLM supports guided decoding with response_format={"type": "json_object"}. Works identically.
  • Function calling: Available in vLLM with compatible models (Llama 3.1, Hermes-2 Pro), though the server must be launched with tool calling enabled (--enable-auto-tool-choice plus a --tool-call-parser matching the model). Client code is unchanged.
  • Embedding endpoints: Deploy a separate embedding model (BGE, E5) alongside your LLM. Use the same vLLM instance or a lightweight FastAPI wrapper.
  • Usage tracking: vLLM returns token counts in the response usage field. Pipe these to your analytics backend.
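For usage tracking, the usage field has the same shape on both services, so the aggregation logic ports directly. A minimal sketch (field names follow the OpenAI response format both APIs return):

```python
from collections import defaultdict

# Running per-model token tally, fed from each response's `usage` field.
token_totals: dict[str, int] = defaultdict(int)

def record_usage(response: dict) -> None:
    """Fold one response's token counts into the per-model totals."""
    usage = response.get("usage", {})
    token_totals[response["model"]] += usage.get("total_tokens", 0)

# Example response fragment in the OpenAI-compatible shape:
record_usage({
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "usage": {"prompt_tokens": 412, "completion_tokens": 88, "total_tokens": 500},
})
print(dict(token_totals))
```

In production you would push these tallies to your analytics backend instead of printing them.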

For teams using multiple open-source models through Together.ai, dedicated hardware lets you run them all on the same server with intelligent VRAM management.

Cost Savings at Scale

| Daily Token Volume | Together.ai Monthly | GigaGPU Monthly | Annual Savings |
|---|---|---|---|
| 10M tokens/day | $528 | ~$1,800 | -$15,264 (Together cheaper) |
| 50M tokens/day | $2,640 | ~$1,800 | $10,080 |
| 120M tokens/day | $6,336 | ~$1,800 | $54,432 |
| 500M tokens/day | $26,400 | ~$3,600 (2x RTX 6000 Pro) | $273,600 |

Together.ai figures assume the stated volume in each direction (input and output) at $0.88/M tokens each, over a 30-day month.

The crossover point at this pricing is roughly 34 million tokens per day. Below that, Together.ai’s simplicity wins. Above it, the savings compound rapidly. Use the LLM cost calculator for your exact volume, and the GPU vs API cost comparison tool for a detailed breakdown.
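The crossover figure falls out of simple arithmetic. A sketch using this article’s numbers (the ~$1,800/month server cost is the table’s estimate, not a quoted price):

```python
# Break-even arithmetic behind the table above. Assumes equal input and
# output volumes billed at $0.88/M tokens each, over a 30-day month.
PRICE_PER_M = 0.88      # Together.ai, per million tokens, each direction
SERVER_MONTHLY = 1800   # assumed dedicated RTX 6000 Pro 96 GB cost

def together_monthly(daily_m_tokens: float) -> float:
    """Monthly Together.ai bill for the given daily volume (millions, each way)."""
    return daily_m_tokens * PRICE_PER_M * 2 * 30

breakeven = SERVER_MONTHLY / (PRICE_PER_M * 2 * 30)
print(f"120M/day on Together.ai: ${together_monthly(120):,.0f}/month")
print(f"break-even: {breakeven:.0f}M tokens/day")
```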

Same API, Different Economics

The beauty of migrating from Together.ai is that almost nothing changes from your application’s perspective. Same API format, same model, same responses. The only difference appears on your invoice. For more on the economic case, see our Together.ai alternative comparison. Browse related migration guides in the tutorials section, and explore private AI hosting for data-sensitive workloads. The alternatives overview covers more providers.

Unlimited Tokens, Fixed Price

Stop paying per-token for inference you could own. GigaGPU dedicated GPU servers run the same open-source models at a fraction of Together.ai’s cost — with zero rate limits.

Browse GPU Servers

Filed under: Tutorials
