Together.ai Made It Easy to Start — and Expensive to Scale
Together.ai’s developer experience is genuinely excellent. Sign up, pick a model, make an API call — you’re running inference in under five minutes. For prototyping, that speed is invaluable. But a SaaS company that built their product on Together.ai’s Llama 3.1 70B endpoint learned the scaling math doesn’t work in their favour. At $0.88 per million input tokens and $0.88 per million output tokens, their 15 million input and 15 million output tokens per day cost $26.40 — $792 per month. Manageable. Then their product gained traction, daily volume climbed to 120 million tokens each way, and the monthly bill hit $6,336. At that rate, roughly three and a half months of Together.ai spend would cover an entire year of running their own RTX 6000 Pro 96 GB with unlimited tokens.
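The arithmetic above is easy to reproduce. The helper below is a hypothetical sketch that assumes equal input and output volume, priced at Together.ai's listed $0.88 per million tokens in each direction:

```python
TOGETHER_PRICE_IN = 0.88   # $/M input tokens (Llama 3.1 70B)
TOGETHER_PRICE_OUT = 0.88  # $/M output tokens

def monthly_api_cost(tokens_per_day_millions: float, days: int = 30) -> float:
    """Monthly spend, assuming the given volume moves in each direction daily."""
    daily = tokens_per_day_millions * (TOGETHER_PRICE_IN + TOGETHER_PRICE_OUT)
    return daily * days

print(monthly_api_cost(15))   # the startup's early bill
print(monthly_api_cost(120))  # after traction
```

Plug in your own volumes before comparing against a fixed server bill.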
Migrating from Together.ai to a dedicated GPU is one of the highest-ROI infrastructure moves an AI startup can make. The API compatibility between Together.ai and self-hosted vLLM means the code changes are trivial — the savings are not.
Together.ai vs. Self-Hosted: What Changes
| Capability | Together.ai | Dedicated GPU + vLLM |
|---|---|---|
| Model access | Pre-hosted catalogue | Any open-source model |
| API format | OpenAI-compatible | OpenAI-compatible (vLLM) |
| Per-token cost | $0.88/M tokens (Llama 70B) | $0 after server cost |
| Rate limits | Tier-based throttling | None — limited only by GPU |
| Model customisation | Limited to available models | Any model, any quantisation, any config |
| Data privacy | Data processed on Together’s infra | Data never leaves your server |
| Latency | Variable (shared infrastructure) | Deterministic (dedicated hardware) |
Step-by-Step Migration
Step 1: Audit your Together.ai usage. Pull your usage dashboard data: which models you’re calling, token volumes per model, peak request rates, and average latency. This determines your GPU requirements. For Llama 3.1 70B serving 120M tokens/day, a single RTX 6000 Pro 96 GB handles the load comfortably.
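As a rough sizing check, daily volume can be translated into a sustained-throughput target. The 3x peak factor below is an illustrative assumption, not a measured ratio — replace it with the peak-to-average ratio from your own usage dashboard:

```python
def required_throughput_tok_s(tokens_per_day: float, peak_factor: float = 3.0) -> float:
    """Average tokens/sec over 24h, with headroom for peak hours."""
    average = tokens_per_day / 86_400  # seconds per day
    return average * peak_factor

# 120M tokens/day averages ~1,389 tok/s; size the GPU for peaks
print(round(required_throughput_tok_s(120_000_000)))
```

If the peak figure is within what the GPU benchmarks for your model and quantisation, a single card suffices.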
Step 2: Deploy vLLM on dedicated hardware. Provision a GigaGPU dedicated server and install vLLM. The critical advantage: vLLM serves an OpenAI-compatible API endpoint, which Together.ai also uses. Your client code barely changes:
```shell
# Install and launch vLLM
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --max-model-len 8192 \
  --port 8000 \
  --tensor-parallel-size 1
```
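Before pointing clients at the server, confirm the model actually loaded. A minimal sketch that parses the OpenAI-compatible `/v1/models` listing vLLM exposes (fetch the payload with `curl` or an HTTP client against port 8000):

```python
import json

def served_models(models_payload: str) -> list:
    """Extract model ids from an OpenAI-compatible GET /v1/models response."""
    return [m["id"] for m in json.loads(models_payload).get("data", [])]

# Shape of the listing; the id depends on what you launched
sample = '{"object": "list", "data": [{"id": "meta-llama/Llama-3.1-70B-Instruct", "object": "model"}]}'
print(served_models(sample))
```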
Step 3: Update your client configuration. The migration is a two-line change. Replace Together.ai’s base URL and API key with your self-hosted endpoint:
```python
from openai import OpenAI

# Before: Together.ai
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-key",
)

# After: Self-hosted on GigaGPU
client = OpenAI(
    base_url="http://your-gigagpu-server:8000/v1",
    api_key="not-needed",  # or your custom auth token
)
```
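To keep the rollback path a one-variable change, the endpoint choice can be driven by configuration rather than hard-coded. This sketch uses a hypothetical `LLM_BACKEND` environment variable (not a Together.ai or vLLM convention):

```python
import os
from dataclasses import dataclass

@dataclass
class Endpoint:
    base_url: str
    api_key: str

def resolve_endpoint() -> Endpoint:
    # Defaulting to self-hosted keeps Together.ai one env change away
    if os.environ.get("LLM_BACKEND") == "together":
        return Endpoint("https://api.together.xyz/v1",
                        os.environ.get("TOGETHER_API_KEY", ""))
    return Endpoint("http://your-gigagpu-server:8000/v1", "not-needed")
```

Construct the OpenAI client from `resolve_endpoint()` and flipping backends never touches application code.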
Step 4: Run parallel traffic. Route 10% of production requests to your self-hosted endpoint. Compare response quality (BLEU/ROUGE scores if applicable, or human evaluation), latency percentiles, and error rates. Most teams find response quality is identical since they’re running the same model weights.
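One simple way to split the 10% deterministically is hash-based bucketing, so the same request id always lands on the same backend and quality comparisons stay consistent. A sketch:

```python
import hashlib

def routes_to_selfhosted(request_id: str, fraction: float = 0.10) -> bool:
    """Stable bucketing: identical ids always route to the same backend."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < fraction * 10_000

# Over many requests, roughly 10% route to the self-hosted endpoint
share = sum(routes_to_selfhosted(f"req-{i}") for i in range(10_000)) / 10_000
print(share)
```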
Step 5: Full cutover. Shift 100% of traffic to self-hosted. Keep your Together.ai account active for 30 days as a fallback.
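During the 30-day window, the fallback can be automatic rather than manual. A minimal wrapper, with stub backends standing in for real client calls:

```python
def complete_with_fallback(primary, fallback, prompt: str) -> str:
    """Try the self-hosted endpoint first; fail over to Together.ai on error."""
    try:
        return primary(prompt)
    except Exception:
        # In production, log and alert here before silently failing over
        return fallback(prompt)

# Stubs for illustration; swap in real OpenAI-client calls
def flaky(prompt):
    raise RuntimeError("self-hosted down")

def stable(prompt):
    return f"fallback answer to: {prompt}"

print(complete_with_fallback(flaky, stable, "ping"))
```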
Handling Together.ai-Specific Features
Together.ai offers some features beyond raw inference that need replacement on self-hosted infrastructure:
- JSON mode: vLLM supports guided decoding with `response_format={"type": "json_object"}`. Works identically.
- Function calling: Available in vLLM with compatible models (Llama 3.1, Hermes-2 Pro). No code changes needed.
- Embedding endpoints: Deploy a separate embedding model (BGE, E5) alongside your LLM. Use the same vLLM instance or a lightweight FastAPI wrapper.
- Usage tracking: vLLM returns token counts in the response `usage` field. Pipe these to your analytics backend.
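Aggregating those `usage` fields is straightforward; a sketch over OpenAI-style response dicts:

```python
from collections import Counter

def tally_usage(responses) -> Counter:
    """Sum prompt/completion token counts across OpenAI-style responses."""
    totals = Counter()
    for r in responses:
        usage = r.get("usage", {})
        totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
        totals["completion_tokens"] += usage.get("completion_tokens", 0)
    return totals

sample = [
    {"usage": {"prompt_tokens": 120, "completion_tokens": 350}},
    {"usage": {"prompt_tokens": 80, "completion_tokens": 410}},
]
print(tally_usage(sample))
```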
For teams using multiple open-source models through Together.ai, dedicated hardware lets you run them all on the same server with intelligent VRAM management.
Cost Savings at Scale
| Daily Token Volume | Together.ai Monthly | GigaGPU Monthly | Annual Savings |
|---|---|---|---|
| 10M tokens/day | $528 | ~$1,800 | -$15,264 (Together cheaper) |
| 50M tokens/day | $2,640 | ~$1,800 | $10,080 |
| 120M tokens/day | $6,336 | ~$1,800 | $54,432 |
| 500M tokens/day | $26,400 | ~$3,600 (2x RTX 6000 Pro) | $273,600 |
The crossover point is roughly 34 million tokens per day (the table assumes equal input and output volume at $0.88 per million tokens each, an effective $1.76 per million). Below that, Together.ai’s simplicity wins. Above it, the savings compound rapidly. Use the LLM cost calculator for your exact volume, and the GPU vs API cost comparison tool for a detailed breakdown.
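The crossover itself is a one-line calculation. This sketch uses the table's ~$1,800/month GigaGPU figure; substitute your own hardware quote and blended token price:

```python
def crossover_millions_per_day(server_monthly: float = 1800.0,
                               blended_price_per_million: float = 1.76) -> float:
    """Daily volume (millions of tokens) where 30 days of API spend
    equals a fixed monthly server bill."""
    return server_monthly / (blended_price_per_million * 30)

print(round(crossover_millions_per_day(), 1))
```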
Same API, Different Economics
The beauty of migrating from Together.ai is that almost nothing changes from your application’s perspective. Same API format, same model, same responses. The only difference appears on your invoice. For more on the economic case, see our Together.ai alternative comparison. Browse related migration guides in the tutorials section, and explore private AI hosting for data-sensitive workloads. The alternatives overview covers more providers.
Unlimited Tokens, Fixed Price
Stop paying per-token for inference you could own. GigaGPU dedicated GPU servers run the same open-source models at a fraction of Together.ai’s cost — with zero rate limits.
Browse GPU Servers