Together AI built one of the cleanest open-weight model APIs on the market: Llama, Mistral, Qwen, DeepSeek and more, all available via an OpenAI-compatible endpoint at competitive per-token prices. They are the default recommendation for teams that want hosted open-weight inference without operating any infrastructure.
That said, there are workloads where Together is not the right answer. This page covers the strongest alternatives.
In short:
- For cheaper hosted: Fireworks AI is comparable, sometimes cheaper.
- For data residency: self-host on a dedicated GPU like GigaGPU.
- For latency-critical workloads: Groq (LPU) or self-hosted in your region.
- For frontier-class quality: OpenAI / Anthropic remain stronger than open-weight models on the hardest tasks.
- For cost-anchored high volume: self-hosting wins above ~£1,500/mo of API spend.
Why look beyond Together AI
Common reasons we hear:
- Data residency. Together is US-hosted; UK/EU regulated workloads cannot always send prompts there.
- Custom fine-tunes. Together does support fine-tuning, but if your model needs deeper customisation (LoRA stacks, full SFT) you will want self-hosting.
- Token volume. Above roughly 1B tokens/month the per-token bill starts to exceed a dedicated GPU rental (see the break-even sketch after this list).
- Latency. US-hosted means 80–150 ms RTT from EU before any inference happens.
- Rate limits. Tier-based; new accounts hit walls on burst traffic.
- Model availability. Together rotates which models they host. A model your application depends on can disappear.
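A rough break-even sketch for the volume point above. Both figures are illustrative, taken from the comparison matrix and catalogue pricing later on this page; plug in your own quotes:

```python
# Break-even volume: fixed monthly GPU rental vs per-token API pricing.
# Both figures are illustrative, not live quotes.
api_rate_gbp_per_1m = 0.66   # Together's Llama 3 70B class rate (see matrix below)
gpu_monthly_gbp = 899.0      # a dedicated multi-GPU box (see catalogue below)

breakeven_millions = gpu_monthly_gbp / api_rate_gbp_per_1m
print(f"Break-even at ~{breakeven_millions:,.0f}M tokens/month")  # ≈ 1,362M, i.e. ~1.4B
```

Below that volume the API is cheaper; above it, the fixed box wins, and the gap widens with every additional token.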
Hosted alternatives (per-token APIs)
Fireworks AI
Closest peer to Together. Similar model selection, similar pricing, sometimes faster on specific cards. OpenAI-compatible API. Solid second-source. See our Fireworks alternatives for when even Fireworks is not right.
Groq (LPU)
Custom Language Processing Unit hardware delivering 500–1500 tok/s on Llama 3 70B. Latency-sensitive workloads (voice agents, real-time copilots) benefit massively. Pricing per-token is competitive with Together. Limited model selection.
Cerebras Inference
Wafer-scale chips. Even faster than Groq on supported models. Llama 3.1 / 3.3, Qwen, DeepSeek-R1. Pricing sits somewhat above Together for 70B-class models (see the matrix below).
OpenAI / Anthropic / Google
Closed-source frontier APIs. More expensive, generally higher quality on hardest tasks. Pick when quality > cost or when you need specific capabilities (Claude's coding, GPT-4o vision).
Hyperbolic
Newer entrant. Focus on open-weight models with aggressive pricing. Worth shortlisting for cost-anchored deployments.
DeepInfra
Long-running open-weight host. Stable pricing, broad model selection, OpenAI-compatible. The boring-but-reliable choice.
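Every hosted option above speaks the OpenAI protocol, which makes second-sourcing cheap: client code stays identical and only the base URL, key, and model id change. A minimal sketch (model ids are provider-specific; the one shown is a plausible Fireworks-style id, so check the provider's catalogue for current names):

```python
from openai import OpenAI

# Point the standard OpenAI client at Fireworks' OpenAI-compatible endpoint.
# Failing over to Together means swapping base_url (api.together.xyz/v1),
# the API key, and the provider-specific model id; nothing else changes.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_KEY",
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-70b-instruct",  # provider-specific id
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```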
Self-hosted alternatives (dedicated GPU)
GigaGPU dedicated
UK-hosted bare-metal GPU servers, from RTX 3050 (£79/mo) through RTX 6000 Pro (£899/mo) and multi-GPU clusters. Fixed monthly pricing, full root access. The default self-hosted recommendation for any workload above ~£500/mo of Together usage. See the catalogue.
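What "self-hosted but still OpenAI-compatible" looks like in practice: vLLM exposes the same chat-completions endpoint as the hosted providers, so client code does not change. A sketch, assuming a two-GPU box; the hostname, port, and launch flags are illustrative:

```python
# On the server, start vLLM's OpenAI-compatible endpoint, e.g.:
#   vllm serve meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2 --port 8000
# Clients then talk to your box exactly as they would to a hosted provider:
from openai import OpenAI

client = OpenAI(base_url="http://your-gpu-box:8000/v1", api_key="unused")  # placeholder host
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```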
RunPod Pods
Per-hour GPU pods with persistent storage. Useful when you want self-hosted but do not want to commit to a month.
Lambda Reserved
1-year reservations on H100 / GH200 clusters. The right answer for serious training workloads.
Hybrid: a router + multiple backends
The pattern that works best for teams above ~£3,000/mo of API spend:
- Run your steady traffic (chat, embeddings, common queries) on dedicated GPU hardware
- Send spiky / occasional traffic to Together / Fireworks
- Send frontier-quality / vision / function-calling queries to OpenAI / Anthropic
- Use LiteLLM as the router with model-name-based fan-out (see the sketch below)
This routes the bulk of token volume (often ~80%) to the cheapest path while keeping the quality tail accessible. It is cost-effective and gives you redundancy.
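A minimal sketch of that router using LiteLLM's Python `Router` class. Endpoints and model ids are placeholders; provider keys come from environment variables, and LiteLLM's fallback, cooldown, and proxy-server features are omitted here:

```python
from litellm import Router

router = Router(
    model_list=[
        # Steady traffic: self-hosted vLLM box, reached via its OpenAI-compatible API.
        {
            "model_name": "llama-70b",
            "litellm_params": {
                "model": "openai/meta-llama/Meta-Llama-3-70B-Instruct",
                "api_base": "http://your-gpu-box:8000/v1",  # placeholder host
                "api_key": "unused",
            },
        },
        # Spillover: the same logical model served by Together.
        {
            "model_name": "llama-70b",
            "litellm_params": {
                "model": "together_ai/meta-llama/Llama-3-70b-chat-hf",
            },
        },
        # Quality tail: a frontier model under its own logical name.
        {
            "model_name": "frontier",
            "litellm_params": {"model": "gpt-4o"},
        },
    ],
)

# Callers pick a logical model name; LiteLLM balances across its deployments.
resp = router.completion(
    model="llama-70b",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```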
Comparison matrix
| Provider | Pricing model | Llama 3 70B (£/1M tokens) | UK/EU residency | OpenAI-compatible |
|---|---|---|---|---|
| Together AI | Per-token | £0.66 | No (US) | Yes |
| Fireworks AI | Per-token | £0.71 | No (US) | Yes |
| Groq | Per-token | £0.59 | No (US) | Yes |
| Cerebras | Per-token | £0.85 | No (US) | Yes |
| DeepInfra | Per-token | £0.55 | No (US) | Yes |
| Hyperbolic | Per-token | £0.45 | No (US) | Yes |
| GigaGPU 2× RTX 5090 self-hosted | Fixed £899/mo | £0.95 (at 60% util) | UK | Yes, via vLLM |
| GigaGPU RTX 6000 Pro self-hosted | Fixed £899/mo | £1.61 (at 60% util) | UK | Yes, via vLLM |
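How the self-hosted per-1M figures are derived, as a sketch: divide the fixed monthly price by the tokens the box can actually serve. The throughput number below is an assumption for illustration, not a benchmark:

```python
# Effective £/1M tokens for a fixed-price box.
monthly_gbp = 899.0        # fixed monthly price (see rows above)
throughput_tok_s = 600.0   # ASSUMED aggregate batched throughput for Llama 3 70B
utilisation = 0.60         # fraction of the month actually spent serving traffic

tokens_per_month = throughput_tok_s * 30 * 24 * 3600 * utilisation
print(f"£{monthly_gbp / (tokens_per_month / 1e6):.2f} per 1M tokens")  # ≈ £0.96
```

Unlike per-token pricing, this figure falls as utilisation rises: the same box at 90% utilisation lands near £0.64/1M.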
Verdict
- Cheapest hosted Llama 3 70B: Hyperbolic (£0.45/1M) or DeepInfra (£0.55/1M).
- Fastest inference: Groq or Cerebras for raw tok/s.
- Most data control: self-hosted dedicated GPU. See private AI hosting.
- Best general second-source: Fireworks AI.
- Frontier quality: OpenAI gpt-4o or Anthropic Claude 3.5 Sonnet.
Bottom line
The right Together replacement depends on what you wanted Together for. Cost: Hyperbolic / DeepInfra. Data control: self-hosted GigaGPU. Latency: Groq. Quality: OpenAI / Anthropic. The most resilient architecture is multi-backend with a router; nobody who depends on a single inference provider sleeps well.