Evaluating 30 Models on Together.ai Cost More Than the GPU to Run Them All
An AI consultancy needed to benchmark 30 open-source models across five evaluation suites for a client report. Using Together.ai’s API seemed logical — most models were already hosted there. They ran MMLU, HellaSwag, TruthfulQA, HumanEval, and a custom domain-specific benchmark across all 30 models. Each evaluation suite required thousands of inference calls per model. The total: approximately 4.5 million API calls generating 900 million tokens. Together.ai’s bill for this single evaluation project: $1,584. The evaluation took nine days due to rate limiting across multiple model endpoints. For the price of that one evaluation round, they could have leased an RTX 6000 Pro 96 GB for nearly a month and run unlimited evaluations on their own schedule.
Model evaluation is a throughput-intensive, repetitive workload — exactly the kind that dedicated GPU hardware handles most cost-effectively. Self-hosting evaluations also guarantees reproducibility, since you control the exact model weights, quantisation, and inference parameters.
Why Evaluation Needs Dedicated Infrastructure
| Evaluation Need | Together.ai Limitation | Dedicated GPU Advantage |
|---|---|---|
| Model coverage | Limited to Together’s catalogue | Any model on Hugging Face or custom |
| Evaluation speed | Rate-limited per model endpoint | Full GPU throughput, no throttling |
| Reproducibility | Backend quantisation may change | You control exact model config |
| Custom benchmarks | Requires API wrappers | Direct model access, any eval framework |
| Cost per evaluation run | $50-2,000+ (token-based) | $0 marginal on dedicated hardware |
| Concurrent evaluations | Rate limits per endpoint | Queue evaluations on local GPU |
Setting Up a Dedicated Evaluation Server
Step 1: Provision hardware. For evaluating models up to 70B parameters, a single RTX 6000 Pro 96 GB on GigaGPU handles most scenarios — note that 70B-class models need 8-bit or 4-bit quantisation to fit in 96 GB, while models up to roughly 30B run comfortably at FP16/BF16. For parallel evaluation of multiple models (e.g., running evals on a 70B while a 7B evaluates simultaneously), consider dual-GPU configurations.
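As a sanity check before provisioning, a back-of-envelope VRAM estimate helps pick the right precision. This is a rough sketch: the 15% overhead figure is an assumption, and real usage also depends on context length and batch size.

```python
# Rough VRAM estimate for loading a model at a given precision.
# Rule of thumb: weights = params * bytes_per_param, plus ~15%
# overhead for activations and the KV cache during evaluation.

def vram_estimate_gb(params_billions: float, bytes_per_param: float,
                     overhead: float = 0.15) -> float:
    """Return an approximate VRAM requirement in GB."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + overhead)

for precision, nbytes in [("FP16/BF16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    print(f"70B at {precision}: ~{vram_estimate_gb(70, nbytes):.0f} GB")
```

The arithmetic shows why a 70B model needs quantisation on a single 96 GB card: at FP16 the weights alone are ~140 GB, while a 4-bit variant fits with headroom.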
Step 2: Install evaluation frameworks. Set up your evaluation toolkit. The open-source ecosystem offers comprehensive options that don’t require API calls:
```shell
# lm-evaluation-harness — the standard for LLM benchmarking
pip install lm-eval

# Run MMLU on a local model
# (a 70B model at FP16 needs ~140 GB of weights — more than a single
# 96 GB card — so use a quantised variant or a multi-GPU setup)
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-70B-Instruct \
    --tasks mmlu \
    --batch_size auto \
    --output_path /results/llama-70b-mmlu/

# Run multiple benchmarks in sequence
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-70B-Instruct \
    --tasks mmlu,hellaswag,truthfulqa_mc2,winogrande \
    --batch_size auto \
    --output_path /results/llama-70b-full/
```
Step 3: Create an evaluation pipeline. Build an automated pipeline that downloads models, runs your evaluation suite, and stores results in a structured format:
```shell
#!/bin/bash
# evaluate_model.sh — evaluate a single model across all benchmarks
set -euo pipefail

MODEL="${1:?usage: evaluate_model.sh <hf-model-id>}"
OUTPUT_DIR="/results/$(echo "$MODEL" | tr '/' '_')"
mkdir -p "$OUTPUT_DIR"

BENCHMARKS="mmlu,hellaswag,truthfulqa_mc2,winogrande,arc_challenge"

# dtype=float16 assumes the model fits in VRAM at half precision;
# point at a quantised checkpoint for 70B-class models on a 96 GB card
lm_eval --model hf \
    --model_args "pretrained=$MODEL,dtype=float16" \
    --tasks "$BENCHMARKS" \
    --batch_size auto \
    --output_path "$OUTPUT_DIR"

# Generate summary
python /tools/summarise_results.py "$OUTPUT_DIR" >> /results/summary.csv
```
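The summarise_results.py helper referenced above is not shown in this guide; a minimal sketch of what it might look like follows. It assumes the harness wrote JSON files with a top-level "results" key mapping task names to metric dicts — verify against your lm-eval version's actual output schema before relying on it.

```python
#!/usr/bin/env python3
"""Sketch: flatten lm-eval JSON output into CSV rows.

Assumes each results file looks roughly like
{"model_name": "...", "results": {"mmlu": {"acc,none": 0.71}}} —
an assumption about the harness's schema, not a guarantee.
"""
import json
import sys
from pathlib import Path

def summarise(results_dir: str) -> list[str]:
    rows = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        data = json.loads(path.read_text())
        model = data.get("model_name", Path(results_dir).name)
        for task, metrics in data.get("results", {}).items():
            for metric, value in metrics.items():
                if isinstance(value, (int, float)):
                    # metric names like "acc,none" contain commas;
                    # swap them out so the CSV stays four columns wide
                    rows.append(f"{model},{task},{metric.replace(',', ';')},{value}")
    return rows

if __name__ == "__main__" and len(sys.argv) > 1:
    print("\n".join(summarise(sys.argv[1])))
```

Appending these rows to one summary.csv across all 30 models gives you a single table to sort by benchmark score.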
Step 4: Migrate your custom benchmarks. If you’ve built domain-specific evaluation suites that ran against Together.ai’s API, convert them to use local model inference. The lm-evaluation-harness supports custom tasks, or you can use vLLM’s Python API directly for custom evaluation logic.
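As a sketch of the second approach, here is a minimal custom benchmark built on vLLM's offline Python API. LLM and SamplingParams are real vLLM classes, but the dataset format, normalisation, and exact-match metric are illustrative assumptions, and run_eval needs a GPU to execute.

```python
# Minimal custom-benchmark sketch: greedy generation with vLLM,
# scored by exact match. score() is plain Python; run_eval() wraps
# vLLM's offline inference API and requires a GPU at runtime.

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().strip().split())

def score(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy after light normalisation."""
    hits = sum(normalise(p) == normalise(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

def run_eval(model_id: str, dataset: list[dict]) -> float:
    """dataset: [{"prompt": ..., "answer": ...}, ...] — a hypothetical format."""
    from vllm import LLM, SamplingParams  # imported here: needs a GPU
    llm = LLM(model=model_id)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate([ex["prompt"] for ex in dataset], params)
    preds = [o.outputs[0].text for o in outputs]
    return score(preds, [ex["answer"] for ex in dataset])
```

Swapping the scoring function is all it takes to port a domain-specific metric from your old API-based harness.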
Evaluation Workflow Advantages
Self-hosted evaluation on dedicated hardware changes how your team approaches model selection:
- Evaluate any model: Not limited to Together.ai’s catalogue. Test models from Hugging Face, custom fine-tunes, or unreleased checkpoints — including your own models.
- Run overnight sweeps: Queue 20 model evaluations on Friday evening, collect results Monday morning. Zero marginal cost per evaluation run.
- Consistent quantisation: Control exactly how each model is loaded — FP16, BF16, GPTQ, AWQ. Together.ai’s backend quantisation choices may differ from documented specifications.
- Custom metrics: Implement domain-specific evaluation metrics that require model internals (attention patterns, hidden states, logits analysis) — impossible through an API.
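For instance, a metric such as per-token log-likelihood or perplexity needs the raw logits that only a locally loaded model exposes. A framework-agnostic sketch in NumPy — in practice you would feed it logits from a model's forward pass; the dummy shapes here are illustrative:

```python
import numpy as np

def token_log_likelihoods(logits: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Per-token log-probabilities from raw logits of shape (seq_len, vocab).

    Hosted APIs rarely return full logits; a locally loaded model does.
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return log_probs[np.arange(len(token_ids)), token_ids]

def perplexity(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """exp of the mean negative log-likelihood over the sequence."""
    return float(np.exp(-token_log_likelihoods(logits, token_ids).mean()))
```

Uniform logits over a vocabulary of size V give a perplexity of exactly V, which is a handy sanity check when wiring this into a real model.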
Cost Comparison
| Evaluation Scenario | Together.ai Cost | GigaGPU Monthly | Evaluations per Month |
|---|---|---|---|
| 5 models, 3 benchmarks | ~$264 | ~$1,800 | Unlimited on dedicated |
| 15 models, 5 benchmarks | ~$792 | ~$1,800 | Unlimited on dedicated |
| 30 models, 5 benchmarks | ~$1,584 | ~$1,800 | Unlimited on dedicated |
| 30 models, 5 benchmarks, monthly | ~$19,008/year | ~$21,600/year | Comparable, dedicated more flexible |
| Continuous eval (weekly runs) | ~$6,336/month | ~$1,800/month | 72% savings on dedicated |
Depending on sweep size, dedicated hardware becomes cheaper somewhere between one and three full evaluation sweeps per month — a single 30-model sweep already costs most of a month’s lease. That cadence is common for teams that evaluate new model releases, fine-tune iterations, or maintain leaderboards. The LLM cost calculator can model your evaluation throughput requirements.
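Using the article’s own figures, the break-even point is easy to check — a sketch, with the ~$1,800/month lease price taken from the table above:

```python
def breakeven_sweeps(cost_per_sweep: float, monthly_lease: float) -> float:
    """Sweeps per month at which a fixed-price lease beats per-token billing."""
    return monthly_lease / cost_per_sweep

# $1,584 per 30-model/5-benchmark sweep vs ~$1,800/month dedicated
print(f"30-model sweep: break-even at {breakeven_sweeps(1584, 1800):.2f} sweeps/month")
# $792 per 15-model/5-benchmark sweep
print(f"15-model sweep: break-even at {breakeven_sweeps(792, 1800):.2f} sweeps/month")
```

Anything beyond those frequencies, and every additional evaluation run on dedicated hardware is free.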
Evaluation as a First-Class Capability
Moving model evaluation from Together.ai to dedicated hardware transforms it from an expensive project into a routine operation. When evaluating a new model costs nothing beyond your existing server, you evaluate more often, more thoroughly, and make better model selection decisions as a result.
Related resources: our Together.ai alternative comparison, vLLM hosting for serving your chosen models, and private AI hosting for evaluating models with proprietary data. The GPU vs API cost comparison covers the broader economics. Browse the tutorials and alternatives sections for more.
Evaluate Every Model, Every Week, for One Fixed Price
Dedicated GPU servers from GigaGPU turn model evaluation from a per-token expense into an unlimited capability. Run benchmarks on any model, any time.
Browse GPU Servers