Picking an LLM by leaderboard is risky — your workload is not the leaderboard. Self-hosted eval lets you measure on your actual prompts, with your actual SLA, on your actual hardware.
Use lm-evaluation-harness for standard benchmarks (MMLU, MathQA, HumanEval). Add a custom 200-prompt eval set scored by an LLM judge (Claude 3.5 Sonnet or GPT-4o). Run weekly in CI; alert on >3% regression.
Standard benchmarks
pip install "lm-eval[vllm]"
lm-eval --model vllm \
--model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,quantization=fp8 \
--tasks mmlu,mathqa,humaneval \
--batch_size auto \
--output_path eval-results/
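
Each run writes a timestamped results JSON under the output path. A minimal sketch for pulling the per-task metrics back out (the exact file layout and metric names vary by harness version, so treat the schema access as an assumption):

import glob
import json

# Assumes lm-eval wrote one or more *.json result files under eval-results/
latest = sorted(glob.glob("eval-results/**/*.json", recursive=True))[-1]
with open(latest) as f:
    results = json.load(f)["results"]  # task name -> metric dict

for task, metrics in results.items():
    print(task, metrics)

Keep these files around: the same JSONs become the baseline for the CI comparison below.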
Custom evaluation
Build a 200-prompt set covering your workload distribution. Score each response with an LLM judge:
import openai

# Any OpenAI-compatible endpoint can act as the judge
# (base_url and credentials elided here).
judge = openai.OpenAI(base_url="https://api.anthropic.com/...", ...)

def score(prompt, response):
    judgement = judge.chat.completions.create(
        model="claude-3-5-sonnet-latest",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(p=prompt, r=response)}],
    )
    return parse_score(judgement.choices[0].message.content)
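
JUDGE_PROMPT and parse_score are left undefined above. One minimal sketch, assuming a 1-5 rubric where the judge ends its reply with a bare integer (tune the criteria to your workload):

import re

# Hypothetical rubric; {p} and {r} match the .format(p=..., r=...) call above.
JUDGE_PROMPT = """Grade the assistant's response to the user's prompt.

Prompt:
{p}

Response:
{r}

Rate correctness, completeness, and instruction-following from 1 (unusable)
to 5 (excellent). End your reply with the rating as a single integer."""

def parse_score(text: str) -> int:
    # Take the last standalone 1-5 digit in the judge's reply.
    matches = re.findall(r"\b[1-5]\b", text)
    if not matches:
        raise ValueError(f"no score found in judge output: {text!r}")
    return int(matches[-1])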
CI integration
Run on every model upgrade and weekly. Compare against baseline; alert if any benchmark drops >3%.
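
A minimal sketch of the regression gate, assuming the baseline and candidate runs are lm-eval results JSONs with a top-level "results" dict (the file paths are placeholders):

import json
import sys

THRESHOLD = 0.03  # fail the build on a >3% relative drop

def load_results(path):
    with open(path) as f:
        return json.load(f)["results"]

baseline = load_results("eval-results/baseline.json")
candidate = load_results("eval-results/candidate.json")

failed = False
for task, metrics in candidate.items():
    for name, value in metrics.items():
        base = baseline.get(task, {}).get(name)
        if isinstance(value, (int, float)) and isinstance(base, (int, float)) and base > 0:
            drop = (base - value) / base
            if drop > THRESHOLD:
                print(f"REGRESSION {task}/{name}: {base:.4f} -> {value:.4f} ({drop:.1%})")
                failed = True

sys.exit(1 if failed else 0)

Run it as the last step of the weekly job and of every model-upgrade PR; a non-zero exit fails the pipeline.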
Verdict
Without eval you can’t ship model upgrades safely. With eval you can confidently swap models monthly. Start with lm-evaluation-harness; add custom eval as you understand your workload.
Bottom line
Eval first, deploy second. See the RAG eval guide for the retrieval-side complement.