LLM-as-judge: use a stronger model (e.g. Claude Opus, GPT-4o) to grade outputs from your production model. This is the standard pattern for nuanced quality evaluation where human rubric-based grading is expensive and binary metrics are inadequate. It has known biases that need active management.
Pass (input, output, optional reference or rubric) to a judge LLM; receive a structured score. Rapid evaluation at moderate cost (~£0.01-0.10 per judgment with a frontier API). Known biases: position bias (the first response is favoured), length bias (longer outputs win), and self-preference (a model favours outputs from its own family). Mitigations: randomise position, control for length, and judge with a different model family.
How it works
- For each test case, the production model produces an output
- The judge LLM receives: the prompt, the output, and an optional reference or rubric
- The judge produces a structured score (1-5 rating, pass/fail, or multi-dimensional)
- Aggregate scores across the test set; track them over time
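The loop above can be sketched as follows. This is a minimal illustration, not a real integration: `judge` is a hypothetical stand-in for a frontier-API call that would send a grading prompt and parse a 1-5 score from the reply.

```python
from statistics import mean

def judge(prompt: str, output: str, rubric: str) -> int:
    """Stub judge. In practice this calls a frontier LLM with a structured
    grading prompt and parses a 1-5 score from the response; the keyword
    check below is a placeholder for that call."""
    return 4 if rubric.lower() in output.lower() else 2

def run_eval(cases: list[dict], rubric: str) -> float:
    """Score every test case with the judge and aggregate to a mean.
    Each case holds the prompt and the production model's output."""
    scores = [judge(c["prompt"], c["output"], rubric) for c in cases]
    return mean(scores)
```

Tracking this mean per eval run over time is what makes regressions visible.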
Bias
- Position bias: when comparing two outputs, the judge tends to favour the first. Mitigation: randomise position; evaluate both orders.
- Length bias: longer outputs tend to score higher. Mitigation: normalise length in prompt; ask judge to ignore length; use rubric explicitly addressing concision.
- Self-preference: a judge from the same family as the production model rates its outputs higher. Mitigation: use a different family (e.g. judge with Claude when production is Llama).
- Sycophancy: the judge tends to agree with whatever the output asserts and hedges its own criticism. Mitigation: grade against a structured rubric, not open-ended prose.
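The evaluate-both-orders mitigation for position bias can be sketched like this. `judge_fn` is a hypothetical pairwise judge returning 1 if the first-shown output wins and 2 if the second wins; the wrapper only awards a win when it holds in both orders.

```python
def compare_both_orders(judge_fn, prompt: str, out_a: str, out_b: str) -> str:
    """Position-bias mitigation: judge the pair in both orders and
    count a win only if it is consistent across orders, else tie."""
    first_pass = judge_fn(prompt, out_a, out_b)   # A shown first
    second_pass = judge_fn(prompt, out_b, out_a)  # B shown first
    if first_pass == 1 and second_pass == 2:
        return "A"  # A won regardless of position
    if first_pass == 2 and second_pass == 1:
        return "B"  # B won regardless of position
    return "tie"    # inconsistent verdicts: position likely drove the result
```

A judge that always prefers whichever answer is shown first collapses to "tie" under this scheme, which is exactly the behaviour you want when the verdict is position-driven.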
Cost
- Eval set of 500 prompts × 1 judgment each × £0.05/judgment = £25 per eval run
- CI eval on every PR: ~£100-500/month
- Self-hosted judge (Llama 3.3 70B): much cheaper per judgment, but less reliable than a frontier judge
- Sampling: judge a subset for cheap continuous eval; run the full set before major decisions
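The cost arithmetic and the sampling strategy above are simple enough to encode directly. A small sketch, with the £0.05/judgment figure from the example and a hypothetical `sample_eval_set` helper; a fixed seed keeps the continuous-eval subset stable across runs so scores stay comparable.

```python
import random

def eval_run_cost(n_prompts: int, judgments_per_prompt: int = 1,
                  cost_per_judgment: float = 0.05) -> float:
    """Cost in GBP for one eval run: prompts x judgments x unit cost."""
    return n_prompts * judgments_per_prompt * cost_per_judgment

def sample_eval_set(cases: list, fraction: float, seed: int = 0) -> list:
    """Fixed random subset for cheap continuous eval. The seed pins the
    subset so per-run scores are comparable; use the full set for
    major-decision evals."""
    rng = random.Random(seed)
    k = max(1, int(len(cases) * fraction))
    return rng.sample(cases, k)
```

With the numbers above: `eval_run_cost(500)` gives £25.00 per full run, and a 10% sample cuts the continuous-eval cost to £2.50 per run.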
Verdict
LLM-as-judge is the standard primitive for nuanced evaluation in 2026. Manage its biases (position, length, self-preference). Use a frontier API for the highest-quality judgments and a self-hosted judge for continuous eval. Pair with a rubric or reference where possible to keep grades stable over time.
Bottom line
Frontier judge for quality; manage biases. See eval design.