Eval harnesses are the most important piece of production AI infrastructure that most teams build last. The right time to build one is before you ship; doing it after is harder because you're measuring a moving target.
Eval harness components: representative prompts (200-2000), expected outputs / grading rubric, automated grader (LLM-as-judge or rule-based), CI integration. Run on every model / prompt / RAG change. Gate production deploys on the eval score holding. Track scores over time to detect drift.
Components
- Prompt set: 200-2000 representative prompts covering production use cases
- Expected outputs / rubrics: per-prompt grading criteria; either reference outputs or a rubric for the LLM judge
- Grader: rule-based (regex / JSON validation) for structured outputs; LLM-as-judge (Claude / GPT-4o) for nuanced quality
- Aggregation: per-category and overall scores; track over time (a runner sketch follows this list)
- CI integration: runs on every PR touching prompts / models / RAG
- Production observation: optionally re-run a subset against live-traffic samples to detect drift
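To make the components concrete, here is a minimal runner sketch. It assumes a JSONL prompt set whose cases carry `id`, `category`, `prompt`, and grading metadata, and it takes `generate` (the system under test) and `grade` (whichever grader applies) as plain callables; every name here is illustrative, not a specific framework.

```python
import json
from collections import defaultdict
from typing import Callable

def run_eval(
    prompt_set_path: str,                 # JSONL: one eval case per line
    generate: Callable[[str], str],       # the system under test
    grade: Callable[[dict, str], float],  # returns a score in [0, 1]
) -> dict:
    """Run every case, then aggregate per-category and overall scores."""
    scores = defaultdict(list)
    with open(prompt_set_path) as f:
        for line in f:
            case = json.loads(line)       # {"id": ..., "category": ..., "prompt": ...}
            output = generate(case["prompt"])
            scores[case["category"]].append(grade(case, output))

    per_category = {cat: sum(s) / len(s) for cat, s in scores.items()}
    flat = [s for group in scores.values() for s in group]
    return {"overall": sum(flat) / len(flat), "per_category": per_category}
```

Persisting the returned dict per run, timestamped, is what makes score-over-time tracking possible.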
Grading
Three grading patterns:
- Exact match / pattern match: structured outputs (extraction, classification). Cheap, deterministic.
- LLM-as-judge: nuanced quality (writing, reasoning). Use a stronger model than the one being evaluated; when comparing two outputs, randomise their order in the judge prompt to avoid position bias. Grader sketches follow this list.
- Human spot check: 10-20% sample reviewed by a human; calibrates the LLM judge.
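Sketches of the first two patterns, under stated assumptions: the rule-based grader checks a structured-extraction output against an `expected` dict stored with the case, and the judge grader takes a `call_judge` callable that wraps whatever stronger model you use, since the exact client API varies by provider. Both functions are illustrative, not a specific library.

```python
import json
import random
from typing import Callable

def grade_json_extraction(case: dict, output: str) -> float:
    """Rule-based: output must be valid JSON matching the expected fields."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    expected = case["expected"]  # e.g. {"invoice_id": "INV-42", "total": 119.0}
    hits = sum(parsed.get(k) == v for k, v in expected.items())
    return hits / len(expected)

def grade_pairwise_judge(
    case: dict,
    output_a: str,
    output_b: str,
    call_judge: Callable[[str], str],  # assumed wrapper around a stronger model
) -> str:
    """LLM-as-judge: randomise A/B order so position bias cancels out."""
    first, second = output_a, output_b
    swapped = random.random() < 0.5
    if swapped:
        first, second = second, first
    verdict = call_judge(
        f"Task: {case['prompt']}\n\n"
        f"Response 1:\n{first}\n\nResponse 2:\n{second}\n\n"
        "Which response is better? Answer '1' or '2' only."
    )
    pick = "1" if verdict.strip().startswith("1") else "2"
    if swapped:  # map the verdict back through the swap to the original labels
        pick = "2" if pick == "1" else "1"
    return "a" if pick == "1" else "b"
```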
Automation
Run eval on:
- Every PR that touches prompts, models, RAG configs (CI)
- Daily on production-shadow traffic (drift detection)
- On-demand for ad-hoc experimentation
- Pre-deployment gate before promoting model versions (a gate sketch follows this list)
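For the pre-deployment gate, one simple pattern is to compare the current run's overall score against a stored baseline and fail the build on regression. A sketch, assuming a `results.json` written by the runner above and a `baseline.json` checked in from the last accepted run:

```python
import json
import sys

TOLERANCE = 0.02  # allowed score drop before the gate fails; tune per product

def gate(results_path: str = "results.json",
         baseline_path: str = "baseline.json") -> None:
    """Exit nonzero if the overall eval score regressed beyond TOLERANCE."""
    with open(results_path) as f:
        current = json.load(f)["overall"]
    with open(baseline_path) as f:
        baseline = json.load(f)["overall"]
    if current < baseline - TOLERANCE:
        print(f"FAIL: eval {current:.3f} vs baseline {baseline:.3f}")
        sys.exit(1)  # CI treats the nonzero exit as a failed check
    print(f"PASS: eval {current:.3f} (baseline {baseline:.3f})")

if __name__ == "__main__":
    gate()
```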
Verdict
Build the eval harness before shipping production AI. Without it, you can't safely change anything: every change is a quality gamble. With it, model upgrades, prompt tuning, and RAG iteration become routine engineering work.
Bottom line
Eval harness = first piece of production AI infra. See deployment checklist.