Most teams deploy RAG and never check whether it is actually retrieving the right documents. This page is the eval pipeline you should run weekly.
Three metrics tier-one teams measure: retrieval recall@10 (is the right doc in the top 10?), reranker precision@5 (what fraction of the reranked top 5 is relevant?), and end-to-end faithfulness (is the answer grounded in the retrieved context?). Run them on a hand-curated 200-question set weekly.
What to measure
- Retrieval recall@K — was the correct doc in the top K? The most important metric; see the sketch after this list.
- Reranker precision@N — after reranking to top-N, what fraction are relevant?
- Answer faithfulness — is the answer grounded in the retrieved docs, with no unsupported claims?
- Answer accuracy — is the answer correct (judged by a human or another LLM)?
- Citation accuracy — does the cited chunk actually support the claim?
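The first two metrics need no library at all. Below is a minimal sketch in plain Python; the `retriever.search` and `reranker.rerank` interfaces and the gold-set field names (`gold_doc_id`, `relevant_ids`) are hypothetical placeholders for whatever your own stack uses.

```python
def recall_at_k(gold_doc_id: str, retrieved_ids: list[str], k: int = 10) -> float:
    """1.0 if the gold document appears in the top-k retrieved IDs, else 0.0."""
    return 1.0 if gold_doc_id in retrieved_ids[:k] else 0.0

def precision_at_n(relevant_ids: set[str], reranked_ids: list[str], n: int = 5) -> float:
    """Fraction of the reranked top-n that are labeled relevant."""
    top_n = reranked_ids[:n]
    return sum(1 for doc_id in top_n if doc_id in relevant_ids) / n

def run_retrieval_eval(gold_set, retriever, reranker):
    """Aggregate both metrics over the 200-question gold set (schema assumed)."""
    recalls, precisions = [], []
    for item in gold_set:
        retrieved = retriever.search(item["question"], top_k=50)  # assumed interface
        recalls.append(recall_at_k(item["gold_doc_id"], [d.id for d in retrieved]))
        reranked = reranker.rerank(item["question"], retrieved)   # assumed interface
        precisions.append(precision_at_n(set(item["relevant_ids"]),
                                         [d.id for d in reranked]))
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)
```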
Eval pipeline setup
Tooling:
- Ragas — Python library that computes faithfulness and context-precision metrics with an LLM judge; see the sketch after this list
- Your own gold set — 200 hand-labeled question-answer-document triples to run Ragas against
- LLM-as-judge — Claude 3.5 Sonnet or GPT-4o for the judging step
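A minimal Ragas run against that gold set might look like the sketch below. This assumes the Ragas 0.1-style `evaluate` API and its `question`/`answer`/`contexts`/`ground_truth` dataset schema; the API has shifted between releases, so check the docs for your installed version. The sample row is purely illustrative.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, context_recall

# Build the eval dataset from your 200 hand-labeled triples
# (schema assumed to match Ragas 0.1.x).
gold = Dataset.from_dict({
    "question": ["What is our refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],  # RAG output
    "contexts": [["Policy: refunds accepted within 30 days of purchase."]],
    "ground_truth": ["30 days"],
})

# The LLM judge defaults to an OpenAI model; recent versions let you swap in
# another judge (e.g. Claude 3.5 Sonnet or GPT-4o) via evaluate()'s llm parameter.
scores = evaluate(gold, metrics=[faithfulness, context_precision, context_recall])
print(scores)  # e.g. {'faithfulness': 0.97, 'context_precision': 0.88, ...}
```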
Verdict
RAG eval is the boring infrastructure that makes RAG actually work. Run it weekly; treat regressions as bugs, and fail the run mechanically, as in the sketch below.
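One way to make "treat regressions as bugs" enforceable is a gate that compares each weekly run against the previous baseline. A minimal sketch, assuming you persist last week's scores as JSON; the file name, tolerance, and metric names are illustrative.

```python
import json
import sys

TOLERANCE = 0.02  # illustrative: fail the run if any metric drops > 2 points

def check_regressions(current: dict, baseline_path: str = "eval_baseline.json"):
    baseline = json.load(open(baseline_path))  # assumes a previous run wrote this
    failures = [
        f"{name}: {baseline[name]:.3f} -> {score:.3f}"
        for name, score in current.items()
        if name in baseline and score < baseline[name] - TOLERANCE
    ]
    if failures:
        print("Eval regression detected (treat as a bug):", *failures, sep="\n  ")
        sys.exit(1)  # fail CI / the weekly cron job
    json.dump(current, open(baseline_path, "w"))  # promote the new baseline

check_regressions({"recall@10": 0.91, "precision@5": 0.78, "faithfulness": 0.95})
```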
Bottom line
Without evals you cannot improve. See the RAG architecture guide for the deployment side.