For production RAG, you need to measure both retrieval quality (did we get the right chunks?) and generation quality (did the LLM use them well?). The metrics for each are different; missing either side leaves blind spots.
Retrieval: recall@K (did relevant chunks appear in top K?), MRR (where did the first relevant chunk appear?), NDCG (rank-weighted relevance). Generation: faithfulness (does answer match retrieved context?), answer relevance (does answer match question?). End-to-end: RAGAS framework combines all of these.
Retrieval metrics
- Recall@K: out of all known relevant chunks for a query, how many appear in the top K? The most important retrieval metric for RAG: a chunk that never reaches the top K can't inform the answer.
- Precision@K: out of the top K returned chunks, how many are relevant? Less critical, because the LLM can filter out noisy chunks.
- MRR (Mean Reciprocal Rank): the mean over queries of 1/rank of the first relevant result; measures whether relevant chunks rank high.
- NDCG (Normalised Discounted Cumulative Gain): rank-weighted relevance; the standard IR ranking metric (all four are computed in the sketch below).
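A minimal sketch of all four metrics for a single query, assuming binary relevance judgments and string chunk IDs (the example data is invented). Averaging reciprocal rank across queries gives MRR; averaging the others gives the dataset-level numbers.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of known-relevant chunk IDs that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result; 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: each hit is discounted by log2(rank + 1),
    normalised by the best possible ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, c in enumerate(retrieved[:k], start=1) if c in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# One query: ranked retrieval output vs. labelled relevant chunks.
retrieved = ["c7", "c2", "c9", "c1", "c4"]
relevant = {"c2", "c4", "c8"}
print(recall_at_k(retrieved, relevant, 5))   # 2/3: two of three relevant chunks found
print(reciprocal_rank(retrieved, relevant))  # 0.5: first hit at rank 2
```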
Generation metrics
- Faithfulness: does the answer make only claims grounded in the retrieved context, or does it hallucinate? The critical generation metric for RAG; a judge-model sketch follows this list.
- Answer relevance: does the answer actually address the question?
- Context relevance: were the retrieved chunks actually relevant to the question?
- Citation accuracy: do citations point to chunks that support the claim?
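Generation metrics like faithfulness are typically scored with an LLM as judge. Here is a minimal sketch assuming an OpenAI-compatible client; the prompt, model name, and scoring rubric are illustrative, and any judge model can be swapped in.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric: extract claims, label each against the context.
JUDGE_PROMPT = """You are grading faithfulness.
Context:
{context}

Answer:
{answer}

List each factual claim in the answer, then label it SUPPORTED or
UNSUPPORTED by the context. Finish with a line: SCORE: <supported>/<total>."""

def judge_faithfulness(answer: str, contexts: list[str],
                       model: str = "gpt-4o-mini") -> str:
    """Ask a judge model to check every claim in the answer against the
    retrieved contexts; returns the judge's verdict text."""
    prompt = JUDGE_PROMPT.format(context="\n---\n".join(contexts), answer=answer)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return resp.choices[0].message.content
```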
End-to-end
RAGAS (RAG Assessment) is a Python library that combines these metrics. It uses LLM-as-judge for the generation metrics; you supply the eval dataset (questions, reference answers, and the chunks marked relevant).
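A minimal sketch of a RAGAS run. The imports and column names follow the classic RAGAS 0.1.x API and may differ in newer releases; the question, answer, and policy text are invented.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# One eval record; in practice, your 200-500 query dataset goes here.
data = {
    "question": ["What is our refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["30 days."],  # reference answer, used by context_recall
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(result)  # per-metric scores for the dataset
```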
Run RAGAS in your CI pipeline; track metrics over time; gate deploys on regression.
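A hypothetical CI gate on top of those scores: fail the build if any headline metric drops below its threshold. The metric names and thresholds are illustrative; tune them against your own baseline.

```python
import sys

# Illustrative thresholds; set these from your tracked baseline runs.
THRESHOLDS = {"recall_at_10": 0.85, "faithfulness": 0.90, "answer_relevancy": 0.85}

def gate(scores: dict[str, float]) -> None:
    """Exit non-zero (failing the CI job) if any metric regresses."""
    failures = [f"{metric}: {scores.get(metric, 0.0):.3f} < {threshold:.2f}"
                for metric, threshold in THRESHOLDS.items()
                if scores.get(metric, 0.0) < threshold]
    if failures:
        sys.exit("Eval regression, blocking deploy:\n" + "\n".join(failures))

gate({"recall_at_10": 0.88, "faithfulness": 0.93, "answer_relevancy": 0.87})
```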
Verdict
For production RAG, measure retrieval and generation separately. Recall@10 + faithfulness + answer relevance are the high-leverage metrics. RAGAS is the right framework. Build your eval dataset of 200-500 representative queries with relevant chunks marked; that's the foundation everything else rests on.
Bottom line
Recall@K + faithfulness are the headline metrics. See RAG chunking.