Evals Approach
Task 5: Evaluation Framework and Metrics
Evaluation Strategy
We use the RAGAS (Retrieval-Augmented Generation Assessment) framework to evaluate both the RAG pipeline and the agent system. RAGAS provides standardized, LLM-powered metrics that can be run against a test dataset to produce quantitative quality scores.
The evaluation is split into two tiers: RAG metrics that assess retrieval and generation quality, and Agent metrics that assess tool usage, goal achievement, and topic adherence. Both tiers use GPT-4o-mini as the evaluation LLM for consistent, cost-effective scoring.
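The two-tier split can be captured as a small configuration object. The sketch below is illustrative only: the `EvalTier` class, the `TIERS` registry, and the exact model id string are assumptions made for this example, while the metric names are the ones listed in the tables in this section.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed model id string for the evaluator LLM named above.
EVALUATOR_MODEL = "gpt-4o-mini"


@dataclass(frozen=True)
class EvalTier:
    """Hypothetical container describing one evaluation tier."""
    name: str
    evaluator_model: str
    metrics: Tuple[str, ...]


RAG_TIER = EvalTier(
    name="rag",
    evaluator_model=EVALUATOR_MODEL,
    metrics=(
        "LLMContextRecall",
        "LLMContextPrecision",
        "Faithfulness",
        "FactualCorrectness",
        "ResponseRelevancy",
        "ContextEntityRecall",
    ),
)

AGENT_TIER = EvalTier(
    name="agent",
    evaluator_model=EVALUATOR_MODEL,
    metrics=(
        "ToolCallAccuracy",
        "AgentGoalAccuracyWithReference",
        "TopicAdherenceScore",
    ),
)

# Look up a tier by name when deciding which metrics to run.
TIERS: Dict[str, EvalTier] = {tier.name: tier for tier in (RAG_TIER, AGENT_TIER)}
```

Keeping the evaluator model in one constant means both tiers stay on the same scoring LLM, which is what makes their scores comparable across runs.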
RAG Evaluation Metrics
| Metric | Description | Range |
|---|---|---|
| LLMContextRecall | Measures whether all relevant context was retrieved from the knowledge base. | 0.0 - 1.0 |
| LLMContextPrecision | Measures whether the retrieved context is precise and not diluted with irrelevant documents. | 0.0 - 1.0 |
| Faithfulness | Checks if the generated response is factually grounded in the retrieved context. | 0.0 - 1.0 |
| FactualCorrectness | Compares the claims made in the response against the reference answer to measure factual accuracy. | 0.0 - 1.0 |
| ResponseRelevancy | Measures how well the response addresses the original question. | 0.0 - 1.0 |
| ContextEntityRecall | Entity-level recall: the fraction of key entities in the reference that also appear in the retrieved context. | 0.0 - 1.0 |
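To make the 0.0 - 1.0 ranges concrete, the following sketch shows the arithmetic behind three of these scores as described in the RAGAS documentation: recall and faithfulness are ratios of supported claims, and context precision is a rank-aware average. It is a simplified illustration, not the RAGAS implementation. In RAGAS the per-claim and per-chunk verdicts come from the evaluation LLM, whereas here they are hard-coded, and the function names are hypothetical.

```python
from typing import Sequence


def supported_fraction(verdicts: Sequence[bool]) -> float:
    """Fraction of claims judged as supported by the retrieved context.

    With verdicts over reference claims this mirrors context recall;
    with verdicts over response claims it mirrors faithfulness.
    Returns 0.0 when there are no claims.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def context_precision(relevance: Sequence[bool]) -> float:
    """Mean of precision@k taken over the ranks k that hold a relevant chunk.

    `relevance[i]` says whether the chunk at rank i + 1 was judged relevant.
    Returns 0.0 when no chunk is relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Three of four reference claims are supported by the retrieved context.
print(supported_fraction([True, True, False, True]))  # 0.75

# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2
print(round(context_precision([True, False, True]), 3))  # 0.833
```

Because context precision is rank-aware, moving a relevant chunk higher in the retrieved list raises the score even when the set of retrieved chunks is unchanged.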
Agent Evaluation Metrics
| Metric | Description | Range |
|---|---|---|
| ToolCallAccuracy | Evaluates if the agent called the correct tools with the correct arguments. | 0.0 - 1.0 |
| AgentGoalAccuracyWithReference | Binary metric: does the agent's final outcome match the reference goal for the user's request? | 0 or 1 |
| TopicAdherenceScore | Measures if the agent stayed on financial analysis topics throughout the conversation. | 0.0 - 1.0 |
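As an illustration of what "correct tools with the correct arguments" can mean in practice, the sketch below scores a predicted tool-call sequence against a reference sequence using a strict, order-sensitive match. This is a simplified stand-in and not the scoring logic RAGAS uses. The function name, the `ToolCall` alias, and the tool names `get_stock_price` and `calculate_ratio` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# A tool call is modelled as (tool name, arguments); this shape is an assumption.
ToolCall = Tuple[str, Dict[str, Any]]


def strict_tool_call_accuracy(
    predicted: List[ToolCall], reference: List[ToolCall]
) -> float:
    """Fraction of positions where name and arguments both match exactly.

    Dividing by the longer sequence penalises extra and missing calls.
    Returns 0.0 when there is no reference to compare against.
    """
    if not reference:
        return 0.0
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / max(len(predicted), len(reference))


# Hypothetical tool names and arguments, for illustration only.
reference_calls: List[ToolCall] = [
    ("get_stock_price", {"ticker": "ACME"}),
    ("calculate_ratio", {"name": "pe"}),
]
predicted_calls: List[ToolCall] = [
    ("get_stock_price", {"ticker": "ACME"}),
    ("calculate_ratio", {"name": "ps"}),  # wrong argument value
]

print(strict_tool_call_accuracy(predicted_calls, reference_calls))  # 0.5
```

A strict match like this is easy to reason about but unforgiving: a correct call made at a different position scores zero for that position.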
Test Dataset
The evaluation test set consists of two sources:
- Manual test cases (backend/test_data/manual_test_cases.json): Hand-crafted question/answer/context triples covering all agent capabilities.
- Synthetic test cases: Generated using the RAGAS TestsetGenerator with a knowledge-graph approach over the business knowledge base documents.
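A minimal sketch of loading and validating the manual test cases follows. The schema is an assumption: each record is taken to be a JSON object with `question`, `answer`, and `contexts` fields, which may not match the actual layout of manual_test_cases.json. The `ManualTestCase` class, the `parse_test_cases` helper, and the sample record are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List

REQUIRED_FIELDS = {"question", "answer", "contexts"}  # assumed field names


@dataclass
class ManualTestCase:
    question: str
    answer: str
    contexts: List[str]


def parse_test_cases(raw: str) -> List[ManualTestCase]:
    """Parse a JSON array of test cases, failing fast on malformed records."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of test cases")
    cases = []
    for index, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(
                f"test case {index} is missing fields: {sorted(missing)}"
            )
        cases.append(
            ManualTestCase(
                question=record["question"],
                answer=record["answer"],
                contexts=list(record["contexts"]),
            )
        )
    return cases


# Made-up sample record, for illustration only.
sample_json = json.dumps(
    [
        {
            "question": "What was the gross margin last quarter?",
            "answer": "Gross margin was 40%.",
            "contexts": ["Quarterly report: gross margin was 40%."],
        }
    ]
)

cases = parse_test_cases(sample_json)
print(len(cases), cases[0].question)
```

In a real run the string would come from reading the test-case file from disk rather than from `sample_json`. Validating up front keeps a malformed record from surfacing later as a confusing metric failure.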