Dysprosium Financial Assistant

Evals Approach

Task 5: Evaluation Framework and Metrics
Evaluation Strategy

We use the RAGAS (Retrieval-Augmented Generation Assessment) framework to evaluate both the RAG pipeline and the agent system. RAGAS provides standardized, LLM-powered metrics that can be run against a test dataset to produce quantitative quality scores.

The evaluation is split into two tiers: RAG metrics, which assess retrieval and generation quality, and agent metrics, which assess tool usage, goal achievement, and topic adherence. Both tiers use GPT-4o-mini as the evaluation LLM for consistent, cost-effective scoring.

RAG Evaluation Metrics
| Metric | Description | Range |
| --- | --- | --- |
| LLMContextRecall | Measures whether all relevant context was retrieved from the knowledge base. | 0.0 - 1.0 |
| LLMContextPrecision | Measures whether the retrieved context is precise and not diluted with irrelevant documents. | 0.0 - 1.0 |
| Faithfulness | Checks if the generated response is factually grounded in the retrieved context. | 0.0 - 1.0 |
| FactualCorrectness | Validates the factual accuracy of claims made in the response. | 0.0 - 1.0 |
| ResponseRelevancy | Measures how well the response addresses the original question. | 0.0 - 1.0 |
| ContextEntityRecall | Entity-level recall: did we retrieve documents mentioning the key entities? | 0.0 - 1.0 |
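The retrieval metrics above follow a recall/precision intuition. As a minimal, non-LLM sketch of that intuition (the real RAGAS metrics use an LLM judge to decide whether a claim is supported; the substring matching and function names below are illustrative stand-ins):

```python
# Toy illustration of the recall/precision idea behind LLMContextRecall and
# LLMContextPrecision. RAGAS uses LLM judgments; substring matching here is
# only a stand-in for "this claim is supported by this chunk".

def context_recall(reference_claims: list[str], retrieved_chunks: list[str]) -> float:
    """Fraction of reference claims supported by at least one retrieved chunk."""
    if not reference_claims:
        return 0.0
    supported = sum(
        any(claim.lower() in chunk.lower() for chunk in retrieved_chunks)
        for claim in reference_claims
    )
    return supported / len(reference_claims)

def context_precision(retrieved_chunks: list[str], reference_claims: list[str]) -> float:
    """Fraction of retrieved chunks that support at least one reference claim."""
    if not retrieved_chunks:
        return 0.0
    relevant = sum(
        any(claim.lower() in chunk.lower() for claim in reference_claims)
        for chunk in retrieved_chunks
    )
    return relevant / len(retrieved_chunks)

claims = ["revenue grew 12%", "margins declined"]
chunks = ["Q3 report: revenue grew 12% year over year", "Unrelated HR policy document"]
print(context_recall(claims, chunks))    # 0.5 - only the revenue claim is covered
print(context_precision(chunks, claims)) # 0.5 - one of two chunks is relevant
```

Low recall points at retrieval misses (relevant documents never fetched), while low precision points at a diluted context window; the two failure modes call for different fixes.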
Agent Evaluation Metrics
| Metric | Description | Range |
| --- | --- | --- |
| ToolCallAccuracy | Evaluates if the agent called the correct tools with the correct arguments. | 0.0 - 1.0 |
| AgentGoalAccuracyWithReference | Binary metric: did the agent achieve the user's stated goal? | 0 or 1 |
| TopicAdherenceScore | Measures if the agent stayed on financial analysis topics throughout the conversation. | 0.0 - 1.0 |
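To make the tool-call metric concrete, here is a simplified sketch of the exact-match idea behind ToolCallAccuracy: compare the agent's emitted tool calls (name plus arguments) against a reference trace. The `ToolCall` type and scoring function are assumptions for illustration; the actual RAGAS metric handles ordering and partial matches with more nuance.

```python
# Toy version of ToolCallAccuracy: score the fraction of expected tool calls
# that the agent reproduced, in order, with identical names and arguments.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # frozen (key, value) pairs so calls compare by value

def tool_call_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected calls matched positionally by the agent's calls."""
    if not expected:
        return 1.0  # nothing required, trivially accurate
    matched = sum(1 for e, a in zip(expected, actual) if e == a)
    return matched / len(expected)

expected = [ToolCall("get_stock_price", (("ticker", "AAPL"),))]
actual   = [ToolCall("get_stock_price", (("ticker", "AAPL"),))]
print(tool_call_accuracy(expected, actual))  # 1.0

wrong = [ToolCall("get_stock_price", (("ticker", "MSFT"),))]
print(tool_call_accuracy(expected, wrong))   # 0.0 - wrong argument value
```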
Test Dataset

The evaluation test set consists of two sources:

  • Manual test cases (backend/test_data/manual_test_cases.json) — Hand-crafted question/answer/context triples covering all agent capabilities.
  • Synthetic test cases — Generated using RAGAS TestsetGenerator with a knowledge-graph approach over the business knowledge base documents.
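The manual test cases can be loaded and sanity-checked before an evaluation run. The schema below (question / answer / contexts keys) is an assumption about the shape of `manual_test_cases.json`, not a documented contract; the demo writes a throwaway file rather than reading the real backend path.

```python
# Load a manual test-case file and verify each case has the fields the
# evaluators need. REQUIRED_KEYS reflects an assumed schema, not a spec.
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"question", "answer", "contexts"}

def load_test_cases(path: Path) -> list[dict]:
    cases = json.loads(path.read_text())
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"case {i} missing keys: {sorted(missing)}")
    return cases

# Demonstrate with a temporary file standing in for the real test set.
sample = [{
    "question": "What was Q3 revenue?",
    "answer": "Revenue was $1.2M.",
    "contexts": ["Q3 revenue came in at $1.2M."],
}]
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "manual_test_cases.json"
    path.write_text(json.dumps(sample))
    cases = load_test_cases(path)
print(len(cases))  # 1
```

Failing fast on malformed cases keeps schema errors out of the (slower, LLM-billed) evaluation loop.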