Dysprosium Financial Assistant

Evals Approach

Task 5: Evaluation Framework and Metrics
Evaluation Strategy

We use the RAGAS (Retrieval-Augmented Generation Assessment) framework to evaluate both the RAG pipeline and the agent system. RAGAS provides standardized, LLM-powered metrics that can be run against a test dataset to produce quantitative quality scores.

The evaluation is split into two tiers: RAG metrics, which assess retrieval and generation quality, and agent metrics, which assess tool usage, goal achievement, and topic adherence. Both tiers use GPT-4o-mini as the evaluation LLM for consistent, cost-effective scoring.

RAG Evaluation Metrics
| Metric | Description | Range |
| --- | --- | --- |
| LLMContextRecall | Measures whether all relevant context was retrieved from the knowledge base. | 0.0 - 1.0 |
| LLMContextPrecision | Measures whether the retrieved context is precise and not diluted with irrelevant documents. | 0.0 - 1.0 |
| Faithfulness | Checks if the generated response is factually grounded in the retrieved context. | 0.0 - 1.0 |
| FactualCorrectness | Validates the factual accuracy of claims made in the response. | 0.0 - 1.0 |
| ResponseRelevancy | Measures how well the response addresses the original question. | 0.0 - 1.0 |
| ContextEntityRecall | Entity-level recall: did we retrieve documents mentioning the key entities? | 0.0 - 1.0 |
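The retrieval metrics above follow a recall/precision intuition. As a minimal, non-LLM sketch of that intuition (the real RAGAS metrics use an LLM judge to decide whether a claim is supported; the substring matching and function names below are illustrative stand-ins):

```python
# Toy illustration of the recall/precision idea behind LLMContextRecall and
# LLMContextPrecision. RAGAS uses LLM judgments; substring matching here is
# only a stand-in for "this claim is supported by this chunk".

def context_recall(reference_claims: list[str], retrieved_chunks: list[str]) -> float:
    """Fraction of reference claims supported by at least one retrieved chunk."""
    if not reference_claims:
        return 0.0
    supported = sum(
        any(claim.lower() in chunk.lower() for chunk in retrieved_chunks)
        for claim in reference_claims
    )
    return supported / len(reference_claims)

def context_precision(retrieved_chunks: list[str], reference_claims: list[str]) -> float:
    """Fraction of retrieved chunks that support at least one reference claim."""
    if not retrieved_chunks:
        return 0.0
    relevant = sum(
        any(claim.lower() in chunk.lower() for claim in reference_claims)
        for chunk in retrieved_chunks
    )
    return relevant / len(retrieved_chunks)

claims = ["revenue grew 12%", "margins declined"]
chunks = ["Q3 report: revenue grew 12% year over year", "Unrelated HR policy document"]
print(context_recall(claims, chunks))    # 0.5 - only the revenue claim is covered
print(context_precision(chunks, claims)) # 0.5 - one of two chunks is relevant
```

Low recall points at retrieval misses (relevant documents never fetched), while low precision points at a diluted context window; the two failure modes call for different fixes.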
Agent Evaluation Metrics
| Metric | Description | Range |
| --- | --- | --- |
| ToolCallAccuracy | Evaluates if the agent called the correct tools with the correct arguments. | 0.0 - 1.0 |
| AgentGoalAccuracyWithReference | Binary metric: did the agent achieve the user's stated goal? | 0 or 1 |
| TopicAdherenceScore | Measures if the agent stayed on financial analysis topics throughout the conversation. | 0.0 - 1.0 |
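To make the tool-call metric concrete, here is a simplified sketch of the exact-match idea behind ToolCallAccuracy: compare the agent's emitted tool calls (name plus arguments) against a reference trace. The `ToolCall` type and scoring function are assumptions for illustration; the actual RAGAS metric handles ordering and partial matches with more nuance.

```python
# Toy version of ToolCallAccuracy: score the fraction of expected tool calls
# that the agent reproduced, in order, with identical names and arguments.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # frozen (key, value) pairs so calls compare by value

def tool_call_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected calls matched positionally by the agent's calls."""
    if not expected:
        return 1.0  # nothing required, trivially accurate
    matched = sum(1 for e, a in zip(expected, actual) if e == a)
    return matched / len(expected)

expected = [ToolCall("get_stock_price", (("ticker", "AAPL"),))]
actual   = [ToolCall("get_stock_price", (("ticker", "AAPL"),))]
print(tool_call_accuracy(expected, actual))  # 1.0

wrong = [ToolCall("get_stock_price", (("ticker", "MSFT"),))]
print(tool_call_accuracy(expected, wrong))   # 0.0 - wrong argument value
```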
Test Dataset

The evaluation test set consists of two sources:

  • Manual test cases (backend/test_data/manual_test_cases.json) — Hand-crafted question/answer/context triples covering all agent capabilities.
  • Synthetic test cases — Generated using RAGAS TestsetGenerator with a knowledge-graph approach over the business knowledge base documents.
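The manual test cases can be loaded and sanity-checked before an evaluation run. The schema below (question / answer / contexts keys) is an assumption about the shape of `manual_test_cases.json`, not a documented contract; the demo writes a throwaway file rather than reading the real backend path.

```python
# Load a manual test-case file and verify each case has the fields the
# evaluators need. REQUIRED_KEYS reflects an assumed schema, not a spec.
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"question", "answer", "contexts"}

def load_test_cases(path: Path) -> list[dict]:
    cases = json.loads(path.read_text())
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"case {i} missing keys: {sorted(missing)}")
    return cases

# Demonstrate with a temporary file standing in for the real test set.
sample = [{
    "question": "What was Q3 revenue?",
    "answer": "Revenue was $1.2M.",
    "contexts": ["Q3 revenue came in at $1.2M."],
}]
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "manual_test_cases.json"
    path.write_text(json.dumps(sample))
    cases = load_test_cases(path)
print(len(cases))  # 1
```

Failing fast on malformed cases keeps schema errors out of the (slower, LLM-billed) evaluation loop.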