Dysprosium Financial Assistant

Evals Dashboard

Task 5: RAGAS Evaluation Results
RAG Pipeline Metrics

Baseline evaluation over the manual test dataset using the RAGAS framework, with GPT-4o-mini as the evaluator LLM:

| Metric | Score | Description |
| --- | --- | --- |
| LLMContextRecall | 0.55 | Relevant context retrieved |
| LLMContextPrecision | 0.85 | Retrieved context is precise and relevant |
| Faithfulness | 0.55 | Response grounded in context |
| FactualCorrectness | 1.00 | Facts are accurate |
| ResponseRelevancy | 0.79 | Response addresses the question |
| ContextEntityRecall | 0.37 | Key entities in retrieved docs |
| NoiseSensitivity | 0.00 | Robustness to irrelevant context (lower is better) |
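As a rough illustration of what a metric like context recall measures, the sketch below scores the fraction of reference-answer claims that are supported by the retrieved context. In the actual RAGAS pipeline an LLM judge (GPT-4o-mini here) decides whether each claim is supported; simple substring matching stands in for the judge, and the claims and context strings are hypothetical.

```python
# Simplified stand-in for LLM-judged context recall: the fraction of
# reference-answer claims supported by the retrieved context. RAGAS uses
# an LLM judge for the support check; substring matching stands in here.
def context_recall(reference_claims: list[str], retrieved_context: str) -> float:
    if not reference_claims:
        return 0.0
    supported = sum(
        claim.lower() in retrieved_context.lower() for claim in reference_claims
    )
    return supported / len(reference_claims)

claims = [
    "dysprosium is a rare-earth element",
    "demand is driven by permanent magnets",
]
context = "Dysprosium is a rare-earth element used in high-performance magnets."
print(context_recall(claims, context))  # 0.5: one of two claims is supported
```

A score of 0.55 on this metric therefore means that, on average, a little over half of the claims needed to answer the question were present in the retrieved chunks.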
Agent Metrics

Agent-level evaluation using RAGAS multi-turn metrics over traced agent conversations:

| Metric | Score | Description |
| --- | --- | --- |
| ToolCallAccuracy | 0.90 | Correct tool + correct arguments |
| AgentGoalAccuracy | 0.80 | User goal achieved |
| TopicAdherenceScore | 0.95 | Stayed on financial topics |
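The tool-call accuracy idea can be sketched as an exact match of tool name and arguments between the agent's traced calls and a reference sequence. RAGAS's actual multi-turn metric is computed over full conversation traces, so this is only a hedged approximation; the tool names and arguments below are made up for illustration.

```python
# Hedged approximation of tool-call accuracy: an agent call counts as
# correct only when both the tool name and the arguments match the
# reference call at the same position in the sequence.
def tool_call_accuracy(agent_calls, reference_calls) -> float:
    if not reference_calls:
        return 0.0
    correct = sum(a == r for a, r in zip(agent_calls, reference_calls))
    return correct / len(reference_calls)

reference = [
    ("get_price_history", {"ticker": "DY", "days": 30}),  # hypothetical tools
    ("summarize_filings", {"ticker": "DY"}),
]
agent = [
    ("get_price_history", {"ticker": "DY", "days": 30}),
    ("summarize_filings", {"ticker": "MP"}),              # wrong argument
]
print(tool_call_accuracy(agent, reference))  # 0.5
```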
Conclusions

The baseline results on 5,087 chunks across 9 documents show perfect factual correctness (1.00) and strong context precision (0.85). Context recall (0.55) and faithfulness (0.55) leave room for improvement: the harder multi-hop and diagnostic questions pull down these averages because the relevant information is spread across multiple chunks. Context entity recall (0.37) is the weakest score, reflecting the difficulty of surfacing every domain-specific entity from a large corpus. Noise sensitivity is 0.00 (lower is better for this metric), indicating the baseline did not incorporate irrelevant retrieved context into its answers. Agent tool call accuracy is high (0.90), confirming that the playbook-driven approach effectively guides tool selection.

The test set includes both easy definitional questions (samples 1–10) and harder diagnostic/multi-step questions (samples 11–20), giving a realistic picture of production performance. The hybrid retrieval experiment (see Improvements) tests whether an improved BM25 retriever with NLTK tokenization, score thresholding, and asymmetric RRF weighting can lift retrieval quality on the harder questions.
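The asymmetric RRF weighting mentioned above can be sketched as follows: each document's fused score is a weighted sum of reciprocal ranks from the BM25 and dense result lists, with a different weight per retriever. The weights, the `k` constant, and the document IDs below are illustrative values, not the project's actual configuration.

```python
# Asymmetric reciprocal-rank fusion: fuse two ranked lists with
# per-retriever weights. Weights and k are illustrative values only.
def rrf_fuse(bm25_ranked, dense_ranked, w_bm25=0.4, w_dense=0.6, k=60):
    scores = {}
    for weight, ranking in ((w_bm25, bm25_ranked), (w_dense, dense_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_a", "doc_b", "doc_c"]
dense = ["doc_b", "doc_c", "doc_a"]
print(rrf_fuse(bm25, dense))  # ['doc_b', 'doc_a', 'doc_c']
```

With asymmetric weights, the retriever trusted more (here the dense one, at 0.6) dominates ties: `doc_b` wins because its top dense rank outweighs `doc_a`'s top BM25 rank.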

LangSmith Traces

Full agent execution traces are available in LangSmith, showing every LLM call, tool invocation, and routing decision:
