# Evals Dashboard

## Task 5: RAGAS Evaluation Results
### RAG Pipeline Metrics
Baseline evaluation over the manual test dataset using the RAGAS framework, with GPT-4o-mini as the evaluator LLM:
| Metric | Score | Description |
|---|---|---|
| LLMContextRecall | 0.55 | Relevant context retrieved |
| LLMContextPrecision | 0.85 | Retrieved context is precise and relevant |
| Faithfulness | 0.55 | Response grounded in context |
| FactualCorrectness | 1.00 | Facts are accurate |
| ResponseRelevancy | 0.79 | Response addresses the question |
| ContextEntityRecall | 0.37 | Key entities in retrieved docs |
| NoiseSensitivity | 0.00 | Sensitivity to irrelevant context (lower is better) |
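To make the retrieval metrics above concrete, here is a toy illustration of what context recall and context precision measure. RAGAS computes both with an evaluator LLM (GPT-4o-mini in this run); this sketch substitutes exact matching purely to show the ratios, and all inputs are made-up examples:

```python
def context_recall(reference_claims, retrieved_claims):
    """Fraction of reference-answer claims supported by retrieved context."""
    supported = sum(1 for c in reference_claims if c in retrieved_claims)
    return supported / len(reference_claims)

def context_precision(retrieved_chunks, relevant_chunks):
    """Mean precision@k over the positions of relevant retrieved chunks."""
    hits, score = 0, 0.0
    for k, chunk in enumerate(retrieved_chunks, start=1):
        if chunk in relevant_chunks:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / max(hits, 1)

# Half of the reference claims are covered -> recall 0.5,
# mirroring how multi-hop questions drag recall down.
print(context_recall(["a", "b", "c", "d"], {"a", "b"}))  # 0.5

# Relevant chunks at ranks 1 and 3 -> (1/1 + 2/3) / 2 ≈ 0.83
print(round(context_precision(["r1", "x", "r2"], {"r1", "r2"}), 2))  # 0.83
```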
### Agent Metrics
Agent-level evaluation using RAGAS multi-turn metrics over traced agent conversations:
| Metric | Score | Description |
|---|---|---|
| ToolCallAccuracy | 0.90 | Correct tool + correct arguments |
| AgentGoalAccuracy | 0.80 | User goal achieved |
| TopicAdherenceScore | 0.95 | Stayed on financial topics |
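Tool call accuracy rewards the agent only when both the tool name and its arguments match the expected call. A minimal sketch of that scoring idea (RAGAS's `ToolCallAccuracy` works over traced multi-turn conversations; the tool names and arguments below are hypothetical):

```python
def tool_call_accuracy(expected, actual):
    """expected/actual: lists of (tool_name, args_dict) in call order.
    A call counts only if name AND arguments both match."""
    if not expected:
        return 1.0
    matches = sum(
        1 for exp, act in zip(expected, actual)
        if exp[0] == act[0] and exp[1] == act[1]
    )
    return matches / len(expected)

expected = [("get_stock_price", {"ticker": "AAPL"}),
            ("get_filings", {"ticker": "AAPL", "form": "10-K"})]
actual   = [("get_stock_price", {"ticker": "AAPL"}),
            ("get_filings", {"ticker": "AAPL", "form": "10-Q"})]

# Right tool but wrong form argument on the second call -> 0.5
print(tool_call_accuracy(expected, actual))  # 0.5
```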
### Conclusions
The baseline results on 5,087 chunks across 9 documents show perfect factual correctness (1.00) and strong context precision (0.85). Context recall (0.55) and faithfulness (0.55) reveal room for improvement — the harder multi-hop and diagnostic questions pull down these averages because relevant information is spread across multiple chunks. Context entity recall (0.37) is the weakest score, reflecting the challenge of surfacing all domain-specific entities from a large corpus. Noise sensitivity is 0.00 (lower is better), indicating the baseline's answers were not degraded by irrelevant retrieved context in this run. Agent tool call accuracy is high (0.90), confirming the playbook-driven approach effectively guides tool selection.
The test set includes both easy definitional questions (samples 1–10) and harder diagnostic/multi-step questions (samples 11–20), giving a realistic picture of production performance. The hybrid retrieval experiment (see Improvements) tests whether an improved BM25 — NLTK tokenization, score thresholding, and asymmetric RRF weighting — lifts retrieval quality on the harder questions.
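The asymmetric RRF weighting mentioned above can be sketched as follows: each retriever contributes `weight / (k + rank)` per document, so the fused ranking favors whichever retriever gets the larger weight. The weight values and document IDs here are illustrative assumptions, not the experiment's actual configuration:

```python
def rrf_fuse(dense_ranking, bm25_ranking, w_dense=0.7, w_bm25=0.3, k=60):
    """Fuse two ranked lists of doc IDs via weighted reciprocal rank fusion.
    Asymmetric weights let the dense retriever dominate ties."""
    scores = {}
    for weight, ranking in ((w_dense, dense_ranking), (w_bm25, bm25_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# d1 ranks high in both lists, so it wins; d3 is boosted by BM25's top slot.
print(rrf_fuse(["d1", "d2", "d3"], ["d3", "d1", "d4"]))  # ['d1', 'd3', 'd2', 'd4']
```

Score thresholding would simply drop BM25 candidates below a minimum score before fusion, keeping sparse noise out of the fused list.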
### LangSmith Traces
Full agent execution traces are available in LangSmith, showing every LLM call, tool invocation, and routing decision: