# Evals Dashboard

## Task 5: RAGAS Evaluation Results
### RAG Pipeline Metrics
Baseline evaluation over the manual test dataset using the RAGAS framework, with GPT-4o-mini as the evaluator LLM:
| Metric | Score | Description |
|---|---|---|
| LLMContextRecall | 0.55 | Relevant context retrieved |
| LLMContextPrecision | 0.85 | Retrieved context is precise and relevant |
| Faithfulness | 0.55 | Response grounded in context |
| FactualCorrectness | 1.00 | Facts are accurate |
| ResponseRelevancy | 0.79 | Response addresses the question |
| ContextEntityRecall | 0.37 | Key entities in retrieved docs |
| NoiseSensitivity | 0.00 | Sensitivity to irrelevant context (lower is better) |
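To make the retrieval metrics above concrete, here is a toy illustration of what context recall and context precision measure. RAGAS computes both with an evaluator LLM (GPT-4o-mini in this run); this sketch substitutes exact matching purely to show the ratios, and all inputs are made-up examples:

```python
def context_recall(reference_claims, retrieved_claims):
    """Fraction of reference-answer claims supported by retrieved context."""
    supported = sum(1 for c in reference_claims if c in retrieved_claims)
    return supported / len(reference_claims)

def context_precision(retrieved_chunks, relevant_chunks):
    """Mean precision@k over the positions of relevant retrieved chunks."""
    hits, score = 0, 0.0
    for k, chunk in enumerate(retrieved_chunks, start=1):
        if chunk in relevant_chunks:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / max(hits, 1)

# Half of the reference claims are covered -> recall 0.5,
# mirroring how multi-hop questions drag recall down.
print(context_recall(["a", "b", "c", "d"], {"a", "b"}))  # 0.5

# Relevant chunks at ranks 1 and 3 -> (1/1 + 2/3) / 2 ≈ 0.83
print(round(context_precision(["r1", "x", "r2"], {"r1", "r2"}), 2))  # 0.83
```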
### Agent Metrics
Agent-level evaluation using RAGAS multi-turn metrics over traced agent conversations:
| Metric | Score | Description |
|---|---|---|
| ToolCallAccuracy | 0.90 | Correct tool + correct arguments |
| AgentGoalAccuracy | 0.80 | User goal achieved |
| TopicAdherenceScore | 0.95 | Stayed on financial topics |
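Tool call accuracy rewards the agent only when both the tool name and its arguments match the expected call. A minimal sketch of that scoring idea (RAGAS's `ToolCallAccuracy` works over traced multi-turn conversations; the tool names and arguments below are hypothetical):

```python
def tool_call_accuracy(expected, actual):
    """expected/actual: lists of (tool_name, args_dict) in call order.
    A call counts only if name AND arguments both match."""
    if not expected:
        return 1.0
    matches = sum(
        1 for exp, act in zip(expected, actual)
        if exp[0] == act[0] and exp[1] == act[1]
    )
    return matches / len(expected)

expected = [("get_stock_price", {"ticker": "AAPL"}),
            ("get_filings", {"ticker": "AAPL", "form": "10-K"})]
actual   = [("get_stock_price", {"ticker": "AAPL"}),
            ("get_filings", {"ticker": "AAPL", "form": "10-Q"})]

# Right tool but wrong form argument on the second call -> 0.5
print(tool_call_accuracy(expected, actual))  # 0.5
```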
### Conclusions
The baseline results on 5,087 chunks across 9 documents show perfect factual correctness (1.00) and strong context precision (0.85). Context recall (0.55) and faithfulness (0.55) reveal room for improvement — the harder multi-hop and diagnostic questions pull down these averages because relevant information is spread across multiple chunks. Context entity recall (0.37) is the weakest score, reflecting the challenge of surfacing all domain-specific entities from a large corpus. Noise sensitivity is 0.00 (lower is better), indicating the baseline's answers were not degraded by irrelevant retrieved context in this run. Agent tool call accuracy is high (0.90), confirming the playbook-driven approach effectively guides tool selection.
The test set includes both easy definitional questions (samples 1–10) and harder diagnostic/multi-step questions (samples 11–20), giving a realistic picture of production performance. The hybrid retrieval experiment (see Improvements) tests whether an improved BM25 — NLTK tokenization, score thresholding, and asymmetric RRF weighting — lifts retrieval quality on the harder questions.
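The asymmetric RRF weighting mentioned above can be sketched as follows: each retriever contributes `weight / (k + rank)` per document, so the fused ranking favors whichever retriever gets the larger weight. The weight values and document IDs here are illustrative assumptions, not the experiment's actual configuration:

```python
def rrf_fuse(dense_ranking, bm25_ranking, w_dense=0.7, w_bm25=0.3, k=60):
    """Fuse two ranked lists of doc IDs via weighted reciprocal rank fusion.
    Asymmetric weights let the dense retriever dominate ties."""
    scores = {}
    for weight, ranking in ((w_dense, dense_ranking), (w_bm25, bm25_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# d1 ranks high in both lists, so it wins; d3 is boosted by BM25's top slot.
print(rrf_fuse(["d1", "d2", "d3"], ["d3", "d1", "d4"]))  # ['d1', 'd3', 'd2', 'd4']
```

Score thresholding would simply drop BM25 candidates below a minimum score before fusion, keeping sparse noise out of the fused list.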
### LangSmith Traces
Full agent execution traces are available in LangSmith, showing every LLM call, tool invocation, and routing decision: