Improvements

Task 6: Advanced Retrieval and Evaluation-Based Improvements

Advanced Retrieval Technique: Hybrid Search with Reciprocal Rank Fusion

We implement hybrid search that combines dense vector retrieval (Qdrant cosine similarity) with BM25 sparse retrieval. The two result lists are fused using Reciprocal Rank Fusion (RRF), with three improvements over naive hybrid search. The retrieval mode is controlled by the ADVANCED_RETRIEVAL environment variable, enabling A/B evaluation between the baseline dense-only pipeline and the improved hybrid pipeline.

NLTK tokenization: Instead of naive whitespace splitting, BM25 uses regex-based tokenization with Porter stemming and English stop-word removal. Stemming lets “optimize” and “optimization” match each other, and stop-word removal keeps common words from dominating BM25 scores in a domain-specific financial corpus.
BM25 score thresholding: Only BM25 candidates scoring above mean + 1 standard deviation are included, filtering out low-quality keyword matches that would otherwise dilute precision.
Asymmetric RRF (dense 1.5×): The dense contribution to each RRF score is weighted 1.5×, reflecting that semantic similarity is the stronger signal for this domain. BM25 supplements dense retrieval rather than competing with it.
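
The score-thresholding step can be sketched in a few lines. This is a minimal illustration, not the code in the pipeline: the function name gate_bm25_candidates and the sample chunk IDs are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is used.

```python
from statistics import mean, pstdev


def gate_bm25_candidates(scores: dict[str, float]) -> dict[str, float]:
    """Keep only documents whose BM25 score is strictly above mean + 1 std."""
    if not scores:
        return {}
    values = list(scores.values())
    # Assumption: population standard deviation; the source does not specify.
    threshold = mean(values) + pstdev(values)
    return {doc_id: s for doc_id, s in scores.items() if s > threshold}


# Hypothetical BM25 scores for five chunks: one strong match, four weak ones.
scores = {"chunk-a": 9.0, "chunk-b": 1.0, "chunk-c": 1.0, "chunk-d": 1.0, "chunk-e": 0.0}
kept = gate_bm25_candidates(scores)
print(kept)  # {'chunk-a': 9.0}
```

Note that when every candidate has the same score, the standard deviation is zero and nothing passes the strict comparison, so in this sketch fusion would fall back to the dense results alone.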

Before / After Comparison

RAGAS metrics comparison between baseline dense-only retrieval and improved hybrid (BM25 + Dense + RRF):

Metric	Before (Dense Only)	After (Hybrid + RRF)	Delta
LLMContextRecall	0.55	0.60	+0.06
LLMContextPrecision	0.85	0.90	+0.05
Faithfulness	0.55	0.55	0.00
FactualCorrectness	1.00	1.00	0.00
ResponseRelevancy	0.79	0.79	0.00
ContextEntityRecall	0.37	0.40	+0.03
NoiseSensitivity	0.00	0.07	+0.07

Implementation Details

Hybrid retrieval is implemented in backend/agents/rag_pipeline.py and controlled by the ADVANCED_RETRIEVAL environment variable:

BM25 index: Built over the same chunked documents during ingestion using rank_bm25.BM25Okapi with NLTK-powered tokenization (Porter stemmer, English stop-word removal, regex punctuation stripping).
Score-gated candidates: Dense retrieval fetches 2×k candidates. BM25 candidates that do not score above mean + 1σ are dropped before fusion, so low-quality keyword matches do not dilute precision.
Asymmetric RRF (k=60, dense×1.5): With the RRF constant k=60, the dense term in each fused score is multiplied by 1.5. Semantic similarity therefore carries the dominant weight, while BM25 acts as a precision supplement for exact-term matches.
A/B toggle: Set ADVANCED_RETRIEVAL=true to enable the improved hybrid mode, or ADVANCED_RETRIEVAL=false for baseline dense-only retrieval. Each evaluation run is registered as a unique LangSmith experiment for traceability.
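
The fusion and toggle described above can be sketched as follows. This is an illustration under stated assumptions rather than the contents of rag_pipeline.py: the function names, the 1-based ranks, the BM25 weight of 1.0, and the default of "false" when the variable is unset are all hypothetical choices. Each document's fused score is the sum over retrievers of weight / (k + rank).

```python
import os


def advanced_retrieval_enabled(env=None) -> bool:
    """Read the ADVANCED_RETRIEVAL toggle (assumed default: disabled)."""
    env = os.environ if env is None else env
    return env.get("ADVANCED_RETRIEVAL", "false").strip().lower() == "true"


def weighted_rrf(dense_ranked, bm25_ranked, k=60, dense_weight=1.5,
                 bm25_weight=1.0, top_n=5):
    """Fuse two ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    fused = {}
    for weight, ranking in ((dense_weight, dense_ranked), (bm25_weight, bm25_ranked)):
        # Ranks are assumed to start at 1 for the best hit in each list.
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)[:top_n]


# Hypothetical ranked results: "c" is found by both retrievers.
dense_hits = ["a", "b", "c"]
bm25_hits = ["c", "d"]  # already score-gated
print(weighted_rrf(dense_hits, bm25_hits))  # ['c', 'a', 'b', 'd']
```

In this example "c" rises to the top because it appears in both lists, while the 1.5 weight keeps the dense-only hits "a" and "b" ahead of the BM25-only hit "d".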