Improvements
Task 6: Advanced Retrieval and Evaluation-Based Improvements
Advanced Retrieval Technique: Hybrid Search with Reciprocal Rank Fusion
We implement hybrid search that combines dense vector retrieval (Qdrant cosine similarity) with BM25 sparse retrieval, fused using Reciprocal Rank Fusion (RRF), with three key improvements over naive hybrid search. The retrieval mode is controlled by the ADVANCED_RETRIEVAL environment variable, enabling A/B evaluation between the baseline dense-only pipeline and the improved hybrid pipeline.
- NLTK tokenization — Instead of naive whitespace splitting, BM25 uses regex-based tokenization with Porter stemming and English stop-word removal. This lets “optimize” and “optimization” match correctly and prevents common words from flooding BM25 scores in a domain-specific financial corpus.
- BM25 score thresholding — Only BM25 candidates scoring above mean + 1 standard deviation are included, filtering out low-quality keyword matches that would otherwise dilute precision.
- Asymmetric RRF (dense 1.5×) — Dense retrieval scores are weighted 1.5× in the RRF fusion, reflecting that semantic similarity is the stronger signal for this domain. BM25 supplements rather than competes.
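The score gating and asymmetric fusion described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `gate_bm25` and `weighted_rrf` are hypothetical, the threshold is assumed to use the population standard deviation, and the 1.5× weight is assumed to multiply the dense retriever's reciprocal-rank term.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def gate_bm25(scores: Dict[str, float]) -> List[str]:
    """Keep doc ids whose BM25 score exceeds mean + 1 std, best first.

    Assumption: population standard deviation (pstdev) is used.
    """
    if not scores:
        return []
    values = list(scores.values())
    threshold = mean(values) + pstdev(values)
    kept = [doc_id for doc_id, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda doc_id: scores[doc_id], reverse=True)


def weighted_rrf(
    dense_ranked: List[str],
    bm25_ranked: List[str],
    k: int = 60,
    dense_weight: float = 1.5,
    top_n: Optional[int] = None,
) -> List[str]:
    """Fuse two ranked lists with RRF, up-weighting the dense list.

    Each list contributes weight / (k + rank), with rank starting at 1.
    """
    fused: Dict[str, float] = {}
    for rank, doc_id in enumerate(dense_ranked, start=1):
        fused[doc_id] = fused.get(doc_id, 0.0) + dense_weight / (k + rank)
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)
    return ordered if top_n is None else ordered[:top_n]


# Toy example: only "a" clears the mean + 1 std gate, then gets boosted
# because it appears in both the dense and the BM25 lists.
bm25_scores = {"a": 10.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 1.0}
bm25_kept = gate_bm25(bm25_scores)
fused_ids = weighted_rrf(["x", "a", "y"], bm25_kept)
```

If every BM25 score is identical, nothing clears the gate in this sketch and the fused ranking falls back to the dense order alone.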
Before / After Comparison
RAGAS metrics for baseline dense-only retrieval versus the improved hybrid pipeline (BM25 + Dense + RRF). Higher is better for every metric except NoiseSensitivity, where lower is better:
| Metric | Before (Dense Only) | After (Hybrid + RRF) | Delta |
|---|---|---|---|
| LLMContextRecall | 0.55 | 0.60 | +0.06 |
| LLMContextPrecision | 0.85 | 0.90 | +0.05 |
| Faithfulness | 0.55 | 0.55 | 0.00 |
| FactualCorrectness | 1.00 | 1.00 | 0.00 |
| ResponseRelevancy | 0.79 | 0.79 | 0.00 |
| ContextEntityRecall | 0.37 | 0.40 | +0.03 |
| NoiseSensitivity | 0.00 | 0.07 | +0.07 |
Implementation Details
The hybrid retrieval is implemented in backend/agents/rag_pipeline.py and controlled by the ADVANCED_RETRIEVAL environment variable:
- BM25 index: Built over the same chunked documents during ingestion using rank_bm25.BM25Okapi with NLTK-powered tokenization (Porter stemmer, English stop-word removal, regex punctuation stripping).
- Score-gated candidates: BM25 candidates scoring below mean + 1σ are filtered out. Dense retrieval returns 2×k candidates; only the high-scoring BM25 results are fused with them, which keeps low-quality keyword matches from diluting precision.
- Asymmetric RRF (k=60, dense×1.5): The dense retriever's contribution is multiplied by 1.5 in the RRF formula, giving semantic similarity the dominant weight while BM25 acts as a precision supplement for exact-term matches.
- A/B toggle: Set ADVANCED_RETRIEVAL=true to enable the improved hybrid mode, or false for baseline dense-only. Each evaluation run is registered as a unique LangSmith experiment for traceability.
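One way the A/B toggle could be read is sketched below. The helper names and the mode labels "hybrid" and "dense" are hypothetical, and the sketch assumes the variable is matched case-insensitively and that an unset variable means baseline dense-only; the actual handling in backend/agents/rag_pipeline.py may differ.

```python
import os
from typing import Mapping, Optional


def advanced_retrieval_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when ADVANCED_RETRIEVAL is set to "true".

    Assumptions: comparison ignores case and surrounding whitespace,
    and a missing variable defaults to the dense-only baseline.
    """
    source = os.environ if env is None else env
    return source.get("ADVANCED_RETRIEVAL", "false").strip().lower() == "true"


def retrieval_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Map the toggle to a (hypothetical) mode label for the pipeline."""
    return "hybrid" if advanced_retrieval_enabled(env) else "dense"


# Passing an explicit mapping keeps the example independent of the real environment.
mode = retrieval_mode({"ADVANCED_RETRIEVAL": "true"})
```

Accepting an explicit mapping makes the toggle easy to exercise in tests without mutating the process environment.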