A hybrid RAG (Retrieval-Augmented Generation) system combining BM25 keyword search and vector embeddings for enhanced semantic retrieval on the HF CNN/DailyMail articles dataset.
- Hybrid Search: BM25 + vector embeddings with RRF/weighted fusion
- LLM-Generated Metadata: Automatic LLM keyword extraction per article
- Local Inference: Runs entirely on Ollama
- Flexible Retrieval: Choose between `rrf`, `weighted`, `bm25-only`, or `vector-only` search
The following improvements should be considered. Each requires research, implementation, and tinkering.
- Benchmarking: Just as we have loss functions in ANN training, we need a way to measure improvement after every change to the system. A RAG system needs to be tailored to the application and knowledgebase at hand, and there are many steps in both indexing and retrieval. New methods are invented in research all the time, so promising ones should be tried continually for potential performance improvements and cost savings. RAGAS seems worth trying.
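As a lightweight starting point before a full RAGAS-style evaluation, retrieval quality can be tracked with recall@k over a small hand-labeled set of query-to-relevant-article pairs. A minimal sketch (the queries and labels below are made up for illustration):

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose top-k results contain at least one relevant doc."""
    hits = sum(1 for q, rel in relevant.items() if set(retrieved[q][:k]) & rel)
    return hits / len(relevant)

# Toy gold labels: query -> set of relevant article IDs (illustrative only)
relevant = {"election results": {"a12", "a90"}, "storm damage": {"a33"}}
retrieved = {
    "election results": ["a12", "a07", "a55"],
    "storm damage": ["a81", "a02", "a19"],
}

print(recall_at_k(retrieved, relevant, k=3))  # 1 of 2 queries hit -> 0.5
```

Re-running a fixed query set through this after every indexing or retrieval change gives the loss-function-like signal described above.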
- Remove LangChain: ChromaDB and Ollama can be used directly through their Python libraries without the extra complexity of LangChain's abstractions. The same applies if APIs are used instead of local inference.
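A minimal sketch of the LangChain-free path using the `chromadb` and `ollama` Python packages directly. The collection name and prompt template are illustrative, and the demo section assumes a running Ollama server with the project's models pulled:

```python
def build_prompt(question, contexts):
    """Pure helper: pack retrieved chunks into a grounded-answer prompt."""
    ctx = "\n\n".join(contexts)
    return f"Answer using only this context:\n\n{ctx}\n\nQuestion: {question}"

if __name__ == "__main__":
    # Imports kept inside the demo so the helper above works without the services.
    import chromadb
    import ollama

    client = chromadb.PersistentClient(path="./chroma_db")
    coll = client.get_or_create_collection("articles")

    query = "What happened in the election?"
    emb = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
    hits = coll.query(query_embeddings=[emb], n_results=5)

    prompt = build_prompt(query, hits["documents"][0])
    reply = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
    print(reply["message"]["content"])
```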
- Set up a system for storing all user queries and responses, as well as user feedback to score the responses.
- The dataset is missing dates for the news articles. Including the dates in the metadata is necessary for a knowledgebase of news articles.
- Chunking: LangChain's RecursiveCharacterTextSplitter is good for splitting text into paragraphs, but Small2Big or LLM-powered chunking may work better.
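The core of Small2Big can be sketched without any library: index small chunks for precise matching, but hand the generator their larger parent documents. The character-based splitting and sizes below are simplifications:

```python
def split_small(docs, small_size=200):
    """Cut each parent doc into small chunks, remembering the parent index."""
    chunks = []  # list of (chunk_text, parent_id)
    for parent_id, doc in enumerate(docs):
        for i in range(0, len(doc), small_size):
            chunks.append((doc[i:i + small_size], parent_id))
    return chunks

def expand_to_parents(hit_chunk_ids, chunks, docs):
    """Map retrieved small-chunk hits back to deduplicated parent docs, best first."""
    seen, parents = set(), []
    for cid in hit_chunk_ids:
        pid = chunks[cid][1]
        if pid not in seen:
            seen.add(pid)
            parents.append(docs[pid])
    return parents

docs = ["x" * 450, "y" * 100]
chunks = split_small(docs)          # doc 0 -> 3 chunks, doc 1 -> 1 chunk
print(expand_to_parents([3, 1, 0], chunks, docs))  # doc 1's text, then doc 0's
```

In this system the small chunks would be what gets embedded and BM25-indexed; only the expansion step at query time changes.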
- Indexing: Addition of Knowledge Graph indexing alongside vectorization and BM25 would be good for queries where relationships between different entities are important.
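A toy illustration of the idea: build an entity co-occurrence graph from per-article entity lists (hard-coded here, but in practice they could come from the LLM keyword-extraction step) and use it to surface related entities for query expansion:

```python
from collections import defaultdict
from itertools import combinations

def build_entity_graph(article_entities):
    """Undirected graph: two entities are linked if they appear in the same article."""
    graph = defaultdict(set)
    for entities in article_entities.values():
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

articles = {
    "a1": ["Obama", "Congress"],
    "a2": ["Congress", "Senate"],
}
graph = build_entity_graph(articles)
print(sorted(graph["Congress"]))  # ['Obama', 'Senate']
```

A real implementation would likely use a graph store or a library like NetworkX and weight the edges, but the retrieval hook is the same: expand the query with neighbors of the entities it mentions.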
- Query Classification: Some queries may not need retrieval. Detecting these and routing them directly to the main generator LLM would save compute and tokens.
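One cheap way to implement the routing is a small classification prompt to the local LLM; the heuristic fallback below is a stand-in so the sketch runs without a model (the prefix list and prompt wording are made up):

```python
NO_RETRIEVAL_PREFIXES = ("translate", "write a poem", "what is 2", "hello")

def needs_retrieval(query, llm=None):
    """Route a query: True -> run hybrid search first, False -> straight to the LLM."""
    if llm is not None:
        verdict = llm(f"Does answering this require looking up news articles? "
                      f"Reply YES or NO.\nQuery: {query}")
        return verdict.strip().upper().startswith("YES")
    # Fallback heuristic: obvious no-retrieval patterns.
    return not query.lower().startswith(NO_RETRIEVAL_PREFIXES)

print(needs_retrieval("What happened in the election?"))  # True
print(needs_retrieval("Translate 'hello' to French"))     # False
```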
- Retrieval: HyDE (generating pseudo-documents from the query and using those to enhance retrieval) could be worth a try.
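HyDE only changes *what gets embedded*, so it can be sketched independently of any particular model by passing the generator, embedder, and searcher in as callables (the prompt wording is an assumption; the stubs just show the data flow):

```python
def hyde_search(query, generate, embed, search):
    """HyDE: embed a hypothetical answer document instead of the raw query."""
    pseudo_doc = generate(
        f"Write a short news-article passage that would answer: {query}"
    )
    return search(embed(pseudo_doc))

# Stub components; real ones would call Ollama and the hybrid index.
results = hyde_search(
    "Who won the election?",
    generate=lambda prompt: "The election was won by ...",
    embed=lambda text: [float(len(text))],
    search=lambda vec: [f"doc-for-{vec[0]:.0f}"],
)
print(results)
```

The intuition: a pseudo-answer is usually closer in embedding space to real answer passages than the short question is.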
- Reranking: After producing the ranked context-document list, LLMs fine-tuned for reranking can be used to check the retrieved documents and reorder them as necessary.
- Repacking
- Summarization
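Of the steps above, repacking is small enough to sketch directly: reorder the ranked documents so the strongest sit at the edges of the prompt, mitigating the "lost in the middle" effect (this is the "sides" strategy; the exact layout is a design choice):

```python
def repack_sides(docs_ranked):
    """Alternate ranked docs to the front and back so the middle holds the weakest."""
    front, back = [], []
    for i, doc in enumerate(docs_ranked):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(repack_sides(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']: best first, second best last, weakest mid-prompt
```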
- Ollama - Install from https://ollama.com
- Python 3.10+
- uv - Fast Python package installer
Run the setup script to install all dependencies automatically (tested on Linux, but not on macOS):

```bash
./setup.sh
```

This will:
- Install Ollama and uv (if needed)
- Pull required models
- Create virtual environment
- Install Python dependencies
- Download dataset
- Validate environment
If you prefer manual installation:
```bash
# 1. Install and start Ollama
ollama serve

# 2. Pull required models
ollama pull llama3.2
ollama pull nomic-embed-text

# 3. Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

# 4. Configure environment (copy and edit .env)
cp .env.example .env
# Add your HF_TOKEN if needed

# 5. Download dataset
python data/getdata.py

# 6. Validate environment
python start.py setup
```

ChromaDB has an opt-out telemetry feature. In this version of ChromaDB it's bugged, so a batch of error messages is generated every time it tries to call PostHog. The native way of disabling telemetry through an environment variable also does not work, so it's necessary to edit the source code directly after pip installing it.
Use this script to do the edit:
```bash
python fix_chromadb_telemetry.py
```

The script modifies ChromaDB's source code in `.venv/` to replace the call to PostHog with a pass.
```bash
# Test with 100 articles
python start.py consume --test

# Index 5000 articles (default)
python start.py consume

# Custom limit and batch size
python start.py consume --limit 1000 --batch-size 50
```

This creates:
- `./chroma_db/` - Vector store (ChromaDB)
- `./bm25_index.pkl` - BM25 keyword index
```bash
# Default (hybrid search with RRF fusion)
python start.py generate --query "What happened in the election?" --k 5

# Try different fusion methods
python start.py generate --query "climate change" --fusion weighted
python start.py generate --query "breaking news" --fusion bm25-only
python start.py generate --query "politics" --fusion vector-only
```

Fusion methods:
- `rrf` - Reciprocal Rank Fusion (default, best for most cases)
- `weighted` - 30% BM25 + 70% vector (tunable)
- `bm25-only` - Pure keyword matching
- `vector-only` - Pure semantic search
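The two fusion strategies are small enough to show in full. RRF scores each document as the sum of 1/(k + rank) over the ranked lists (k = 60 is the conventional constant); the weighted variant instead mixes min-max-normalized scores. A self-contained sketch:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum of 1 / (k + rank) over each list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_fuse(bm25_scores, vec_scores, w_bm25=0.3, w_vec=0.7):
    """Weighted fusion of min-max normalized score dicts."""
    def norm(s):
        lo, hi = min(s.values()), max(s.values())
        return {d: (v - lo) / (hi - lo or 1.0) for d, v in s.items()}
    nb, nv = norm(bm25_scores), norm(vec_scores)
    docs = set(nb) | set(nv)
    fused = {d: w_bm25 * nb.get(d, 0.0) + w_vec * nv.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

print(rrf_fuse([["a", "b", "c"], ["b", "c", "d"]]))  # 'b' ranks first
```

RRF needs only rank positions, which is why it is a robust default; the weighted method needs comparable scores, hence the normalization and the tunable 30/70 split.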
```
src/
├── config.py          # Configuration management
├── data_loader.py     # Dataset loading
├── chunking.py        # Text chunking + keyword generation
├── indexing.py        # ChromaDB vector indexing
├── bm25_index.py      # BM25 keyword indexing
├── hybrid_search.py   # Hybrid retrieval with fusion
├── query.py           # RAG query interface
└── consumer.py        # Indexing pipeline orchestration
tests/                 # Pytest test suite
data/                  # CNN/DailyMail dataset
```
Edit .env to customize:
```
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text
LLM_MODEL=llama3.2
CHUNK_SIZE=800
CHUNK_OVERLAP=200
DATASET_PATH=./data/cnn_dailymail
CHROMA_DB_PATH=./chroma_db
```

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test file
pytest tests/test_chunking.py
```

```bash
# Delete indices and re-index
rm -rf ./chroma_db ./bm25_index.pkl
python start.py consume --test
```

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
General RAG
ChromaDB
Hybrid Indexing and Retrieval
BM25 for Keyword Search
- https://medium.com/@kimdoil1211/bm25-for-developers-a-guide-to-smarter-keyword-search-e6d83e8c8c8c
- https://huggingface.co/blog/xhluca/bm25s (alternative library)
Tokenizers
