A hybrid RAG (Retrieval-Augmented Generation) system combining BM25 keyword search and vector embeddings for enhanced semantic retrieval on the HF CNN/DailyMail articles dataset.
- Hybrid Search: BM25 + vector embeddings with RRF/weighted fusion
- LLM-Generated Metadata: Automatic LLM keyword extraction per article
- Local Inference: Runs entirely on Ollama
- Flexible Retrieval: Choose between `rrf`, `weighted`, `bm25-only`, or `vector-only` search
The following improvements should be considered. Each requires research, implementation, and tinkering.
- Benchmarking: Just as we have loss functions in ANN training, we need a way to measure improvement after every change to the system. A RAG system needs to be tailored to the application and knowledgebase at hand, and there are many steps in both indexing and retrieval. New methods are invented in research all the time, so promising ones should be tried continually for potential performance improvements and cost savings. RAGAS seems worth trying.
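As a lightweight starting point before a full RAGAS-style evaluation, retrieval quality can be tracked with recall@k over a small hand-labeled set of query-to-relevant-article pairs. A minimal sketch (the queries and labels below are made up for illustration):

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose top-k results contain at least one relevant doc."""
    hits = sum(1 for q, rel in relevant.items() if set(retrieved[q][:k]) & rel)
    return hits / len(relevant)

# Toy gold labels: query -> set of relevant article IDs (illustrative only)
relevant = {"election results": {"a12", "a90"}, "storm damage": {"a33"}}
retrieved = {
    "election results": ["a12", "a07", "a55"],
    "storm damage": ["a81", "a02", "a19"],
}

print(recall_at_k(retrieved, relevant, k=3))  # 1 of 2 queries hit -> 0.5
```

Re-running a fixed query set through this after every indexing or retrieval change gives the loss-function-like signal described above.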
- Remove LangChain: ChromaDB and Ollama can be used directly through their Python libraries without the extra complexity of LangChain's abstractions. The same applies if APIs are used instead of local inference.
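A minimal sketch of the LangChain-free path using the `chromadb` and `ollama` Python packages directly. The collection name and prompt template are illustrative, and the demo section assumes a running Ollama server with the project's models pulled:

```python
def build_prompt(question, contexts):
    """Pure helper: pack retrieved chunks into a grounded-answer prompt."""
    ctx = "\n\n".join(contexts)
    return f"Answer using only this context:\n\n{ctx}\n\nQuestion: {question}"

if __name__ == "__main__":
    # Imports kept inside the demo so the helper above works without the services.
    import chromadb
    import ollama

    client = chromadb.PersistentClient(path="./chroma_db")
    coll = client.get_or_create_collection("articles")

    query = "What happened in the election?"
    emb = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
    hits = coll.query(query_embeddings=[emb], n_results=5)

    prompt = build_prompt(query, hits["documents"][0])
    reply = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
    print(reply["message"]["content"])
```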
- Set up a system for storing all user queries and responses, as well as user feedback to score the responses.
- The dataset is missing dates for the news articles. Including the dates in the metadata is necessary for a knowledgebase of news articles.
- Chunking: LangChain's RecursiveCharacterTextSplitter is good for splitting text into paragraphs, but Small2Big or LLM-powered chunking may work better.
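The core of Small2Big can be sketched without any library: index small chunks for precise matching, but hand the generator their larger parent documents. The character-based splitting and sizes below are simplifications:

```python
def split_small(docs, small_size=200):
    """Cut each parent doc into small chunks, remembering the parent index."""
    chunks = []  # list of (chunk_text, parent_id)
    for parent_id, doc in enumerate(docs):
        for i in range(0, len(doc), small_size):
            chunks.append((doc[i:i + small_size], parent_id))
    return chunks

def expand_to_parents(hit_chunk_ids, chunks, docs):
    """Map retrieved small-chunk hits back to deduplicated parent docs, best first."""
    seen, parents = set(), []
    for cid in hit_chunk_ids:
        pid = chunks[cid][1]
        if pid not in seen:
            seen.add(pid)
            parents.append(docs[pid])
    return parents

docs = ["x" * 450, "y" * 100]
chunks = split_small(docs)          # doc 0 -> 3 chunks, doc 1 -> 1 chunk
print(expand_to_parents([3, 1, 0], chunks, docs))  # doc 1's text, then doc 0's
```

In this system the small chunks would be what gets embedded and BM25-indexed; only the expansion step at query time changes.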
- Indexing: Addition of Knowledge Graph indexing alongside vectorization and BM25 would be good for queries where relationships between different entities are important.
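A toy illustration of the idea: build an entity co-occurrence graph from per-article entity lists (hard-coded here, but in practice they could come from the LLM keyword-extraction step) and use it to surface related entities for query expansion:

```python
from collections import defaultdict
from itertools import combinations

def build_entity_graph(article_entities):
    """Undirected graph: two entities are linked if they appear in the same article."""
    graph = defaultdict(set)
    for entities in article_entities.values():
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

articles = {
    "a1": ["Obama", "Congress"],
    "a2": ["Congress", "Senate"],
}
graph = build_entity_graph(articles)
print(sorted(graph["Congress"]))  # ['Obama', 'Senate']
```

A real implementation would likely use a graph store or a library like NetworkX and weight the edges, but the retrieval hook is the same: expand the query with neighbors of the entities it mentions.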
- Query Classification: Some queries may not need retrieval. Detecting these and routing them directly to the main generator LLM would save compute and tokens.
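One cheap way to implement the routing is a small classification prompt to the local LLM; the heuristic fallback below is a stand-in so the sketch runs without a model (the prefix list and prompt wording are made up):

```python
NO_RETRIEVAL_PREFIXES = ("translate", "write a poem", "what is 2", "hello")

def needs_retrieval(query, llm=None):
    """Route a query: True -> run hybrid search first, False -> straight to the LLM."""
    if llm is not None:
        verdict = llm(f"Does answering this require looking up news articles? "
                      f"Reply YES or NO.\nQuery: {query}")
        return verdict.strip().upper().startswith("YES")
    # Fallback heuristic: obvious no-retrieval patterns.
    return not query.lower().startswith(NO_RETRIEVAL_PREFIXES)

print(needs_retrieval("What happened in the election?"))  # True
print(needs_retrieval("Translate 'hello' to French"))     # False
```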
- Retrieval: HyDE (generating pseudo-documents from the query and using those to enhance retrieval) could be worth a try.
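HyDE only changes *what gets embedded*, so it can be sketched independently of any particular model by passing the generator, embedder, and searcher in as callables (the prompt wording is an assumption; the stubs just show the data flow):

```python
def hyde_search(query, generate, embed, search):
    """HyDE: embed a hypothetical answer document instead of the raw query."""
    pseudo_doc = generate(
        f"Write a short news-article passage that would answer: {query}"
    )
    return search(embed(pseudo_doc))

# Stub components; real ones would call Ollama and the hybrid index.
results = hyde_search(
    "Who won the election?",
    generate=lambda prompt: "The election was won by ...",
    embed=lambda text: [float(len(text))],
    search=lambda vec: [f"doc-for-{vec[0]:.0f}"],
)
print(results)
```

The intuition: a pseudo-answer is usually closer in embedding space to real answer passages than the short question is.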
- Reranking: After producing the ranked context-document list, LLMs fine-tuned for reranking can be used to check the retrieved documents and reorder them as necessary.
- Repacking
- Summarization
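Of the steps above, repacking is small enough to sketch directly: reorder the ranked documents so the strongest sit at the edges of the prompt, mitigating the "lost in the middle" effect (this is the "sides" strategy; the exact layout is a design choice):

```python
def repack_sides(docs_ranked):
    """Alternate ranked docs to the front and back so the middle holds the weakest."""
    front, back = [], []
    for i, doc in enumerate(docs_ranked):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(repack_sides(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']: best first, second best last, weakest mid-prompt
```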
- Ollama - Install from https://ollama.com
- Python 3.10+
- uv - Fast Python package installer
Run the setup script to install all dependencies automatically (tested on Linux, but not on macOS):

```bash
./setup.sh
```

This will:
- Install Ollama and uv (if needed)
- Pull required models
- Create virtual environment
- Install Python dependencies
- Download dataset
- Validate environment
If you prefer manual installation:
```bash
# 1. Install and start Ollama
ollama serve

# 2. Pull required models
ollama pull llama3.2
ollama pull nomic-embed-text

# 3. Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

# 4. Configure environment (copy and edit .env)
cp .env.example .env
# Add your HF_TOKEN if needed

# 5. Download dataset
python data/getdata.py

# 6. Validate environment
python start.py setup
```

ChromaDB has an opt-out telemetry feature. In this version of ChromaDB it's bugged, so a batch of error messages is generated every time it tries to call PostHog. The native way of disabling telemetry through an environment variable also does not work, so it's necessary to edit the source code directly after pip installing it.
Use this script to do the edit:
```bash
python fix_chromadb_telemetry.py
```

The script modifies ChromaDB's source code in `.venv/` to replace the call to PostHog with a pass.
```bash
# Test with 100 articles
python start.py consume --test

# Index 5000 articles (default)
python start.py consume

# Custom limit and batch size
python start.py consume --limit 1000 --batch-size 50
```

This creates:
- `./chroma_db/` - Vector store (ChromaDB)
- `./bm25_index.pkl` - BM25 keyword index
```bash
# Default (hybrid search with RRF fusion)
python start.py generate --query "What happened in the election?" --k 5

# Try different fusion methods
python start.py generate --query "climate change" --fusion weighted
python start.py generate --query "breaking news" --fusion bm25-only
python start.py generate --query "politics" --fusion vector-only
```

Fusion methods:
- `rrf` - Reciprocal Rank Fusion (default, best for most cases)
- `weighted` - 30% BM25 + 70% vector (tunable)
- `bm25-only` - Pure keyword matching
- `vector-only` - Pure semantic search
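The two fusion strategies are small enough to show in full. RRF scores each document as the sum of 1/(k + rank) over the ranked lists (k = 60 is the conventional constant); the weighted variant instead mixes min-max-normalized scores. A self-contained sketch:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum of 1 / (k + rank) over each list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_fuse(bm25_scores, vec_scores, w_bm25=0.3, w_vec=0.7):
    """Weighted fusion of min-max normalized score dicts."""
    def norm(s):
        lo, hi = min(s.values()), max(s.values())
        return {d: (v - lo) / (hi - lo or 1.0) for d, v in s.items()}
    nb, nv = norm(bm25_scores), norm(vec_scores)
    docs = set(nb) | set(nv)
    fused = {d: w_bm25 * nb.get(d, 0.0) + w_vec * nv.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

print(rrf_fuse([["a", "b", "c"], ["b", "c", "d"]]))  # 'b' ranks first
```

RRF needs only rank positions, which is why it is a robust default; the weighted method needs comparable scores, hence the normalization and the tunable 30/70 split.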
```
src/
├── config.py          # Configuration management
├── data_loader.py     # Dataset loading
├── chunking.py        # Text chunking + keyword generation
├── indexing.py        # ChromaDB vector indexing
├── bm25_index.py      # BM25 keyword indexing
├── hybrid_search.py   # Hybrid retrieval with fusion
├── query.py           # RAG query interface
└── consumer.py        # Indexing pipeline orchestration
tests/                 # Pytest test suite
data/                  # CNN/DailyMail dataset
```
Edit .env to customize:
```
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text
LLM_MODEL=llama3.2
CHUNK_SIZE=800
CHUNK_OVERLAP=200
DATASET_PATH=./data/cnn_dailymail
CHROMA_DB_PATH=./chroma_db
```

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test file
pytest tests/test_chunking.py
```

```bash
# Delete indices and re-index
rm -rf ./chroma_db ./bm25_index.pkl
python start.py consume --test
```

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
General RAG
ChromaDB
Hybrid Indexing and Retrieval
BM25 for Keyword Search
- https://medium.com/@kimdoil1211/bm25-for-developers-a-guide-to-smarter-keyword-search-e6d83e8c8c8c
- https://huggingface.co/blog/xhluca/bm25s (alternative library)
Tokenizers
