Vestr RAG System

[Architecture diagram: RAG_architecture.png]

A hybrid RAG (Retrieval-Augmented Generation) system combining BM25 keyword search and vector embeddings for enhanced semantic retrieval on the HF CNN/DailyMail articles dataset.

Features

  • Hybrid Search: BM25 + vector embeddings with RRF/weighted fusion
  • LLM-Generated Metadata: Automatic LLM keyword extraction per article
  • Local Inference: Runs entirely on Ollama
  • Flexible Retrieval: Choose between rrf, weighted, BM25-only, or vector-only search

Limitations and Future Directions

The following directions should be considered; each requires research, implementation, and tinkering.

  • Benchmarking: Just as loss functions guide ANN training, we need a way to measure improvement after every change to the system. A RAG system must be tailored to the application and knowledge base at hand, and there are many steps in both indexing and retrieval. New methods appear in research constantly, so promising ones should be tried continually for potential performance improvements and cost savings. RAGAS looks like a good framework to start with.
  • Remove LangChain: ChromaDB and Ollama can be used directly through their Python libraries, avoiding the extra complexity of LangChain's abstractions. The same applies if hosted APIs are used instead of local inference.
  • Query Logging: Set up a system for storing all user queries and responses, along with user feedback for scoring the responses.
  • Dates: The dataset is missing publication dates for the news articles. Including dates in the metadata is essential for a knowledge base of news articles.
  • Chunking: LangChain's RecursiveCharacterTextSplitter is good for splitting text into paragraphs, but it could be better to try Small2Big or LLM-powered chunking.
  • Indexing: Addition of Knowledge Graph indexing alongside vectorization and BM25 would be good for queries where relationships between different entities are important.
  • Query Classification: Some queries may not need retrieval at all. Detecting these and routing them directly to the main generator LLM would save compute and tokens.
  • Retrieval: HyDE (generating a hypothetical pseudo-document from the query and using its embedding to enhance retrieval) could be worth a try.
  • Reranking: After producing the ranked list of context documents, LLMs fine-tuned for reranking can re-score the candidates and reorder them as needed.
  • Repacking: Reorder the retrieved chunks before passing them to the LLM, e.g. placing the most relevant ones nearest the question.
  • Summarization: Compress the retrieved context before generation so more evidence fits in the prompt.
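
Before reaching for a full framework like RAGAS, the benchmarking point above can start very small: a frozen set of queries with hand-labeled relevant article IDs, scored with recall@k after every change. A minimal sketch — the benchmark pairs and the retrieve function here are hypothetical stand-ins for the project's own retriever:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def evaluate(benchmark, retrieve, k=5):
    """Average recall@k over a fixed benchmark of (query, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in benchmark]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Hypothetical labeled queries; in practice, label them against the real corpus.
    benchmark = [
        ("election results", ["a12", "a90"]),
        ("climate policy", ["a33"]),
    ]
    fake_retrieve = lambda q: {"election results": ["a12", "a07", "a90"],
                               "climate policy": ["a51", "a33"]}[q]
    print(evaluate(benchmark, fake_retrieve, k=5))  # prints 1.0
```

Re-running this after each indexing or retrieval change gives a single number to compare, which is the habit RAGAS then extends to generation quality.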
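
The query-logging idea above fits comfortably in SQLite via the standard library, with no server to run. A sketch under assumed requirements — this table layout is illustrative, not the project's schema:

```python
import sqlite3
import time

def open_log(path="feedback.db"):
    """Create (if needed) and open the query/response/feedback log."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS interactions (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            ts       REAL NOT NULL,
            query    TEXT NOT NULL,
            response TEXT NOT NULL,
            score    INTEGER        -- user feedback, e.g. -1/0/+1; NULL until rated
        )""")
    return conn

def log_interaction(conn, query, response):
    """Store one query/response pair; returns its row id for later feedback."""
    cur = conn.execute(
        "INSERT INTO interactions (ts, query, response) VALUES (?, ?, ?)",
        (time.time(), query, response))
    conn.commit()
    return cur.lastrowid

def record_feedback(conn, interaction_id, score):
    """Attach a user score to a previously logged interaction."""
    conn.execute("UPDATE interactions SET score = ? WHERE id = ?",
                 (score, interaction_id))
    conn.commit()
```

Logged pairs with scores double as labeled data for the benchmarking and reranking directions above.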

Prerequisites

  1. Ollama - Install from https://ollama.com
  2. Python 3.10+
  3. uv - Fast Python package installer

Setup

Automated Setup (Linux/macOS)

Run the setup script to install all dependencies automatically (tested on Linux, not yet on macOS):

./setup.sh

This will:

  • Install Ollama and uv (if needed)
  • Pull required models
  • Create virtual environment
  • Install Python dependencies
  • Download dataset
  • Validate environment

Manual Setup

If you prefer manual installation:

# 1. Install and start Ollama
ollama serve

# 2. Pull required models
ollama pull llama3.2
ollama pull nomic-embed-text

# 3. Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

# 4. Configure environment (copy and edit .env)
cp .env.example .env
# Add your HF_TOKEN if needed

# 5. Download dataset
python data/getdata.py

# 6. Validate environment
python start.py setup

ChromaDB Telemetry Issue

ChromaDB has an opt-out telemetry feature. In this version of ChromaDB it is bugged, so a stream of error messages is generated every time it tries to use PostHog. The documented way of disabling telemetry through an environment variable also does not work, so the installed source code must be edited directly after pip installing.

Use this script to do the edit:

python fix_chromadb_telemetry.py

The script modifies ChromaDB's source code in .venv/ to replace the call to PostHog with a pass.
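
For reference, a patch of this kind can be a small text substitution over the installed file. This is only a sketch of the idea, not the actual fix_chromadb_telemetry.py: the pattern assumes the offending call looks like `self._posthog.capture(...)` on a single line, which must be checked against the installed ChromaDB version:

```python
import re
from pathlib import Path

def neutralize_posthog(source_path):
    """Replace a single-line PostHog capture call with `pass`, keeping indentation.

    Assumes the call is written as `self._posthog.capture(...)` on one line;
    multi-line calls would need a different pattern.
    """
    path = Path(source_path)
    text = path.read_text()
    patched = re.sub(r"(^\s*)self\._posthog\.capture\(.*\)\s*$",
                     r"\1pass", text, flags=re.MULTILINE)
    path.write_text(patched)
    return patched
```

Editing a package in-place is fragile (any reinstall undoes it), which is why the repo ships it as a repeatable script rather than a manual step.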

Usage

Index Data (Consume Mode)

# Test with 100 articles
python start.py consume --test

# Index 5000 articles (default)
python start.py consume

# Custom limit and batch size
python start.py consume --limit 1000 --batch-size 50

This creates:

  • ./chroma_db/ - Vector store (ChromaDB)
  • ./bm25_index.pkl - BM25 keyword index

Query (Generate Mode)

# Default (hybrid search with RRF fusion)
python start.py generate --query "What happened in the election?" --k 5

# Try different fusion methods
python start.py generate --query "climate change" --fusion weighted
python start.py generate --query "breaking news" --fusion bm25-only
python start.py generate --query "politics" --fusion vector-only

Fusion methods:

  • rrf - Reciprocal Rank Fusion (default, best for most cases)
  • weighted - 30% BM25 + 70% vector (tunable)
  • bm25-only - Pure keyword matching
  • vector-only - Pure semantic search
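
The two fusion modes can be illustrated in a few lines of pure Python. This is a generic sketch of the techniques, not the code in src/hybrid_search.py; the constant k=60 comes from the original RRF paper:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    `rankings` is a list of ranked doc-ID lists (e.g. BM25 order, vector order).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_fuse(bm25_scores, vector_scores, w_bm25=0.3, w_vec=0.7):
    """Weighted fusion: min-max normalize each score dict, then combine 30/70."""
    def norm(d):
        lo, hi = min(d.values()), max(d.values())
        return {doc: (s - lo) / (hi - lo) if hi > lo else 0.0 for doc, s in d.items()}
    nb, nv = norm(bm25_scores), norm(vector_scores)
    ids = set(nb) | set(nv)
    fused = {i: w_bm25 * nb.get(i, 0.0) + w_vec * nv.get(i, 0.0) for i in ids}
    return sorted(fused, key=fused.get, reverse=True)
```

RRF only needs ranks, so it is robust to the incomparable score scales of BM25 and cosine similarity; weighted fusion needs the normalization step but makes the keyword/semantic trade-off tunable.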

Project Structure

src/
├── config.py          # Configuration management
├── data_loader.py     # Dataset loading
├── chunking.py        # Text chunking + keyword generation
├── indexing.py        # ChromaDB vector indexing
├── bm25_index.py      # BM25 keyword indexing
├── hybrid_search.py   # Hybrid retrieval with fusion
├── query.py           # RAG query interface
└── consumer.py        # Indexing pipeline orchestration

tests/                 # Pytest test suite
data/                  # CNN/DailyMail dataset

Configuration

Edit .env to customize:

OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text
LLM_MODEL=llama3.2
CHUNK_SIZE=800
CHUNK_OVERLAP=200
DATASET_PATH=./data/cnn_dailymail
CHROMA_DB_PATH=./chroma_db
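
To make CHUNK_SIZE and CHUNK_OVERLAP concrete: each chunk is up to 800 characters and consecutive chunks share 200, so a new chunk starts every 600 characters. A naive character-window sketch of that arithmetic — the project actually uses LangChain's RecursiveCharacterTextSplitter, which additionally prefers paragraph and sentence boundaries:

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=200):
    """Naive character-window chunking: window advances by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk size")
    # Stop once the remaining tail is already covered by the previous chunk's overlap.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With these defaults, a 1000-character article yields two chunks whose last/first 200 characters coincide, keeping sentences near chunk boundaries visible to both embeddings.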

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test file
pytest tests/test_chunking.py

Reset Indices

# Delete indices and re-index
rm -rf ./chroma_db ./bm25_index.pkl
python start.py consume --test

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
