noor-malaika/raiq

RAIQ

A production-grade project showcasing a full retrieval-augmented generation (RAG) system with a semantic cache for language model APIs. It demonstrates:

  • end-to-end document ingestion and preprocessing,
  • indexing workflows that chunk, embed, and store text in a vector database,
  • a FastAPI-based inference service with optional two-layer Redis caching, semantic retrieval, and pluggable LLM backends,
  • quantization utilities for optimizing local models,
  • and a robust observability stack (metrics, logs, tracing) paired with Docker Compose orchestration.

Development is centered around modular Python packages under src/ that correspond to each major functional area.

Project Overview

This codebase is built to showcase a robust, scalable architecture for caching and serving embeddings and LLM responses with features such as:

  • Document ingestion with cleaning and chunking pipelines.
  • Semantic indexing into a Milvus vector store, with Redis backing the cache.
  • Retrieval-augmented generation (RAG) workflows.
  • Inference server with pluggable backends (OpenAI, local models, etc.).
  • Quantization utilities for model optimization.
  • Observability via Prometheus, Grafana, and Loki.
  • Comprehensive configuration and environment handling.

Repository Structure

The repository is organized into the following high-level packages:

src/
  cache/             # Core cache and client logic
  indexing/          # Chunking, embedding, validation, and vector store code
  inference/         # API server and backend abstractions
  ingestion/         # Data ingestion pipelines (cleaners and fetchers)
  observability/     # Logging, metrics, tracing support
  quantization/      # Model quantization and benchmarking
  rag/               # Retrieval-augmented generation utilities

Additional top-level items such as the data/ directory, docker-compose.yml, and pyproject.toml support deployment, testing, and packaging.

Getting Started

Prerequisites

  • Python 3.10+
  • uv, or another virtual environment manager such as venv with pip
  • Redis (for the semantic cache; Milvus serves as the vector store)

Install

uv sync

or with pip:

python -m venv .venv
. .venv/bin/activate
pip install .

Configuration

Settings are defined in src/config.py. Start from it or from the environment examples, then set your API keys, Redis URLs, and other settings.
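
As a rough illustration of environment-driven configuration, the sketch below reads settings from environment variables into a frozen dataclass. The variable names (RAIQ_REDIS_URL, RAIQ_OPENAI_API_KEY, RAIQ_ENABLE_CACHE) and the defaults are hypothetical and are not taken from src/config.py; a plain dataclass stands in here for the pydantic settings listed under Technology & Tools.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Strings treated as "true" when parsing boolean flags.
_TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names are illustrative only."""

    redis_url: str
    openai_api_key: Optional[str]
    enable_cache: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        redis_url=env.get("RAIQ_REDIS_URL", "redis://localhost:6379/0"),
        # An unset or empty key becomes None so callers can test for it.
        openai_api_key=env.get("RAIQ_OPENAI_API_KEY") or None,
        enable_cache=env.get("RAIQ_ENABLE_CACHE", "true").strip().lower() in _TRUE_VALUES,
    )
```

Passing an explicit mapping instead of relying on os.environ keeps the loader easy to test without touching the real environment.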

Running the Services

Use the provided docker-compose.yml for local development:

docker-compose up --build

This will start the inference server, Redis instance, and observability stack.

Usage

Index documents with the CLI:

python -m src.indexing --help

Query the inference API at http://localhost:8000 using the inference.app endpoints.
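
A minimal client sketch for querying the service, using only the standard library. The /infer path appears in the data flow diagram; the request body (a single "query" field) and the shape of the JSON response are assumptions, so check src/inference/app.py for the actual schema.

```python
import json
import urllib.request


def build_infer_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the /infer endpoint.

    The {"query": ...} payload is a hypothetical schema, not the documented one.
    """
    payload = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/infer",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, timeout: float = 30.0) -> dict:
    """Send the question to a running service and decode the JSON reply."""
    with urllib.request.urlopen(build_infer_request(question), timeout=timeout) as resp:
        return json.load(resp)
```

With the Docker Compose stack running, ask("What is a semantic cache?") would return the decoded response body.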

Architecture Overview

The repository is organized in functional layers. The sections below explain each component and how data flows through the system.

1. Ingestion pipeline (src/ingestion)

  • Fetchers clone or download raw documentation (git, HTTP, etc.) into data/.
  • Loaders (LlamaIndex SimpleDirectoryReader) read files from the raw directory according to the source configuration.
  • Cleaners apply format-specific sanitization (e.g. strip Hugo frontmatter, remove code fences) and drop excluded paths.
  • Processed documents are serialized as JSON under processed-data/<source>/.

The pipeline is driven by src/ingestion/pipeline.py which orchestrates these steps and offers a CLI.
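
The cleaning step can be sketched roughly as follows. This is not the code in src/ingestion; it is a minimal stand-in that strips leading Hugo frontmatter, drops fenced code blocks, and builds a JSON-serializable record. The function names and the record fields (id, source, text) are hypothetical.

```python
import json
import re
from pathlib import Path

# Fence marker built programmatically so it never appears literally in this example.
FENCE = "`" * 3

# Leading frontmatter delimited by --- (YAML) or +++ (TOML), as Hugo uses.
_FRONTMATTER = re.compile(r"\A(---|\+\+\+)[ \t]*\n(?:.*?\n)?\1[ \t]*(?:\n|\Z)", re.DOTALL)


def strip_code_fences(text: str) -> str:
    """Drop fenced code blocks, including the fence lines themselves."""
    kept, in_fence = [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if not in_fence:
            kept.append(line)
    return "\n".join(kept)


def clean_markdown(text: str) -> str:
    """Remove frontmatter and code fences, then collapse leftover blank runs."""
    text = _FRONTMATTER.sub("", text, count=1)
    text = strip_code_fences(text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def to_record(doc_id: str, source: str, raw_text: str) -> dict:
    """Build a record for one document (field names are illustrative)."""
    return {"id": doc_id, "source": source, "text": clean_markdown(raw_text)}


def write_processed(record: dict, out_root: Path) -> Path:
    """Write the record to <out_root>/<source>/<id>.json and return the path."""
    out_dir = out_root / record["source"]
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{record['id']}.json"
    out_path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return out_path
```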

2. Indexing pipeline (src/indexing)

  • Validator reads the JSON documents and verifies metadata/format.
  • Chunker splits long texts into token-limited chunks (configurable size and overlap).
  • Embedder converts each chunk to a dense vector with a SentenceTransformer model.
  • Milvus client (src/indexing/vector_store.py) batch-inserts chunks with metadata into a Milvus collection configured for HNSW semantic search.

A CLI in src/indexing/__main__.py kicks off run_indexing_pipeline() and can optionally drop + rebuild the collection.
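
To make the size and overlap parameters concrete, here is a minimal sliding-window chunker. It is a sketch, not the chunker in src/indexing: it counts whitespace-separated words as a stand-in for model tokens, and the function names and default values are hypothetical.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> List[str]:
    """Chunk text using whitespace words as an approximation of tokens."""
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), chunk_size, overlap)]
```

Each window starts chunk_size - overlap tokens after the previous one, so neighbouring chunks share context across the boundary at the cost of some duplicated storage.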

3. Inference service (src/inference)

The FastAPI app drives the user-facing API and encapsulates the core RAG/LLM workflow (see src/inference/app.py):

  1. Cache lookup – optional Redis Search layer combining BM25 keyword and vector KNN. On a hit the service returns cached answers with source metadata and metrics.
  2. Retriever – if RAG is enabled, the RetrieverClient queries Milvus for semantically similar chunks, then budgets context tokens via src/rag/context.py.
  3. Prompt builder – assembles few‑shot examples and selected chunks according to src/rag/prompt_builder.py.
  4. LLM backend – pluggable implementations live under src/inference/backends (local transformers, OpenAI, Anthropic, Google).
  5. Response caching – after generation the result is stored in Redis for faster future responses.
  6. Metrics & tracing – every step emits Prometheus metrics, logs, and OpenTelemetry traces via src/observability.
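
The control flow of steps 1 to 5 can be sketched as below. This is an illustration under simplifying assumptions, not the code in src/inference/app.py: an exact-match dict stands in for the Redis Search cache, retrieve and generate are caller-supplied callables standing in for the Milvus retriever and the LLM backend, tokens are approximated by whitespace words, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Answer:
    text: str
    sources: List[str] = field(default_factory=list)
    cache_hit: bool = False


def budget_chunks(chunks: List[str], max_tokens: int) -> List[str]:
    """Keep chunks in ranked order until the word budget would be exceeded."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


def answer_query(
    query: str,
    cache: Dict[str, Answer],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    max_context_tokens: int = 512,
) -> Answer:
    # 1. Cache lookup: return early on a hit.
    cached = cache.get(query)
    if cached is not None:
        return Answer(cached.text, list(cached.sources), cache_hit=True)
    # 2. Retrieve candidate chunks and fit them into the context budget.
    chunks = budget_chunks(retrieve(query), max_context_tokens)
    # 3. Build the prompt from the selected chunks.
    prompt = "Context:\n" + "\n---\n".join(chunks) + f"\n\nQuestion: {query}\nAnswer:"
    # 4. Generate, then 5. store the result for future requests.
    result = Answer(generate(prompt), sources=chunks)
    cache[query] = result
    return result
```

Injecting the retriever and backend as callables mirrors the pluggable design described above and lets the flow be exercised with fakes in tests.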

Additional endpoints provide health checks (/health), metrics (/metrics), and a static web UI served from src/inference/ui/static/index.html.

4. Supporting packages

  • src/cache – Redis cache client plus configuration dataclasses.
  • src/rag – utilities for retrieval context management (context.py) and prompt templates (prompt_builder.py).
  • src/quantization – tools to quantize local models and benchmark them.
  • src/observability – logging config, middleware, metrics definitions, and tracing helpers.
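
For intuition about the vector half of the semantic cache, the sketch below keeps (embedding, answer) pairs in memory and returns the closest stored answer when its cosine similarity clears a threshold. It is not the client in src/cache: the real layer runs on Redis Search and also uses BM25 keyword matching, which is omitted here, and the class name and the 0.9 threshold are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Linear-scan nearest-neighbour cache, for illustration only."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], answer: str) -> None:
        self._entries.append((list(embedding), answer))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best-matching answer, or None below the threshold."""
        best_score, best_answer = 0.0, None
        for stored, answer in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

A higher threshold reduces the risk of serving an answer to a merely related question, at the cost of fewer cache hits.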

5. Deployment & tooling

A docker-compose.yml defines services for the inference API, Redis, Milvus (or alternative vector store), and the observability stack (Prometheus, Grafana, Loki). Dependencies and formatting rules are declared in pyproject.toml (with ruff for linting/formatting).

Data flow diagram

flowchart LR
    subgraph ingestion [Ingestion Pipeline]
        RA[Raw docs] --> FE[Fetchers]
        FE --> LD[Loader LlamaIndex]
        LD --> CL[Cleaners]
        CL --> PD[Processed JSON]
    end

    subgraph indexing [Indexing Pipeline]
        PD --> VA[Validator]
        VA --> CH[Chunker]
        CH --> EM[Embedder]
        EM --> MS[Milvus / Vector Store]
    end

    subgraph inference [Inference Service]
        UI[Web UI / Clients] --> AP[FastAPI / /infer endpoint]
        AP --> CA{Redis Cache}
        CA -- hit --> RE[Return cached answer]
        CA -- miss --> RT[RAG Retriever]
        RT --> MS
        RT --> CB[Context builder]
        CB --> LB[LLM Backend]
        LB --> AP
    end

    subgraph observability [Observability]
        AP --> PM[Prometheus]
        AP --> LO[Loki]
        AP --> TR[OpenTelemetry]
        PM --> GD[Grafana]
    end

Legend: arrows represent primary data/control flows.

Technology & Tools

  • Language: Python 3.12
  • Web framework: FastAPI + Uvicorn
  • Vector store: Milvus, with Redis for the semantic cache
  • Embeddings: HuggingFace & sentence-transformers
  • LLM backends: local (transformers), OpenAI, Anthropic, Google
  • Observability: Prometheus, Grafana, Loki, OpenTelemetry
  • Deployment: Docker Compose for local dev
  • Configuration: environment variables, pydantic settings

Code lives under src/ with subpackages for each feature area.

License

This project is licensed under the MIT License.
