RAIQ

A production-grade project showcasing a full retrieval-augmented generation (RAG) system with a semantic cache for language model APIs. It demonstrates:

end-to-end document ingestion and preprocessing,
indexing workflows that chunk, embed, and store text in a vector database,
a FastAPI-based inference service with optional two-layer Redis caching, semantic retrieval, and pluggable LLM backends,
quantization utilities for optimizing local models,
and a robust observability stack (metrics, logs, tracing) paired with Docker Compose orchestration.

Development is centered around modular Python packages under src/ that correspond to each major functional area.

Project Overview

This codebase is built to showcase a robust, scalable architecture for caching and serving embeddings and LLM responses with features such as:

Document ingestion with cleaning and chunking pipelines.
Semantic indexing using vector stores and Redis backends.
Retrieval-augmented generation (RAG) workflows.
Inference server with pluggable backends (OpenAI, local models, etc.).
Quantization utilities for model optimization.
Observability via Prometheus, Grafana, and Loki.
Comprehensive configuration and environment handling.

Repository Structure

The repository is organized into the following high-level packages:

src/
  cache/             # Core cache and client logic
  indexing/          # Chunking, embedding, validation, and vector store code
  inference/         # API server and backend abstractions
  ingestion/         # Data ingestion pipelines (cleaners and fetchers)
  observability/     # Logging, metrics, tracing support
  quantization/      # Model quantization and benchmarking
  rag/               # Retrieval-augmented generation utilities

Additional files and directories such as data/, docker-compose.yml, and pyproject.toml support deployment, testing, and packaging.

Getting Started

Prerequisites

Python 3.10+
uv, or pip with a virtual environment
Redis (for caching) and Milvus (for the vector store)

Install

uv sync

or with pip:

python -m venv .venv
. .venv/bin/activate
pip install .

Configuration

Settings are defined in src/config.py. Provide your API keys, Redis URLs, and other settings through environment variables (or the environment example files) before starting the services.
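As a rough illustration of environment-driven configuration, the sketch below loads settings using only the standard library. The Settings class, the load_settings function, and the variable names (RAIQ_REDIS_URL, RAIQ_OPENAI_API_KEY, RAIQ_CACHE_ENABLED) are hypothetical and are not taken from src/config.py; the project itself lists pydantic settings under Technology & Tools, so treat this as a stand-in rather than the actual loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names are assumptions."""

    redis_url: str
    openai_api_key: Optional[str]
    cache_enabled: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    # Accept a few common spellings of "true" for the boolean flag.
    cache_flag = env.get("RAIQ_CACHE_ENABLED", "true").lower()
    return Settings(
        redis_url=env.get("RAIQ_REDIS_URL", "redis://localhost:6379/0"),
        openai_api_key=env.get("RAIQ_OPENAI_API_KEY"),
        cache_enabled=cache_flag in {"1", "true", "yes"},
    )
```

Passing a plain dictionary instead of relying on os.environ keeps the loader easy to exercise in tests.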

Running the Services

Use the provided docker-compose.yml for local development:

docker-compose up --build

This starts the inference server, Redis, Milvus, and the observability stack.

Usage

Index documents with the CLI:

python -m src.indexing --help

Query the inference API at http://localhost:8000 using the endpoints defined in src/inference/app.py, such as /infer, /health, and /metrics.
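A client call might look roughly like the sketch below, which only builds the HTTP request. The /infer path appears in the data flow diagram, but the POST method and the "query" field in the JSON body are assumptions; check src/inference/app.py for the real request schema. The helper name build_infer_request is hypothetical.

```python
import json
import urllib.request


def build_infer_request(
    question: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /infer endpoint."""
    # Assumption: the endpoint accepts a JSON body with a "query" field.
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/infer",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the services running, the request can be sent with urllib.request.urlopen(build_infer_request("How do I enable the cache?"), timeout=30) and the response body decoded with json.loads.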

Architecture Overview

The repository is organized in functional layers. The sections below explain each component and how data flows through the system.

1. Ingestion pipeline (`src/ingestion`)

Fetchers clone or download raw documentation (git, HTTP, etc.) into data/.
Loaders (LlamaIndex SimpleDirectoryReader) read files from the raw directory according to the source configuration.
Cleaners apply format-specific sanitization (e.g. strip Hugo frontmatter, remove code fences) and drop excluded paths.
Processed documents are serialized as JSON under processed-data/<source>/.

The pipeline is driven by src/ingestion/pipeline.py which orchestrates these steps and offers a CLI.
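The cleaning step can be sketched as follows. This is a minimal illustration, not the code in src/ingestion: the function names are hypothetical, "remove code fences" is interpreted here as dropping fenced blocks entirely, and excluded paths are modelled as simple path prefixes.

```python
import re
from typing import Tuple

# Built programmatically so this example does not contain a literal fence marker.
FENCE = "`" * 3

# Hugo front matter: a leading block delimited by '---' (YAML) or '+++' (TOML).
FRONT_MATTER = re.compile(
    r"\A(---|\+\+\+)[ \t]*\n(.*?\n)?\1[ \t]*(\n|\Z)", re.DOTALL
)
# A fenced code block: an opening fence line through the matching closing fence line.
CODE_FENCE = re.compile(
    rf"^{FENCE}.*?^{FENCE}[ \t]*$\n?", re.DOTALL | re.MULTILINE
)


def clean_markdown(text: str) -> str:
    """Strip front matter and fenced code blocks, then tidy blank lines."""
    text = FRONT_MATTER.sub("", text, count=1)
    text = CODE_FENCE.sub("", text)
    # Collapse runs of three or more newlines left behind by the removals.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def is_excluded(path: str, excluded_prefixes: Tuple[str, ...]) -> bool:
    """Return True when the path starts with any excluded prefix."""
    return path.startswith(excluded_prefixes)
```

Keeping each cleaner a pure function from text to text makes it straightforward to chain format-specific cleaners per source.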

2. Indexing pipeline (`src/indexing`)

Validator reads the JSON documents and verifies metadata/format.
Chunker splits long texts into token-limited chunks (configurable size and overlap).
Embedder converts each chunk to a dense vector with a SentenceTransformer model.
Milvus client (src/indexing/vector_store.py) batch-inserts chunks with metadata into a Milvus collection configured for HNSW semantic search.

A CLI in src/indexing/__main__.py kicks off run_indexing_pipeline() and can optionally drop + rebuild the collection.
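To make the chunking step concrete, here is a minimal sliding-window sketch with configurable size and overlap. Whitespace splitting stands in for a real tokenizer, and the function names and default values are assumptions rather than the project's actual Chunker API.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Chunk text by whitespace tokens (a stand-in for a model tokenizer)."""
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), chunk_size, overlap)]
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.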

3. Inference service (`src/inference`)

The FastAPI app drives the user-facing API and encapsulates the core RAG/LLM workflow (see src/inference/app.py):

Cache lookup – optional Redis Search layer combining BM25 keyword and vector KNN. On a hit the service returns cached answers with source metadata and metrics.
Retriever – if RAG is enabled, the RetrieverClient queries Milvus for semantically similar chunks, then budgets context tokens via src/rag/context.py.
Prompt builder – assembles few‑shot examples and selected chunks according to src/rag/prompt_builder.py.
LLM backend – pluggable implementations live under src/inference/backends (local transformers, OpenAI, Anthropic, Google).
Response caching – after generation the result is stored in Redis for faster future responses.
Metrics & tracing – every step emits Prometheus metrics, logs, and OpenTelemetry traces via src/observability.
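The request flow above can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not the code in src/inference/app.py: the class and function names are hypothetical, the cache only models the vector-similarity side (the real layer combines BM25 with vector KNN in Redis), and the embed, retrieve, and generate callables are supplied by the caller.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Stand-in for the Redis layer: hit when a stored query is similar enough."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query_vec: Vector) -> Optional[str]:
        best = max(self.entries, key=lambda e: cosine(query_vec, e[0]), default=None)
        if best is not None and cosine(query_vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, query_vec: Vector, answer: str) -> None:
        self.entries.append((query_vec, answer))


def answer_query(
    query: str,
    embed: Callable[[str], Vector],
    cache: InMemorySemanticCache,
    retrieve: Callable[[Vector], List[str]],
    generate: Callable[[str], str],
) -> Dict[str, object]:
    """Cache lookup, then retrieval, prompt assembly, generation, and cache write."""
    query_vec = embed(query)
    cached = cache.lookup(query_vec)
    if cached is not None:
        return {"answer": cached, "cache_hit": True}
    chunks = retrieve(query_vec)
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    answer = generate(prompt)
    cache.store(query_vec, answer)  # later similar queries can skip generation
    return {"answer": answer, "cache_hit": False}
```

Injecting the embedder, retriever, and backend as callables mirrors the pluggable-backend idea: swapping an LLM provider changes only the generate argument.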

Additional endpoints provide health checks (/health), metrics (/metrics), and a static web UI served from src/inference/ui/static/index.html.

4. Supporting packages

src/cache – Redis cache client plus configuration dataclasses.
src/rag – utilities for retrieval context management (context.py) and prompt templates (prompt_builder.py).
src/quantization – tools to quantize local models and benchmark them.
src/observability – logging config, middleware, metrics definitions, and tracing helpers.
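As an example of what context management in src/rag involves, the sketch below budgets context tokens greedily over ranked chunks. It is a hypothetical illustration and not the logic in context.py: the function names are made up, and a whitespace word count stands in for a model-specific tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Approximate token count; a whitespace split stands in for a real tokenizer."""
    return len(text.split())


def select_chunks(ranked_chunks: List[str], max_tokens: int) -> List[str]:
    """Greedily keep the highest-ranked chunks that fit within max_tokens."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # skip this chunk; a smaller, lower-ranked one may still fit
        selected.append(chunk)
        used += cost
    return selected
```

Skipping rather than stopping at the first oversized chunk uses more of the budget, at the cost of occasionally including a lower-ranked chunk ahead of a higher-ranked one that did not fit.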

5. Deployment & tooling

A docker-compose.yml defines services for the inference API, Redis, Milvus (or alternative vector store), and the observability stack (Prometheus, Grafana, Loki). Dependencies and formatting rules are declared in pyproject.toml (with ruff for linting/formatting).

Data flow diagram

flowchart LR
    subgraph ingestion [Ingestion Pipeline]
        RA[Raw docs] --> FE[Fetchers]
        FE --> LD[Loader LlamaIndex]
        LD --> CL[Cleaners]
        CL --> PD[Processed JSON]
    end

    subgraph indexing [Indexing Pipeline]
        PD --> VA[Validator]
        VA --> CH[Chunker]
        CH --> EM[Embedder]
        EM --> MS[Milvus / Vector Store]
    end

    subgraph inference [Inference Service]
        UI[Web UI / Clients] --> AP[FastAPI / /infer endpoint]
        AP --> CA{Redis Cache}
        CA -- hit --> RE[Return cached answer]
        CA -- miss --> RT[RAG Retriever]
        RT --> MS
        RT --> CB[Context builder]
        CB --> LB[LLM Backend]
        LB --> AP
    end

    subgraph observability [Observability]
        AP --> PM[Prometheus]
        AP --> LO[Loki]
        AP --> TR[OpenTelemetry]
        PM --> GD[Grafana]
    end

Legend: arrows represent primary data/control flows.

Technology & Tools

Language: Python 3.12
Web framework: FastAPI + Uvicorn
Vector store: Milvus
Cache: Redis
Embeddings: HuggingFace & sentence-transformers
LLM backends: local (transformers), OpenAI, Anthropic, Google
Observability: Prometheus, Grafana, Loki, OpenTelemetry
Deployment: Docker Compose for local dev
Configuration: environment variables, pydantic settings

Code lives under src/ with subpackages for each feature area.

License

This project is licensed under the MIT License.
