A production-grade project showcasing a full retrieval-augmented generation (RAG) system with a semantic cache for language model APIs. It demonstrates:
- end-to-end document ingestion and preprocessing,
- indexing workflows that chunk, embed, and store text in a vector database,
- a FastAPI-based inference service with optional two-layer Redis caching, semantic retrieval, and pluggable LLM backends,
- quantization utilities for optimizing local models,
- and a robust observability stack (metrics, logs, tracing) paired with Docker Compose orchestration.
Development is centered around modular Python packages under src/ that correspond to each major functional area.
This codebase is built to showcase a robust, scalable architecture for caching and serving embeddings and LLM responses with features such as:
- Document ingestion with cleaning and chunking pipelines.
- Semantic indexing using vector stores and Redis backends.
- Retrieval-augmented generation (RAG) workflows.
- Inference server with pluggable backends (OpenAI, local models, etc.).
- Quantization utilities for model optimization.
- Observability via Prometheus, Grafana, and Loki.
- Comprehensive configuration and environment handling.
The repository is organized into the following high-level packages:
src/
cache/ # Core cache and client logic
indexing/ # Chunking, embedding, validation, and vector store code
inference/ # API server and backend abstractions
ingestion/ # Data ingestion pipelines (cleaners and fetchers)
observability/ # Logging, metrics, tracing support
quantization/ # Model quantization and benchmarking
rag/ # Retrieval-augmented generation utilities
Additional directories like data/, docker-compose.yml, and pyproject.toml support deployment, testing, and packaging.
- Python 3.10+
- poetry or virtual environment manager
- Redis (for caching and vector store)
uv sync
or with pip:
python -m venv .venv
. .venv/bin/activate
pip install .
Copy src/config.py or environment examples and set your API keys, Redis URLs, and other settings.
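As a rough illustration of environment-driven configuration, here is a minimal stdlib sketch. It is not the project's actual settings module (which, per the tech stack, uses pydantic settings); the `Settings` fields and the variable names `REDIS_URL`, `CACHE_ENABLED`, `RAG_ENABLED`, and `OPENAI_API_KEY` are assumptions chosen for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names are illustrative only."""
    redis_url: str = "redis://localhost:6379/0"
    cache_enabled: bool = True
    rag_enabled: bool = True
    openai_api_key: Optional[str] = None


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings from environment variables."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables, falling back to defaults."""
    defaults = Settings()
    return Settings(
        redis_url=env.get("REDIS_URL", defaults.redis_url),
        cache_enabled=_as_bool(env.get("CACHE_ENABLED", str(defaults.cache_enabled))),
        rag_enabled=_as_bool(env.get("RAG_ENABLED", str(defaults.rag_enabled))),
        openai_api_key=env.get("OPENAI_API_KEY"),
    )
```

Passing the mapping explicitly (instead of reading `os.environ` inside the function) keeps the loader easy to test without touching the real environment.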
Use the provided docker-compose.yml for local development:
docker-compose up --build
This will start the inference server, Redis instance, and observability stack.
Index documents with the CLI:
python -m src.indexing --help
Query the inference API at http://localhost:8000 using the inference.app endpoints.
The repository is organized in functional layers. Below is a clearer explanation of each component and how data flows through the system.
- Fetchers clone or download raw documentation (git, HTTP, etc.) into data/.
- Loaders (LlamaIndex SimpleDirectoryReader) read files from the raw directory according to the source configuration.
- Cleaners apply format-specific sanitization (e.g. strip Hugo frontmatter, remove code fences) and drop excluded paths.
- Processed documents are serialized as JSON under processed-data/<source>/.
The pipeline is driven by src/ingestion/pipeline.py which orchestrates these
steps and offers a CLI.
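To make the cleaning step concrete, here is a minimal sketch of two cleaners. The function names are hypothetical and not taken from src/ingestion; the sketch assumes YAML frontmatter delimited by `---` lines and assumes that "remove code fences" means dropping the whole fenced block, contents included.

```python
import re

# Built programmatically so this example can itself live inside a fenced block.
FENCE = "`" * 3

# Leading YAML frontmatter: a '---' line, any content, then a closing '---' line.
FRONTMATTER_RE = re.compile(r"\A---\s*\n.*?\n---\s*\n", re.DOTALL)
# A fenced block: an opening fence at line start through the next fence at line start.
CODE_FENCE_RE = re.compile(rf"^{FENCE}.*?^{FENCE}[ \t]*\n?", re.DOTALL | re.MULTILINE)


def strip_frontmatter(text: str) -> str:
    """Remove a leading frontmatter block, if present."""
    return FRONTMATTER_RE.sub("", text, count=1)


def strip_code_fences(text: str) -> str:
    """Remove fenced code blocks together with their contents."""
    return CODE_FENCE_RE.sub("", text)


def clean_document(text: str) -> str:
    """Apply both cleaners, then collapse runs of blank lines."""
    text = strip_code_fences(strip_frontmatter(text))
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Hugo also allows TOML frontmatter delimited by `+++`; this sketch does not handle that case.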
- Validator reads the JSON documents and verifies metadata/format.
- Chunker splits long texts into token-limited chunks (configurable size and overlap).
- Embedder converts each chunk to a dense vector with a SentenceTransformer model.
- Milvus client (src/indexing/vector_store.py) batch-inserts chunks with metadata into a Milvus collection configured for HNSW semantic search.
A CLI in src/indexing/__main__.py kicks off run_indexing_pipeline() and
can optionally drop and rebuild the collection.
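The chunking step with configurable size and overlap can be sketched as a sliding window. This is an illustration only: the function names are hypothetical, and whitespace-separated words stand in for real tokenizer tokens, which the actual chunker presumably uses.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Chunk text using whitespace-separated words as a stand-in for tokens."""
    return [" ".join(c) for c in chunk_tokens(text.split(), chunk_size, overlap)]
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some tokens twice.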
The FastAPI app drives the user-facing API and encapsulates the core
RAG/LLM workflow (see src/inference/app.py):
- Cache lookup – optional Redis Search layer combining BM25 keyword and vector KNN. On a hit the service returns cached answers with source metadata and metrics.
- Retriever – if RAG is enabled, the RetrieverClient queries Milvus for semantically similar chunks, then budgets context tokens via src/rag/context.py.
- Prompt builder – assembles few-shot examples and selected chunks according to src/rag/prompt_builder.py.
- LLM backend – pluggable implementations live under src/inference/backends (local transformers, OpenAI, Anthropic, Google).
- Response caching – after generation the result is stored in Redis for faster future responses.
- Metrics & tracing – every step emits Prometheus metrics, logs, and OpenTelemetry traces via src/observability.
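The request flow above (cache lookup, retrieval, generation, response caching) can be sketched with in-memory stand-ins. Everything here is hypothetical: `InMemorySemanticCache` and `answer_query` are not names from the codebase, and the cache uses plain cosine similarity against a threshold rather than the hybrid BM25 plus vector KNN search that the Redis layer performs.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class InMemorySemanticCache:
    """Stand-in for the Redis cache: a hit is any entry above the similarity threshold."""
    threshold: float = 0.9
    entries: List[Tuple[Vector, str]] = field(default_factory=list)

    def lookup(self, query_vec: Vector) -> Optional[str]:
        best = max(self.entries, key=lambda e: cosine(e[0], query_vec), default=None)
        if best is not None and cosine(best[0], query_vec) >= self.threshold:
            return best[1]
        return None

    def store(self, query_vec: Vector, answer: str) -> None:
        self.entries.append((query_vec, answer))


def answer_query(
    query: str,
    embed: Callable[[str], Vector],
    cache: InMemorySemanticCache,
    retrieve: Callable[[Vector], List[str]],
    generate: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (answer, cache_hit) following the lookup, retrieve, generate, store order."""
    vec = embed(query)
    cached = cache.lookup(vec)
    if cached is not None:
        return cached, True  # cache hit: skip retrieval and generation entirely

    chunks = retrieve(vec)
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    answer = generate(prompt)
    cache.store(vec, answer)  # make the next similar query a hit
    return answer, False
```

Passing `embed`, `retrieve`, and `generate` as callables mirrors the pluggable-backend idea: the orchestration does not need to know which embedding model, vector store, or LLM sits behind each step.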
Additional endpoints provide health checks (/health), metrics (/metrics),
and a static web UI served from src/inference/ui/static/index.html.
- src/cache – Redis cache client plus configuration dataclasses.
- src/rag – utilities for retrieval context management and prompt templates, including context.py and prompt_builder.py.
- src/quantization – tools to quantize local models and benchmark them.
- src/observability – logging config, middleware, metrics definitions, and tracing helpers.
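As an illustration of the context management that src/rag handles, the sketch below picks retrieved chunks under a token budget. It is an assumption about the general approach, not the contents of context.py: the function name `budget_context` is hypothetical, the selection is a simple greedy pass by score, and word counts stand in for real token counts.

```python
from typing import Callable, List, Tuple


def budget_context(
    scored_chunks: List[Tuple[float, str]],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily keep the highest-scoring chunks whose total size fits max_tokens."""
    selected: List[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        selected.append(text)
        used += cost
    return selected
```

For example, with a budget of 5 and chunks of sizes 3, 4, and 2 (in descending score order), the first and third are kept and the second is skipped.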
A docker-compose.yml defines services for the inference API, Redis,
Milvus (or alternative vector store), and the observability stack (Prometheus,
Grafana, Loki). Dependencies and formatting rules are declared in
pyproject.toml (with ruff for linting/formatting).
flowchart LR
subgraph ingestion [Ingestion Pipeline]
RA[Raw docs] --> FE[Fetchers]
FE --> LD[Loader LlamaIndex]
LD --> CL[Cleaners]
CL --> PD[Processed JSON]
end
subgraph indexing [Indexing Pipeline]
PD --> VA[Validator]
VA --> CH[Chunker]
CH --> EM[Embedder]
EM --> MS[Milvus / Vector Store]
end
subgraph inference [Inference Service]
UI[Web UI / Clients] --> AP[FastAPI / /infer endpoint]
AP --> CA{Redis Cache}
CA -- hit --> RE[Return cached answer]
CA -- miss --> RT[RAG Retriever]
RT --> MS
RT --> CB[Context builder]
CB --> LB[LLM Backend]
LB --> AP
end
subgraph observability [Observability]
AP --> PM[Prometheus]
AP --> LO[Loki]
AP --> TR[OpenTelemetry]
PM --> GD[Grafana]
end
Legend: arrows represent primary data/control flows.
- Language: Python 3.12
- Web framework: FastAPI + Uvicorn
- Vector store: Milvus, with Redis (Redis Search) as the cache layer
- Embeddings: HuggingFace & sentence-transformers
- LLM backends: local (transformers), OpenAI, Anthropic, Google
- Observability: Prometheus, Grafana, Loki, OpenTelemetry
- Deployment: Docker Compose for local dev
- Configuration: environment variables, pydantic settings
Code lives under src/ with subpackages for each feature area.
This project is licensed under the MIT License.