A Python daemon that watches one or more local folders and continuously ingests documents into a Vectorize Hindsight instance. Supports plain text, PDF, Word, PowerPoint, Excel, and common image formats, with a Tesseract OCR fallback for images and a configurable scan interval.
Watched Folders → Scanner → Extractor → Hindsight API
↑
Manifest DB (SQLite) ← change detection / idempotency
↑
APScheduler ← configurable interval
| Extension | Primary path | Fallback |
|---|---|---|
| .txt .md .log .csv | Hindsight native upload | Plain text read |
| .pdf | Hindsight native upload | PyMuPDF text extraction |
| .docx | Hindsight native upload | python-docx extraction |
| .pptx | Hindsight native upload | python-pptx extraction |
| .xlsx | Local openpyxl extraction | none |
| .png .jpg .jpeg .tiff .bmp .gif | Hindsight native upload (server OCR) | Tesseract OCR |
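The routing in the table above can be pictured as an ordered list of strategies per extension, tried in turn until one succeeds. The following is a minimal sketch of that idea, not the daemon's actual code: the strategy labels, `ROUTES`, `plan`, and `ingest` are hypothetical names introduced for illustration, and the handlers are supplied by the caller.

```python
from pathlib import Path
from typing import Callable

NATIVE = "native_upload"  # hypothetical label for the Hindsight native upload path

# Ordered strategies per extension, mirroring the table above.
# All labels are illustrative, not names taken from the project.
ROUTES: dict[str, list[str]] = {
    **dict.fromkeys((".txt", ".md", ".log", ".csv"), [NATIVE, "plain_text"]),
    ".pdf": [NATIVE, "pymupdf"],
    ".docx": [NATIVE, "python_docx"],
    ".pptx": [NATIVE, "python_pptx"],
    ".xlsx": ["openpyxl"],  # local extraction only, no fallback listed
    **dict.fromkeys(
        (".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif"), [NATIVE, "tesseract"]
    ),
}


def plan(path: str) -> list[str]:
    """Return the ordered strategy names for a file, based on its extension."""
    ext = Path(path).suffix.lower()
    if ext not in ROUTES:
        raise ValueError(f"unsupported extension: {ext!r}")
    return list(ROUTES[ext])  # copy so callers cannot mutate the table


def ingest(path: str, handlers: dict[str, Callable[[str], object]]) -> str:
    """Try each strategy in order and return the name of the one that succeeded."""
    last_error: Exception | None = None
    for name in plan(path):
        try:
            handlers[name](path)
            return name
        except Exception as exc:  # any failure moves on to the next strategy
            last_error = exc
    raise RuntimeError(f"all strategies failed for {path}") from last_error
```

Keeping the order in a table rather than in branching code means a new format only needs one new entry.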
- Python 3.11+
- A running Vectorize Hindsight instance (Docker or cloud)
- Tesseract OCR (optional — only needed as image fallback)
docker run -d -p 8888:8888 -p 9999:9999 vectorize/hindsight:latest
Control plane UI: http://localhost:9999
git clone https://github.com/winoiknow/hindsight-scan-ingest.git
cd hindsight-scan-ingest
bash install.sh
The installer will:
- Check Python 3.11+ and create a virtual environment
- Install all dependencies
- Walk you through every config.yaml setting interactively
- Test the connection to your Hindsight server
- Optionally register a systemd user service that starts the daemon automatically on login or boot
git clone https://github.com/winoiknow/hindsight-scan-ingest.git
cd hindsight-scan-ingest
pip install -r requirements.txt
cp config.yaml config.yaml # edit to taste
Tesseract is only needed as a fallback for image OCR if the Hindsight server-side OCR fails.
# Ubuntu / Debian
sudo apt install tesseract-ocr
# macOS
brew install tesseract
Copy and edit config.yaml:
# Hindsight server
server_url: http://localhost:8888
api_key: "" # blank = unauthenticated (local Docker)
# set for cloud deployments
# Memory routing — must match what the target agent reads
bank_id: "documents"
source: "document-ingest" # context label; shapes Hindsight fact extraction
session: "" # optional session tag stored in memory metadata
# Folders to watch
folders:
- /path/to/your/docs
# - /another/folder
# Scan interval (minutes)
scan_interval_minutes: 15
# Local chunking — off by default (Hindsight chunks server-side)
local_chunking_enabled: false
chunk_size_tokens: 1000
chunk_overlap_tokens: 100
# Supported extensions (edit to restrict or extend)
supported_extensions:
- .txt
- .md
- .pdf
- .docx
- .xlsx
# … etc.
A bank is Hindsight's top-level memory namespace. Every memory retained into bank_id: "sales-agent" is completely invisible to a query against bank_id: "support-agent". Matching the bank ID between this daemon and the agent that reads memories is what connects documents to an agent's knowledge.
Web UI — open the Hindsight control plane at http://localhost:9999 and browse the Banks tab. Each bank shows its profile, memory count, and last-updated time.
REST API — query directly:
curl http://localhost:8888/v1/default/banks | python3 -m json.tool
Each entry in the response includes the bank_id, agent profile/disposition, and statistics.
Check stats for a specific bank:
curl http://localhost:8888/v1/default/banks/my-agent/stats | python3 -m json.tool
Banks are created automatically the first time you retain a memory into them; no separate provisioning step is needed. To create a fresh bank for an agent:
# Seed the bank with your first document batch
python main.py --bank-id my-agent --once --folder /path/to/docs
This creates my-agent on first write and populates it. The bank is immediately queryable by any agent or MCP tool that references the same bank ID.
- Ingest daemon: set bank_id: "my-agent" in config.yaml (or pass --bank-id my-agent).
- Claude Code agent: in the Hindsight plugin settings, set the bank ID to match. The agent will then recall only memories that were ingested into that bank.
To confirm the bank is populated:
curl "http://localhost:8888/v1/default/banks/my-agent/documents" | python3 -m json.tool
A non-empty documents array confirms Hindsight received and processed the ingested files.
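The same check can be scripted from Python with only the standard library. This is a rough sketch under assumptions: it presumes the response is a JSON object with a top-level `documents` list, as the sentence above suggests, and the helper names `documents_url`, `bank_is_populated`, and `fetch_documents` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def documents_url(server_url: str, bank_id: str) -> str:
    """Build the documents endpoint URL for a bank."""
    # The bank ID is percent-encoded defensively; whether Hindsight expects
    # that for unusual IDs is an assumption.
    return f"{server_url.rstrip('/')}/v1/default/banks/{quote(bank_id, safe='')}/documents"


def bank_is_populated(payload: dict) -> bool:
    """True when the payload carries a non-empty 'documents' list (assumed shape)."""
    return bool(payload.get("documents"))


def fetch_documents(server_url: str, bank_id: str, timeout: float = 10.0) -> dict:
    """GET the documents listing and decode it as JSON."""
    with urlopen(documents_url(server_url, bank_id), timeout=timeout) as resp:
        return json.load(resp)
```

With a local server running, `bank_is_populated(fetch_documents("http://localhost:8888", "my-agent"))` would report whether the bank holds any documents.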
| Pattern | bank_id | Use case |
|---|---|---|
| Single shared bank | default | All agents share one knowledge pool |
| Per-agent isolation | agent-name | Each agent has private, non-overlapping memory |
| Per-project isolation | project-slug | Multiple agents work on the same project corpus |
| Per-team + per-agent | team/agent | Hierarchical separation (if Hindsight supports nested IDs) |
# Continuous scan at the configured interval
python3 main.py
# Override the interval from the command line
python3 main.py --interval 5
# Single pass then exit (good for cron)
python3 main.py --once
# Add folders without editing config.yaml
python3 main.py --folder /docs/a --folder /docs/b
# Point at a cloud instance with an API key
python3 main.py --server-url https://api.hindsight.example.com --api-key sk-...
# Full help
python3 main.py --help
| Flag | Default | Description |
|---|---|---|
| --config PATH | config.yaml | Path to config file |
| --server-url URL | from config | Hindsight server URL |
| --api-key KEY | from config | Bearer token for cloud auth |
| --bank-id ID | from config | Memory bank to write to |
| --source TEXT | from config | Context / source label |
| --session TEXT | from config | Session tag for metadata |
| --folder PATH | from config | Add a watched folder (repeatable) |
| --interval MINUTES | from config | Scan interval override |
| --once | off | Single pass then exit |
| --db PATH | ingestion_manifest.db | SQLite manifest path |
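The "from config" defaults in the table imply a precedence rule: a flag given on the command line wins, otherwise the config value is used. The sketch below shows one way such merging could work with `argparse`. It covers only a subset of the flags, and `build_parser` and `merge` are hypothetical names, not functions from this project.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for a subset of the flags in the table above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--config", default="config.yaml")
    parser.add_argument("--server-url")  # None means "use the config value"
    parser.add_argument("--bank-id")
    parser.add_argument("--folder", action="append")  # repeatable
    parser.add_argument("--interval", type=int)
    parser.add_argument("--once", action="store_true")
    return parser


def merge(config: dict, args: argparse.Namespace) -> dict:
    """Overlay command-line values on top of values loaded from the config file."""
    merged = dict(config)
    if args.server_url is not None:
        merged["server_url"] = args.server_url
    if args.bank_id is not None:
        merged["bank_id"] = args.bank_id
    if args.interval is not None:
        merged["scan_interval_minutes"] = args.interval
    # --folder adds to the configured folders rather than replacing them
    merged["folders"] = list(config.get("folders", [])) + (args.folder or [])
    return merged
```

Leaving each flag's default as `None` is what lets the merge step tell "not given" apart from an explicit value.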
- Scan — walks each configured folder recursively, computes SHA-256 per file, compares against the SQLite manifest. Only new or changed files proceed.
- Ingest — sends each file to Hindsight. Primary path is a native file upload (POST /v1/default/banks/{bank_id}/files). On failure, format-specific extractors pull text locally and submit via the memories API.
- Record — on success, the file path + hash + doc ID are written to the manifest. Re-running the same file is a no-op until its content changes.
- Schedule — APScheduler repeats the scan every scan_interval_minutes. Run with --once to use your own cron instead.
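The Scan and Record steps above amount to hash-based change detection against a SQLite table. The following is a minimal sketch of that technique using only the standard library. The table layout and the function names (`open_manifest`, `pending_files`, `record`) are assumptions made for illustration, not the project's actual schema.

```python
import hashlib
import sqlite3
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in blocks so large documents are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def open_manifest(db_path: str) -> sqlite3.Connection:
    """Open (or create) the manifest; the schema here is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manifest ("
        "path TEXT PRIMARY KEY, sha256 TEXT NOT NULL, doc_id TEXT)"
    )
    return conn


def pending_files(conn: sqlite3.Connection, folder: str) -> list[tuple[Path, str]]:
    """Walk the folder recursively and return (path, hash) for new or changed files."""
    pending = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        file_hash = sha256_of(path)
        row = conn.execute(
            "SELECT sha256 FROM manifest WHERE path = ?", (str(path),)
        ).fetchone()
        if row is None or row[0] != file_hash:
            pending.append((path, file_hash))
    return pending


def record(conn: sqlite3.Connection, path: Path, file_hash: str, doc_id: str) -> None:
    """Store the result of a successful ingest so the next scan skips this file."""
    conn.execute(
        "INSERT OR REPLACE INTO manifest (path, sha256, doc_id) VALUES (?, ?, ?)",
        (str(path), file_hash, doc_id),
    )
    conn.commit()
```

Because a row is written only after a successful ingest, a file whose upload failed stays pending and is retried on the next scan.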
pip install -r requirements-dev.txt
# Unit tests (no network required)
pytest tests/ -v
# With coverage
pytest tests/ --cov=hindsight_ingest --cov-report=term-missing
# Integration tests (requires running Hindsight at localhost:8888)
HINDSIGHT_INTEGRATION=1 pytest tests/integration/ -v
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements-dev.txt
pytest tests/ -v