Build Real-Time Codebase Indexing

A live, syntax-aware vector index over your repo — in ~100 lines of plain async Python.
Point it at a codebase, search it in natural language, and it re-embeds only what changes as you edit.

📕 Documentation · 📘 Step-by-step tutorial · 🎬 Watch on YouTube

Build a codebase index that's always up to date. CocoIndex has built-in, native Tree-sitter chunking, so it splits along real code structure (functions, classes, blocks), embeds each chunk, and keeps the index fresh with incremental processing: a one-line edit re-embeds one chunk, not the repo. You declare the transformation in native Python with your own types, as target_state = transformation(source_state), and a Rust engine handles the incremental processing, change tracking, and managed targets underneath.

Use cases

A wide range of applications can be built on an effective codebase index that is always up to date:

Semantic code context for AI coding agents like Claude, Codex, and Gemini CLI.
MCP for code editors such as Cursor, Windsurf, and VS Code.
Context-aware code search — semantic code search, natural-language code retrieval.
Code-review agents — automated analysis, quality checks, pull-request summarization.
Refactoring & migration — large-scale, automated code changes.
SRE workflows — index infra-as-code, deploy scripts, and configs for rapid root-cause analysis and change-impact assessment.
Living design docs — generate documentation from code and keep it current.

Why CocoIndex for codebase indexing

Syntax-aware chunking, built in. Tree-sitter splits along real code structure, so retrieval returns whole units, not fragments cut mid-statement. Every major language is supported; unknown file types fall back to plain-text chunking.
Incremental by default. @coco.fn(memo=True) skips unchanged files and reuses embeddings for unchanged chunks; the target upserts only the rows that changed and deletes orphaned rows. Edit one function → one chunk is re-embedded.
Live updates. live=True together with cocoindex update -L keeps watching the filesystem and applies changes with low latency, giving an agent always-fresh context.
Plain Python, your stack. Swap the embedding model (12k+ on Hugging Face), the chunking, or the vector store. No DSL.
Consistent index & query. The same embedder is shared by the indexing and query paths, so what you index is what you search.
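
One way to picture the "incremental by default" point above is a cache keyed by a content hash. The sketch below is only an illustration of that idea, not how CocoIndex implements memoization: `MemoEmbedder` and `fake_embed` are hypothetical names, and the toy embedding function stands in for a real model.

```python
import hashlib
from typing import Callable


def fake_embed(text: str) -> list[float]:
    """Toy stand-in for a real embedding model (an assumption for this sketch)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:4]]


class MemoEmbedder:
    """Caches embeddings by content hash so unchanged chunks are not re-embedded."""

    def __init__(self, embed: Callable[[str], list[float]]) -> None:
        self._embed = embed
        self._cache: dict[str, list[float]] = {}
        self.calls = 0  # number of real embedding calls, kept for illustration

    def embed(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._embed(chunk)
        return self._cache[key]


embedder = MemoEmbedder(fake_embed)
before = ["def a(): return 1", "def b(): return 2"]
after = ["def a(): return 1", "def b(): return 3"]  # only the second chunk was edited

for chunk in before:
    embedder.embed(chunk)
calls_after_first_run = embedder.calls

for chunk in after:
    embedder.embed(chunk)
calls_after_second_run = embedder.calls
```

The first pass embeds both chunks; the second pass pays for only the edited one, which is the behaviour the list above describes at the chunk level.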

How it works

Walk a repo → detect language → split along the syntax tree with Tree-sitter → embed each chunk → upsert into Postgres (pgvector). With live=True, the source keeps watching and the index stays fresh as you code. The whole indexing path is the snippet below — read it top-to-bottom in main.py:

# memo=True: a file whose content is unchanged since the last run is skipped.
@coco.fn(memo=True)
async def process_file(file: FileLike, table: postgres.TableTarget[CodeEmbedding]) -> None:
    text = await file.read_text()
    # Detect the language from the file name so the splitter can follow its syntax.
    language = detect_code_language(filename=str(file.file_path.path.name))
    chunks = _splitter.split(text, chunk_size=1000, min_chunk_size=300,
                             chunk_overlap=300, language=language)   # Tree-sitter, syntax-aware
    id_gen = IdGenerator()
    # Process every chunk of this file.
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)

@coco.fn
async def process_chunk(chunk, filename, id_gen, table) -> None:
    # The embedder comes from shared context, so indexing and querying use the same model.
    embedding = await coco.use_context(EMBEDDER).embed(chunk.text)
    # Declare the row this chunk should produce in the target table.
    table.declare_row(row=CodeEmbedding(
        id=await id_gen.next_id(chunk.text), filename=str(filename), code=chunk.text,
        embedding=embedding, start_line=chunk.start.line, end_line=chunk.end.line,
    ))

@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    table = await postgres.mount_table_target(PG_DB, table_name=TABLE_NAME, ...)
    table.declare_vector_index(column="embedding")
    # live=True keeps watching the directory after the initial scan.
    files = localfs.walk_dir(sourcedir, recursive=True,
                             path_matcher=PatternFilePathMatcher(included_patterns=["**/*.py", ...]),
                             live=True)
    # Mount one process_file per matched file.
    await coco.mount_each(process_file, files.items(), table)

📘 Full tutorial →
Step-by-step walkthrough of the data model, the lifespan, chunking, embedding, the App, and exactly what happens on each kind of change.
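
To see what "split along the syntax tree" buys you, here is a small dependency-free sketch. It swaps Tree-sitter for Python's standard-library `ast` module, so it handles only Python source and only top-level definitions; `Chunk` and `split_top_level` are hypothetical names for this illustration and are not part of CocoIndex.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def split_top_level(source: str) -> list[Chunk]:
    """Return one chunk per top-level function or class definition."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based, so shift the start for slicing.
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(Chunk(text, node.lineno, node.end_lineno))
    return chunks


SOURCE = '''import os

def greet(name):
    return "hello " + name

class Counter:
    def __init__(self):
        self.n = 0
'''

chunks = split_top_level(SOURCE)
for c in chunks:
    print(c.start_line, c.end_line, c.text.splitlines()[0])
```

Each chunk is a complete definition with its own line range, which is why a search hit can point at a whole function instead of a fragment cut at an arbitrary character offset.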

Run it

1. Postgres + pgvector. Install Postgres with the pgvector extension if you don't already have it, then point the example at it:

export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"

CocoIndex keeps its own incremental-processing state in a local file (COCOINDEX_DB, default ./cocoindex.db; see .env). By default the flow indexes this repository — set COCOINDEX_SOURCE_PATH to index any other codebase.
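
If you want to check these settings before running anything, a sketch like the following reads the three variables named above. `Settings` and `load_settings` are hypothetical helpers, not part of the example's main.py, and using "." as the source-path default is an assumption standing in for "this repository".

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    postgres_url: str
    state_db: Path
    source_path: Path


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables described above; fail early if POSTGRES_URL is missing."""
    if "POSTGRES_URL" not in env:
        raise RuntimeError("POSTGRES_URL is not set")
    return Settings(
        postgres_url=env["POSTGRES_URL"],
        state_db=Path(env.get("COCOINDEX_DB", "./cocoindex.db")),
        # Assumption: "." stands in for "this repository" when no override is given.
        source_path=Path(env.get("COCOINDEX_SOURCE_PATH", ".")),
    )


settings = load_settings(
    {"POSTGRES_URL": "postgres://cocoindex:cocoindex@localhost/cocoindex"}
)
```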

2. Install dependencies:

pip install -e .

3. Build / update the index (writes rows into Postgres) — pick one:

cocoindex update main       # catch-up: scan, sync changes, exit
cocoindex update -L main    # live: catch up, then keep watching for edits

4. Query it — semantic search from the terminal:

python main.py "your query"   # one-shot
python main.py                # interactive loop

Each result carries start_line/end_line, so hits point straight at the lines that matched. The query ranks rows with pgvector's <=> cosine-distance operator, converts the distance into a similarity score, and reuses the same embedder as the indexing path.
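
The ranking step can be sketched in plain Python without a database. pgvector's <=> returns cosine distance, that is, 1 minus cosine similarity; converting back with `1 - distance` is one common choice and an assumption about what main.py does. The `search` helper and the sample rows are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(query_vec: list[float], rows: list[dict], top_k: int = 2) -> list[dict]:
    """Rank rows by ascending distance and report score = 1 - distance."""
    scored = [(cosine_distance(query_vec, r["embedding"]), r) for r in rows]
    scored.sort(key=lambda pair: pair[0])
    return [
        {"filename": r["filename"], "start_line": r["start_line"],
         "end_line": r["end_line"], "score": 1.0 - dist}
        for dist, r in scored[:top_k]
    ]


rows = [
    {"filename": "a.py", "start_line": 1, "end_line": 10, "embedding": [1.0, 0.0]},
    {"filename": "b.py", "start_line": 5, "end_line": 20, "embedding": [0.0, 1.0]},
    {"filename": "c.py", "start_line": 3, "end_line": 8, "embedding": [1.0, 1.0]},
]
results = search([1.0, 0.0], rows)
```

Here a.py matches the query direction exactly and ranks first, c.py comes second, and the orthogonal b.py falls outside the top two.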

Incremental & real-time updates

Edit a file and re-run (or leave cocoindex update -L running): unchanged chunks are reused with no re-embedding, a removed chunk's row is deleted, and a new chunk is embedded and inserted — only the delta moves. That's what keeps the index cheap to maintain and always fresh for an agent.
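
The delta described above can be sketched as a comparison of content-derived row ids between two runs. This is an illustration under stated assumptions, not the engine's actual bookkeeping: `row_id` and `plan_delta` are hypothetical, and the id is simply a truncated SHA-256 of the chunk text.

```python
import hashlib


def row_id(chunk_text: str) -> str:
    """Hypothetical content-derived id: the same text always maps to the same row."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()[:12]


def plan_delta(old_chunks: list[str], new_chunks: list[str]) -> dict[str, list[str]]:
    """Classify chunks as reused, inserted, or deleted by comparing row ids."""
    old = {row_id(c): c for c in old_chunks}
    new = {row_id(c): c for c in new_chunks}
    return {
        "reuse": [new[i] for i in new if i in old],       # no re-embedding needed
        "insert": [new[i] for i in new if i not in old],  # embed, then insert
        "delete": [old[i] for i in old if i not in new],  # orphaned rows
    }


old_chunks = ["def a(): return 1", "def b(): return 2", "def c(): return 3"]
new_chunks = ["def a(): return 1", "def b(): return 20", "def c(): return 3"]
delta = plan_delta(old_chunks, new_chunks)
```

Editing one function produces one insert and one delete, while the two untouched chunks are reused as they are.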

Want it production-ready, not DIY?

CocoIndex Code is this exact pipeline — AST-aware chunking, incremental re-index, local embeddings — hardened and packaged as a CLI and an MCP server you can plug straight into a coding or code-review agent.

npx skills add cocoindex-io/cocoindex-code      # Claude Code skill, then /ccc
claude mcp add cocoindex-code -- ccc mcp        # MCP: Codex, OpenCode, Cursor, any client
ccc index && ccc search "where we embed chunks" # CLI

If this made your agents smarter, give CocoIndex a star ⭐ — it helps a lot.
Documentation · Walkthrough · Discord · See all examples →
