cocoindex-io/realtime-codebase-indexing

Real-time codebase indexing with CocoIndex and Tree-sitter — language-aware chunking, embeddings, semantic search, and a live vector index in plain async Python

Build Real-Time Codebase Indexing

A live, syntax-aware vector index over your repo — in ~100 lines of plain async Python.
Point it at a codebase, search it in natural language, and it re-embeds only what changes as you edit.

Star us ❤️ → Star CocoIndex on GitHub  ·  CocoIndex documentation  ·  Join the CocoIndex Discord

📕 Documentation  ·  📘 Step-by-step tutorial  ·  🎬 Watch on YouTube


Build a codebase index that's always up to date. CocoIndex has built-in, native Tree-sitter chunking, so it splits along real code structure — functions, classes, blocks — embeds each chunk, and keeps the index fresh with incremental processing: a one-line edit re-embeds one chunk, not the repo. You declare the transformation in native Python and your own types — target_state = transformation(source_state) — and a Rust engine does the incremental processing, change tracking, and managed targets underneath.

Use cases

Use cases for an always-fresh codebase index: coding agents, code-review agents, semantic code search, and MCP for editors

A wide range of applications can be built on an effective codebase index that is always up to date:

  • Semantic code context for AI coding agents like Claude, Codex, and Gemini CLI.
  • MCP for code editors such as Cursor, Windsurf, and VS Code.
  • Context-aware code search — semantic code search, natural-language code retrieval.
  • Code-review agents — automated analysis, quality checks, pull-request summarization.
  • Refactoring & migration — large-scale, automated code changes.
  • SRE workflows — index infra-as-code, deploy scripts, and configs for rapid root-cause analysis and change-impact assessment.
  • Living design docs — generate documentation from code and keep it current.

Why CocoIndex for codebase indexing

Why CocoIndex: syntax-aware Tree-sitter chunking, incremental updates, live mode, plain Python, and a consistent index and query path

  • Syntax-aware chunking, built in. Tree-sitter splits along real code structure, so retrieval returns whole units, not fragments cut mid-statement. Every major language is covered; unknown file types fall back to plain-text splitting.
  • Incremental by default. @coco.fn(memo=True) skips unchanged files and reuses embeddings for unchanged chunks; the target upserts only the rows that changed and deletes orphaned rows. Edit one function → one chunk is re-embedded.
  • Live updates. live=True + cocoindex update -L keeps watching the filesystem and applies changes with low latency — always-fresh context for an agent.
  • Plain Python, your stack. Swap the embedding model (12k+ on Hugging Face), the chunking, or the vector store. No DSL.
  • Consistent index & query. The same embedder is shared by the indexing and query paths, so what you index is what you search.
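
To make the "edit one function → one chunk is re-embedded" claim concrete, here is a minimal conceptual sketch of content-keyed memoization. It is not CocoIndex's implementation: `EmbeddingCache` and `fake_embed` are hypothetical names invented for this illustration, and the stand-in embedder only exists so the example runs without a model.

```python
import hashlib
from typing import Callable

Vector = list[float]


class EmbeddingCache:
    """Reuse embeddings for chunks whose text has not changed (illustrative only)."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._by_hash: dict[str, Vector] = {}
        self.calls = 0  # counts real embedding calls so the saving is visible

    def get(self, text: str) -> Vector:
        # Key the cache on the chunk's content, not on its position in the file.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._by_hash:
            self.calls += 1
            self._by_hash[key] = self._embed(text)
        return self._by_hash[key]


def fake_embed(text: str) -> Vector:
    # Hypothetical stand-in for a real embedding model.
    return [float(len(text)), float(text.count("\n"))]


cache = EmbeddingCache(fake_embed)

before = ["def a():\n    return 1\n", "def b():\n    return 2\n"]
after = ["def a():\n    return 1\n", "def b():\n    return 3\n"]  # only b() was edited

for chunk in before:
    cache.get(chunk)
calls_after_first_pass = cache.calls

for chunk in after:
    cache.get(chunk)
calls_after_second_pass = cache.calls

print(calls_after_first_pass, calls_after_second_pass)
```

The first pass embeds both chunks; the second pass embeds only the edited one, because the unchanged chunk hashes to a key that is already cached.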

How it works

CocoIndex code-embedding flow: localfs.walk_dir source → per-file processing component (detect language, Tree-sitter RecursiveSplitter, coco.map → embed each chunk → declare CodeEmbedding rows) → Postgres pgvector target with a cosine vector index

Walk a repo → detect language → split along the syntax tree with Tree-sitter → embed each chunk → upsert into Postgres (pgvector). With live=True, the source keeps watching and the index stays fresh as you code. The whole indexing path is the snippet below — read it top-to-bottom in main.py:

@coco.fn(memo=True)
async def process_file(file: FileLike, table: postgres.TableTarget[CodeEmbedding]) -> None:
    text = await file.read_text()
    language = detect_code_language(filename=str(file.file_path.path.name))
    chunks = _splitter.split(text, chunk_size=1000, min_chunk_size=300,
                             chunk_overlap=300, language=language)   # Tree-sitter, syntax-aware
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)

@coco.fn
async def process_chunk(chunk, filename, id_gen, table) -> None:
    embedding = await coco.use_context(EMBEDDER).embed(chunk.text)
    table.declare_row(row=CodeEmbedding(
        id=await id_gen.next_id(chunk.text), filename=str(filename), code=chunk.text,
        embedding=embedding, start_line=chunk.start.line, end_line=chunk.end.line,
    ))

@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    table = await postgres.mount_table_target(PG_DB, table_name=TABLE_NAME, ...)
    table.declare_vector_index(column="embedding")
    files = localfs.walk_dir(sourcedir, recursive=True,
                             path_matcher=PatternFilePathMatcher(included_patterns=["**/*.py", ...]),
                             live=True)
    await coco.mount_each(process_file, files.items(), table)

📘 Full tutorial →
Step-by-step walkthrough of the data model, the lifespan, chunking, embedding, the App, and exactly what happens on each kind of change.
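
If you want a feel for what "splitting along the syntax tree" means without installing anything, the sketch below uses Python's standard-library `ast` module as a stand-in for Tree-sitter. That is a deliberate substitution: it only handles Python source, and `Chunk` and `split_top_level` are hypothetical names for this illustration, not part of CocoIndex. The point is only that chunk boundaries follow definitions and carry line ranges.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def split_top_level(source: str) -> list[Chunk]:
    """Return one chunk per top-level function or class definition."""
    lines = source.splitlines()
    chunks: list[Chunk] = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            # A definition's lineno points at `def`/`class`; include decorators above it.
            if node.decorator_list:
                start = min(d.lineno for d in node.decorator_list)
            end = node.end_lineno
            chunks.append(Chunk("\n".join(lines[start - 1:end]), start, end))
    return chunks


SAMPLE = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, item):
        self.items.append(item)
'''

chunks = split_top_level(SAMPLE)
for c in chunks:
    print(c.start_line, c.end_line, c.text.splitlines()[0])
```

Each chunk is a whole function or class, never a fragment cut mid-statement, and its line range can be stored next to the embedding the same way the snippet above stores start_line and end_line.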

Run it

1. Postgres + pgvector. Install Postgres with the pgvector extension if you don't already have it, then point the example at it:

export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"

CocoIndex keeps its own incremental-processing state in a local file (COCOINDEX_DB, default ./cocoindex.db; see .env). By default the flow indexes this repository — set COCOINDEX_SOURCE_PATH to index any other codebase.

2. Install dependencies:

pip install -e .

3. Build / update the index (writes rows into Postgres) — pick one:

cocoindex update main       # catch-up: scan, sync changes, exit
cocoindex update -L main    # live: catch up, then keep watching for edits

4. Query it — semantic search from the terminal:

python main.py "your query"   # one-shot
python main.py                # interactive loop

Semantic search results in the terminal: similarity score, filename, matched line range, and the code snippet

Each result carries start_line/end_line, so hits point straight at the lines that matched. The query uses pgvector's <=> cosine-distance operator, converts the distance into a similarity score, and reuses the same embedder as the indexing path.
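
For intuition about the score: cosine distance is 1 minus cosine similarity, so a similarity can be recovered as 1 minus the distance. The pure-Python sketch below mirrors that arithmetic on toy vectors. It does not talk to Postgres, the file names and vectors are made up, and the exact conversion used in main.py is assumed rather than quoted.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, i.e. 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def similarity(distance: float) -> float:
    # Assumed conversion: undo the "1 -" so that higher means closer.
    return 1.0 - distance


query = [1.0, 0.0]
rows = {  # toy "embeddings", made up for the example
    "a.py": [1.0, 0.0],
    "b.py": [0.0, 1.0],
    "c.py": [1.0, 1.0],
}

# Ascending distance is the same order as descending similarity.
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
for name in ranked:
    print(name, round(similarity(cosine_distance(query, rows[name])), 3))
```

Sorting by ascending distance and sorting by descending similarity give the same ranking, which is why a query can order by the distance operator and still display a similarity.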

Incremental & real-time updates

A file edited and re-chunked: unchanged chunks are reused with no re-embedding, a removed chunk's row is deleted, and a new chunk is embedded and inserted

Edit a file and re-run (or leave cocoindex update -L running): unchanged chunks are reused with no re-embedding, a removed chunk's row is deleted, and a new chunk is embedded and inserted — only the delta moves. That's what keeps the index cheap to maintain and always fresh for an agent.
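
The "only the delta moves" behavior can be pictured as a reconciliation between the rows currently in the target and the rows the latest run declares. The sketch below is a conceptual illustration, not the engine's actual code: `Delta` and `reconcile` are hypothetical names, and rows are reduced to an id mapped to the chunk text.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    upserts: dict[str, str] = field(default_factory=dict)
    deletes: set[str] = field(default_factory=set)


def reconcile(current: dict[str, str], desired: dict[str, str]) -> Delta:
    """Compute the minimal writes that turn `current` into `desired`."""
    delta = Delta()
    for row_id, row in desired.items():
        # New rows and rows whose content differs need a write; identical rows do not.
        if current.get(row_id) != row:
            delta.upserts[row_id] = row
    # Rows that are no longer declared are orphans and get deleted.
    delta.deletes = set(current) - set(desired)
    return delta


current = {"chunk-1": "def a(): ...", "chunk-2": "def b(): ..."}
desired = {"chunk-1": "def a(): ...", "chunk-3": "def c(): ..."}

delta = reconcile(current, desired)
print(delta.upserts, delta.deletes)
```

Here chunk-1 is untouched, chunk-2 is deleted, and chunk-3 is the only row written, which matches the three cases described above.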

Want it production-ready, not DIY?

CocoIndex Code is this exact pipeline — AST-aware chunking, incremental re-index, local embeddings — hardened and packaged as a CLI and an MCP server you can plug straight into a coding or code-review agent.

CocoIndex Code — semantic code search for coding agents, as a CLI and MCP server

npx skills add cocoindex-io/cocoindex-code      # Claude Code skill, then /ccc
claude mcp add cocoindex-code -- ccc mcp        # MCP: Codex, OpenCode, Cursor, any client
ccc index && ccc search "where we embed chunks" # CLI

If this made your agents smarter, give CocoIndex a star ⭐ — it helps a lot.
Documentation · Walkthrough · Discord · See all examples →

About

Build a codebase index with Tree-sitter. It works with large codebases and can be updated in near real time with incremental processing, reprocessing only what has changed.
