Architecture Overview

This document explains the architecture of opencode-codebase-index, including data flow, component interactions, and key design decisions.

Table of Contents

- High-Level Architecture
- Data Flow
  - Indexing Flow
  - Search Flow
- Component Details
- Design Decisions
- Performance Characteristics
- Security Considerations
- Extending the Architecture

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              OpenCode Agent                                 │
│                                                                             │
│  Tools: codebase_search, codebase_peek, find_similar, call_graph,           │
│         index_codebase, index_status, index_health_check, index_metrics,    │
│         index_logs, add_knowledge_base, list_knowledge_bases,               │
│         remove_knowledge_base                                               │
│  Commands: /search, /find, /call-graph, /index, /status                     │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           TypeScript Layer                                  │
│                                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │   Indexer    │  │  Embeddings  │  │   Watcher    │  │     Git      │     │
│  │              │  │   Provider   │  │              │  │   Detector   │     │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Rust Native Module (NAPI)                           │
│                                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │  Tree-sitter │  │   usearch    │  │    SQLite    │  │     BM25     │     │
│  │   Parser     │  │   Vectors    │  │   Database   │  │ Inverted Idx │     │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Storage Layer                                  │
│                                                                             │
│  .opencode/index/                                                           │
│  ├── codebase.db           # SQLite: embeddings, chunks, branch catalog     │
│  ├── vectors.usearch       # Vector index (uSearch)                         │
│  ├── inverted-index.json   # BM25 keyword index                             │
│  └── file-hashes.json      # File change detection                          │
└─────────────────────────────────────────────────────────────────────────────┘

Data Flow

Indexing Flow

Source Files → Parse → Chunk → Embed → Store

1. COLLECT: File discovery (respects .gitignore)
   └─ src/utils/files.ts: collectFiles()

2. DELTA: Check what's changed
   └─ Compare file hashes (xxhash) against stored hashes
   └─ Only process new/modified files

3. PARSE: Tree-sitter language-aware parsing
   └─ native/src/parser.rs: parse_file()
   └─ Extracts: functions, classes, methods, interfaces
   └─ Includes: JSDoc/docstrings with their code

4. CHUNK: Split large blocks with overlap
   └─ native/src/chunker.rs: semantic chunking
   └─ Preserves code structure boundaries
   └─ Adds overlap for context continuity

5. EMBED: Convert to vectors via AI provider
   └─ src/embeddings/provider.ts
   └─ Deduped by content hash (same code = same embedding)

6. STORE: Persist to disk
   └─ SQLite: embeddings (by hash), chunks, branch catalog
   └─ usearch: vector index for similarity search
   └─ BM25: inverted index for keyword search
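
To make step 4 more concrete, here is a minimal sketch of splitting a large block into overlapping windows. It is an illustration only: the real chunker lives in native/src/chunker.rs and works on syntax-tree boundaries, whereas this sketch works on plain lines. The names `LineChunk`, `chunkWithOverlap`, `maxLines`, and `overlap` are hypothetical.

```typescript
// Hypothetical line-based chunker; not the implementation in native/src/chunker.rs.
interface LineChunk {
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  text: string;
}

function chunkWithOverlap(lines: string[], maxLines: number, overlap: number): LineChunk[] {
  if (maxLines <= 0) throw new Error("maxLines must be positive");
  if (overlap < 0 || overlap >= maxLines) throw new Error("overlap must be in [0, maxLines)");

  const chunks: LineChunk[] = [];
  const step = maxLines - overlap; // each window starts `step` lines after the previous one
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + maxLines, lines.length);
    chunks.push({
      startLine: start + 1,
      endLine: end,
      text: lines.slice(start, end).join("\n"),
    });
    if (end === lines.length) break; // last window reached the end of the block
  }
  return chunks;
}
```

With `maxLines = 4` and `overlap = 1`, a 10-line block yields windows covering lines 1-4, 4-7, and 7-10, so each boundary line appears in both neighbouring chunks.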

Search Flow

Query → Embed → Search → Rank → Return

1. EMBED QUERY
   └─ Same embedding model as indexing
   └─ Single API call (~800-1000ms latency)

2. PARALLEL SEARCH
   ├─ SEMANTIC: usearch cosine similarity
   │  └─ Returns top-K similar vectors
   └─ KEYWORD: BM25 inverted index
      └─ Returns top-K keyword matches

3. HYBRID FUSION
   └─ Combines semantic + keyword candidates
   └─ Fusion controlled by fusionStrategy (rrf default, weighted fallback)
   └─ Deterministic rerank applies to top-N candidates

4. BRANCH FILTER
   └─ Only returns chunks existing on current branch
   └─ Prevents stale results from other branches

5. RETURN RESULTS
   └─ File path, line numbers, code snippet
   └─ Sorted by combined score
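
The fusion in step 3 can be illustrated with reciprocal rank fusion (RRF), the default `fusionStrategy` named above. This is a sketch under assumptions, not the plugin's code: the function name `reciprocalRankFusion` is hypothetical, and the constant `k = 60` is the value commonly used for RRF rather than one stated in this document.

```typescript
// Hypothetical RRF helper: each ranking is a list of chunk IDs, best match first.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60, // assumed smoothing constant; the plugin's actual value is not documented here
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties are broken by ID so the order is deterministic.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}
```

A chunk that appears near the top of both the semantic and the keyword list accumulates two contributions and outranks a chunk that appears in only one list, which is the behaviour hybrid search relies on.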

Component Details

Indexer (`src/indexer/index.ts`)

The central orchestrator. Responsibilities:

- Manages full and incremental indexing
- Coordinates parsing → embedding → storage
- Handles rate limiting and retries
- Tracks per-file hashes for delta detection
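
Delta detection can be sketched as a comparison between freshly computed hashes and the stored ones. The sketch below is illustrative only: `computeDelta` and `Delta` are hypothetical names, the `removed` list is an assumption about how deletions might be tracked, and FNV-1a is used purely as a dependency-free stand-in for the xxhash-based hasher in the native module.

```typescript
// FNV-1a (32-bit): a stand-in for xxhash so the example has no dependencies.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

interface Delta {
  changed: string[];                  // new or modified files that need re-indexing
  removed: string[];                  // files present in the stored hashes but no longer on disk
  nextHashes: Record<string, string>; // hashes to persist after this run
}

// `files` maps path → current content; `stored` maps path → hash from the previous run.
function computeDelta(files: Record<string, string>, stored: Record<string, string>): Delta {
  const changed: string[] = [];
  const nextHashes: Record<string, string> = {};
  for (const [path, content] of Object.entries(files)) {
    const hash = fnv1a(content);
    nextHashes[path] = hash;
    if (stored[path] !== hash) changed.push(path);
  }
  const removed = Object.keys(stored).filter((path) => !(path in files));
  return { changed, removed, nextHashes };
}
```

Only the paths in `changed` would go through parsing and embedding again; unchanged files are skipped entirely.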

Key methods:

| Method | Purpose |
| --- | --- |
| `index()` | Main entry: orchestrates full indexing flow |
| `searchSemantic()` | Pure vector similarity search |
| `searchHybrid()` | Combines semantic + BM25 |
| `cleanup()` | Garbage collection for orphaned data |

Embedding Provider (`src/embeddings/`)

Abstracts different AI embedding APIs:

| Provider | Implementation | Rate Limit Strategy |
| --- | --- | --- |
| GitHub Copilot | OAuth + internal API | 1 concurrent, 4s delay |
| OpenAI | Official API | 3 concurrent, 500ms delay |
| Google | Gemini API | 5 concurrent, 200ms delay |
| Ollama | Local REST | 5 concurrent, no delay |

Detection order: GitHub Copilot → OpenAI → Google → Ollama
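
The rate limits and detection order above can be expressed as plain data plus a lookup. The following is a sketch, not the contents of src/embeddings/detector.ts: the provider identifiers other than `"ollama"`, the `RateLimit` shape, and `detectProvider` are hypothetical names chosen for illustration.

```typescript
// Provider IDs are assumed spellings; only "ollama" appears elsewhere in this document.
type ProviderId = "github-copilot" | "openai" | "google" | "ollama";

interface RateLimit {
  concurrency: number; // maximum requests in flight
  delayMs: number;     // pause between requests
}

// Values taken from the rate limit table above.
const RATE_LIMITS: Record<ProviderId, RateLimit> = {
  "github-copilot": { concurrency: 1, delayMs: 4000 },
  openai: { concurrency: 3, delayMs: 500 },
  google: { concurrency: 5, delayMs: 200 },
  ollama: { concurrency: 5, delayMs: 0 },
};

const DETECTION_ORDER: ProviderId[] = ["github-copilot", "openai", "google", "ollama"];

// Returns the first provider in detection order that is available, if any.
function detectProvider(available: ReadonlySet<ProviderId>): ProviderId | undefined {
  return DETECTION_ORDER.find((id) => available.has(id));
}
```

For example, if both OpenAI credentials and a local Ollama server are present, `detectProvider` picks `"openai"` because it comes earlier in the order.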

Native Module (`native/src/`)

Rust components exposed via NAPI:

| Component | Crate | Purpose |
| --- | --- | --- |
| Parser | tree-sitter-* | Language-aware code parsing |
| VectorStore | usearch | HNSW vector similarity search |
| Database | rusqlite | Persistent storage with batch ops |
| InvertedIndex | Custom | BM25 keyword search |
| Hasher | xxhash-rust | Fast content hashing |

Watcher (`src/watcher/index.ts`)

File system observer using chokidar:

- Watches for file changes → triggers incremental index
- Watches .git/HEAD → detects branch switches
- Debounces rapid changes (500ms window)
- Merges additionalInclude patterns with include patterns for proper file filtering
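
The effect of the 500ms debounce window can be modelled without timers. The sketch below is a pure simulation, not the chokidar-based watcher: given timestamped change events, it shows which files would end up in each incremental indexing run. `FileEvent` and `debounceBatches` are hypothetical names.

```typescript
interface FileEvent {
  path: string;
  atMs: number; // time the change was observed, in milliseconds
}

// Groups events into bursts. A burst ends once no further event arrives
// within `windowMs` of the previous one, mirroring a trailing-edge debounce.
function debounceBatches(events: FileEvent[], windowMs: number): string[][] {
  const sorted = [...events].sort((a, b) => a.atMs - b.atMs);
  const batches: string[][] = [];
  let current = new Set<string>(); // unique paths in the burst being collected
  let lastAt: number | undefined;

  for (const event of sorted) {
    if (lastAt !== undefined && event.atMs - lastAt >= windowMs) {
      batches.push(Array.from(current));
      current = new Set<string>();
    }
    current.add(event.path);
    lastAt = event.atMs;
  }
  if (current.size > 0) batches.push(Array.from(current));
  return batches;
}
```

Three saves within a few hundred milliseconds collapse into one batch, and a file saved twice in the same burst is indexed once.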

Design Decisions

Why Hybrid TypeScript + Rust?

| Layer | Language | Rationale |
| --- | --- | --- |
| Plugin interface | TypeScript | Native OpenCode integration, config parsing |
| Core logic | TypeScript | Orchestration, API calls, easier iteration |
| Hot paths | Rust | Performance: parsing, vectors, DB operations |

The 80/20 rule: TypeScript for flexibility, Rust for speed-critical operations.

Why usearch for Vectors?

Alternatives considered:

- FAISS: heavier, with a complex build; overkill for our scale
- hnswlib: good, but usearch is faster and has F16 support
- In-memory arrays: too slow for 10k+ vectors

usearch advantages:

- F16 quantization → 50% memory savings compared with F32 vectors
- Fast HNSW algorithm
- Simple C++ core, easy Rust bindings
- Persistent on-disk index

Why SQLite for Storage?

Alternatives considered:

- JSON files: no transactions, slow for large data
- LevelDB/RocksDB: overkill, complex keys
- PostgreSQL: external dependency, overkill

SQLite advantages:

- Single-file database
- ACID transactions for batch inserts
- Fast lookups by content hash
- Built-in query capabilities
- Widely supported in Rust

Why BM25 Hybrid Search?

Pure semantic search has weaknesses:

- Misses exact identifier matches
- Can't find "the function named exactly X"
- Embedding models have training cutoffs, so identifiers and APIs newer than the model may embed poorly

BM25 hybrid provides:

- Exact keyword matching for precision
- Fallback when semantic search misses
- Better results for technical queries
- Configurable weighting (hybridWeight)
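
To show why BM25 recovers exact identifier matches, here is a minimal scoring sketch. It is not the inverted index in the native module: it scans documents linearly, uses a simple tokenizer, and assumes the common defaults `k1 = 1.2` and `b = 0.75` with the non-negative IDF variant. All names in it are hypothetical.

```typescript
// Lowercases and splits on anything that is not a letter, digit, or underscore.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9_]+/).filter((token) => token.length > 0);
}

// Returns one BM25 score per document for the given query.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((sum, doc) => sum + doc.length, 0) / Math.max(docs.length, 1);
  const terms = Array.from(new Set(tokenize(query)));

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((token) => token === term).length; // term frequency in this doc
      if (tf === 0) continue;
      const df = tokenized.filter((other) => other.includes(term)).length; // document frequency
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}
```

A query for `parseFile` scores only the chunks that contain that exact token, regardless of how semantically close other chunks are, which is the precision that pure vector search lacks.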

Why Optimized Tool Return Formats?

Problem: Redundant prompt phrases in tool responses increase token usage and may cause LLMs to exit reasoning prematurely.

Solution:

- Remove summary phrases: e.g., "Found X results", "Index status:", "Health check complete:"
- Return raw data: direct result lists without introductory text
- Maintain clarity: keep essential context for unambiguous results

Benefits:

- Reduced token consumption for LLM tool calls
- Faster LLM processing (less text to parse)
- Better integration with LLM reasoning loops
- Maintained functionality with cleaner output
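
The difference between the two output styles can be sketched as two formatters over the same data. The `SearchHit` shape and the `path:start-end` line layout are assumptions made for illustration; the plugin's actual result layout may differ.

```typescript
// Hypothetical result shape, based on the fields listed in the search flow.
interface SearchHit {
  filePath: string;
  startLine: number;
  endLine: number;
  snippet: string;
}

// Optimized style: the result list only, with no introductory phrase.
function formatRaw(hits: SearchHit[]): string {
  return hits
    .map((hit) => `${hit.filePath}:${hit.startLine}-${hit.endLine}\n${hit.snippet}`)
    .join("\n\n");
}

// Earlier style: the same list prefixed with a summary phrase.
function formatVerbose(query: string, hits: SearchHit[]): string {
  return `Found ${hits.length} results for '${query}':\n` + formatRaw(hits);
}
```

Both functions carry the same information; the raw form simply omits the prefix, which is repeated on every tool call.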

Why Branch-Aware Indexing?

Problem: Switching branches changes the code on disk, but recomputing embeddings is expensive.

Solution:

1. Store embeddings by content hash (not by file)
   - Same code = same embedding, regardless of branch
   - Deduplicated storage
2. Branch catalog tracks membership
   - Lightweight: just chunk IDs per branch
   - Instant branch switch (no re-embedding)
3. Filter search by current branch
   - Query only returns relevant results
   - No stale results from other branches
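
A branch catalog of this kind can be sketched as a map from branch name to a set of chunk IDs. This is an in-memory illustration only; the real catalog is stored in SQLite, and `BranchCatalog`, `setBranchChunks`, and `filter` are hypothetical names.

```typescript
// Hypothetical in-memory branch catalog: branch name → chunk IDs present on that branch.
class BranchCatalog {
  private readonly membership = new Map<string, Set<string>>();

  // Replaces the membership set for a branch, e.g. after indexing or a branch switch.
  setBranchChunks(branch: string, chunkIds: Iterable<string>): void {
    this.membership.set(branch, new Set(chunkIds));
  }

  // Keeps only results whose chunk exists on the given branch.
  filter<T extends { chunkId: string }>(branch: string, results: T[]): T[] {
    const ids = this.membership.get(branch);
    if (!ids) return []; // unknown branch: nothing has been indexed for it yet
    return results.filter((result) => ids.has(result.chunkId));
  }
}
```

Because the embeddings themselves are keyed by content hash, switching branches only swaps which ID set is consulted; no vectors are recomputed.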

Why Content-Based Deduplication?

Instead of storing embeddings per-file, we hash the content:

hash(code) → embedding_id

- Same utility function across files? One embedding.
- Copy-paste code? Already embedded.

Benefits:

- Reduces token costs (don't re-embed duplicates)
- Smaller index size
- Faster incremental indexing
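
The deduplication step can be sketched as a planning pass that decides which chunks still need an embedding API call. This is a sketch under assumptions: `planEmbeddings` and `EmbedPlan` are hypothetical names, and FNV-1a stands in for the native xxhash hasher so the example has no dependencies.

```typescript
// FNV-1a (32-bit), a dependency-free stand-in for the native content hasher.
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

interface EmbedPlan {
  toEmbed: Map<string, string>; // content hash → chunk text that still needs embedding
  reusedCount: number;          // chunks covered by an existing or already planned embedding
}

// `known` holds the hashes that already have a stored embedding.
function planEmbeddings(chunks: string[], known: ReadonlySet<string>): EmbedPlan {
  const toEmbed = new Map<string, string>();
  let reusedCount = 0;
  for (const chunk of chunks) {
    const hash = contentHash(chunk);
    if (known.has(hash) || toEmbed.has(hash)) {
      reusedCount++; // duplicate content: no new API call needed
      continue;
    }
    toEmbed.set(hash, chunk);
  }
  return { toEmbed, reusedCount };
}
```

Only the entries in `toEmbed` would be sent to the embedding provider, so copy-pasted code and unchanged chunks cost nothing on later runs.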

Performance Characteristics

Indexing Performance

| Phase | Time Complexity | Actual Performance |
| --- | --- | --- |
| File collection | O(n files) | ~10ms for 1000 files |
| Parsing | O(n files × file size) | ~7ms for 100 files |
| Embedding | O(n chunks) × API latency | Bottleneck (rate limited) |
| Storage | O(n chunks) | ~4ms for 1000 chunks (batch) |

Search Performance

| Phase | Time Complexity | Actual Performance |
| --- | --- | --- |
| Query embedding | O(1) API call | ~800-1000ms |
| Vector search | O(log n) HNSW | ~1ms for 10k vectors |
| BM25 search | O(n tokens) | ~5ms for 50k tokens |
| Result fusion | O(k results) | <1ms |

Total search latency: ~800-1000ms (dominated by embedding API call)

Memory Usage

| Component | Memory Profile |
| --- | --- |
| Vector index | ~3KB per chunk (F16 quantization) |
| SQLite | ~1KB per chunk metadata |
| BM25 index | ~2KB per unique token |

For a typical 500-file codebase (~5000 chunks): ~30MB total
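
The ~30MB figure can be reproduced from the per-item numbers in the table with a small estimate. The unique token count is not stated in this document; the value of 5000 used in the usage note is an assumption chosen because it makes the three components add up to the quoted total. `estimateIndexKB` is a hypothetical helper.

```typescript
// Per-item sizes from the memory usage table, in kilobytes.
const KB_PER_CHUNK_VECTOR = 3;
const KB_PER_CHUNK_METADATA = 1;
const KB_PER_UNIQUE_TOKEN = 2;

// Rough index size estimate in kilobytes.
function estimateIndexKB(chunks: number, uniqueTokens: number): number {
  const vectors = chunks * KB_PER_CHUNK_VECTOR;
  const metadata = chunks * KB_PER_CHUNK_METADATA;
  const bm25 = uniqueTokens * KB_PER_UNIQUE_TOKEN;
  return vectors + metadata + bm25;
}
```

Under that assumption, `estimateIndexKB(5000, 5000)` gives 15,000KB of vectors, 5,000KB of metadata, and 10,000KB of BM25 data, or roughly 30MB in total.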

Tool Call Performance

Tool return formats are optimized to reduce token usage:

| Tool | Before Optimization | After Optimization | Token Savings |
| --- | --- | --- | --- |
| `codebase_search` | "Found X results for 'query': ..." | Raw result list | ~15-20 tokens |
| `codebase_peek` | "Found X locations for 'query': ..." | Raw result list | ~15-20 tokens |
| `find_similar` | "Found X similar code blocks: ..." | Raw result list | ~15-20 tokens |
| `call_graph` | "X calls Y function(s): ..." | Raw result list | ~10-15 tokens |
| `index_status` | "Index status: ..." | Raw data | ~5-10 tokens |
| `formatHealthCheck` | "Health check complete: ..." | Raw data | ~5-10 tokens |

Impact: Reduces LLM context size, improves reasoning loop efficiency, and lowers API costs.

Security Considerations

What Gets Sent to Cloud

| Data | Destination | Purpose |
| --- | --- | --- |
| Code chunks | Embedding provider | Vector generation |
| Search queries | Embedding provider | Query embedding |

The vector index itself stays local. Only code/queries go to the embedding API.

Privacy Options

For maximum privacy, use Ollama:

{ "embeddingProvider": "ollama" }

All processing happens locally. Nothing leaves your machine.

Credential Handling

- GitHub Copilot: Uses OpenCode's OAuth token
- OpenAI/Google: Reads from environment variables
- Ollama: Local REST, no credentials needed

No credentials are stored by the plugin.

Extending the Architecture

Adding a New Language

1. Add tree-sitter grammar to native/Cargo.toml
2. Update native/src/types.rs: Language enum
3. Update native/src/parser.rs:
   - ts_language() match arm
   - is_comment_node() patterns
   - is_semantic_node() patterns
   - Note: Recursion depth is limited to 1024 levels to prevent stack overflow
4. Add tests in native/src/parser.rs

Adding a New Embedding Provider

1. Add detection in src/embeddings/detector.ts
2. Implement embed function in src/embeddings/provider.ts
3. Add rate limit config in src/indexer/index.ts

Adding a New Storage Backend

1. Implement storage interface (see native/src/db.rs)
2. Expose via NAPI in native/src/lib.rs
3. Update src/native/index.ts wrapper
4. Update src/indexer/index.ts to use new backend
