This document explains the architecture of opencode-codebase-index, including data flow, component interactions, and key design decisions.
- High-Level Architecture
- Data Flow
- Component Details
- Design Decisions
- Performance Characteristics
- Security Considerations
- Extending the Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ OpenCode Agent │
│ │
│ Tools: codebase_search, codebase_peek, find_similar, call_graph, │
│ index_codebase, index_status, index_health_check, index_metrics, │
│ index_logs, add_knowledge_base, list_knowledge_bases, │
│ remove_knowledge_base │
│ Commands: /search, /find, /call-graph, /index, /status │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ TypeScript Layer │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Indexer │ │ Embeddings │ │ Watcher │ │ Git │ │
│ │ │ │ Provider │ │ │ │ Detector │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Rust Native Module (NAPI) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Tree-sitter │ │ usearch │ │ SQLite │ │ BM25 │ │
│ │ Parser │ │ Vectors │ │ Database │ │ Inverted Idx │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ │
│ .opencode/index/ │
│ ├── codebase.db # SQLite: embeddings, chunks, branch catalog │
│ ├── vectors.usearch # Vector index (uSearch) │
│ ├── inverted-index.json # BM25 keyword index │
│ └── file-hashes.json # File change detection │
└─────────────────────────────────────────────────────────────────────────────┘
Source Files → Parse → Chunk → Embed → Store
1. COLLECT: File discovery (respects .gitignore)
└─ src/utils/files.ts: collectFiles()
2. DELTA: Check what's changed
└─ Compare file hashes (xxhash) against stored hashes
└─ Only process new/modified files
3. PARSE: Tree-sitter language-aware parsing
└─ native/src/parser.rs: parse_file()
└─ Extracts: functions, classes, methods, interfaces
└─ Includes: JSDoc/docstrings with their code
4. CHUNK: Split large blocks with overlap
└─ native/src/chunker.rs: semantic chunking
└─ Preserves code structure boundaries
└─ Adds overlap for context continuity
5. EMBED: Convert to vectors via AI provider
└─ src/embeddings/provider.ts
└─ Deduped by content hash (same code = same embedding)
6. STORE: Persist to disk
└─ SQLite: embeddings (by hash), chunks, branch catalog
└─ usearch: vector index for similarity search
└─ BM25: inverted index for keyword search
Query → Embed → Search → Rank → Return
1. EMBED QUERY
└─ Same embedding model as indexing
└─ Single API call (~800ms latency)
2. PARALLEL SEARCH
├─ SEMANTIC: usearch cosine similarity
│ └─ Returns top-K similar vectors
└─ KEYWORD: BM25 inverted index
└─ Returns top-K keyword matches
3. HYBRID FUSION
└─ Combines semantic + keyword candidates
└─ Fusion controlled by fusionStrategy (rrf default, weighted fallback)
└─ Deterministic rerank applies to top-N candidates
4. BRANCH FILTER
└─ Only returns chunks existing on current branch
└─ Prevents stale results from other branches
5. RETURN RESULTS
└─ File path, line numbers, code snippet
└─ Sorted by combined score
The central orchestrator. Responsibilities:
- Manages full and incremental indexing
- Coordinates parsing → embedding → storage
- Handles rate limiting and retries
- Tracks per-file hashes for delta detection
Key methods:
| Method | Purpose |
|---|---|
index() |
Main entry: orchestrates full indexing flow |
searchSemantic() |
Pure vector similarity search |
searchHybrid() |
Combines semantic + BM25 |
cleanup() |
Garbage collection for orphaned data |
Abstracts different AI embedding APIs:
| Provider | Implementation | Rate Limit Strategy |
|---|---|---|
| GitHub Copilot | OAuth + internal API | 1 concurrent, 4s delay |
| OpenAI | Official API | 3 concurrent, 500ms delay |
| Gemini API | 5 concurrent, 200ms delay | |
| Ollama | Local REST | 5 concurrent, no delay |
Detection order: GitHub Copilot → OpenAI → Google → Ollama
Rust components exposed via NAPI:
| Component | Crate | Purpose |
|---|---|---|
| Parser | tree-sitter-* | Language-aware code parsing |
| VectorStore | usearch | HNSW vector similarity search |
| Database | rusqlite | Persistent storage with batch ops |
| InvertedIndex | Custom | BM25 keyword search |
| Hasher | xxhash-rust | Fast content hashing |
File system observer using chokidar:
- Watches for file changes → triggers incremental index
- Watches
.git/HEAD→ detects branch switches - Debounces rapid changes (500ms window)
- Merges
additionalIncludepatterns withincludepatterns for proper file filtering
| Layer | Language | Rationale |
|---|---|---|
| Plugin interface | TypeScript | Native OpenCode integration, config parsing |
| Core logic | TypeScript | Orchestration, API calls, easier iteration |
| Hot paths | Rust | Performance: parsing, vectors, DB operations |
The 80/20 rule: TypeScript for flexibility, Rust for speed-critical operations.
Alternatives considered:
- FAISS: Heavier, complex build, overkill for our scale
- hnswlib: Good, but usearch is faster and has F16 support
- In-memory arrays: Too slow for 10k+ vectors
usearch advantages:
- F16 quantization → 50% memory savings
- Fast HNSW algorithm
- Simple C++ core, easy Rust bindings
- Persistent on-disk index
Alternatives considered:
- JSON files: No transactions, slow for large data
- LevelDB/RocksDB: Overkill, complex keys
- PostgreSQL: External dependency, overkill
SQLite advantages:
- Single-file database
- ACID transactions for batch inserts
- Fast lookups by content hash
- Built-in query capabilities
- Widely supported in Rust
Pure semantic search has weaknesses:
- Misses exact identifier matches
- Can't find "the function named exactly X"
- Embedding models have knowledge cutoffs
BM25 hybrid provides:
- Exact keyword matching for precision
- Fallback when semantic misses
- Better results for technical queries
- Configurable weighting (hybridWeight)
Problem: Redundant prompt phrases in tool responses increase token usage and may cause LLMs to exit reasoning prematurely.
Solution:
- Remove summary phrases: e.g., "Found X results", "Index status:", "Health check complete:"
- Return raw data: Direct result lists without introductory text
- Maintain clarity: Keep essential context for unambiguous results
Benefits:
- Reduced token consumption for LLM tool calls
- Faster LLM processing (less text to parse)
- Better integration with LLM reasoning loops
- Maintained functionality with cleaner output
Problem: Switching branches changes code but embeddings are expensive.
Solution:
-
Store embeddings by content hash (not by file)
- Same code = same embedding, regardless of branch
- Deduplicated storage
-
Branch catalog tracks membership
- Lightweight: just chunk IDs per branch
- Instant branch switch (no re-embedding)
-
Filter search by current branch
- Query only returns relevant results
- No stale results from other branches
Instead of storing embeddings per-file, we hash the content:
hash(code) → embedding_id- Same utility function across files? One embedding.
- Copy-paste code? Already embedded.
Benefits:
- Reduces token costs (don't re-embed duplicates)
- Smaller index size
- Faster incremental indexing
| Phase | Time Complexity | Actual Performance |
|---|---|---|
| File collection | O(n files) | ~10ms for 1000 files |
| Parsing | O(n files × file size) | ~7ms for 100 files |
| Embedding | O(n chunks) × API latency | Bottleneck (rate limited) |
| Storage | O(n chunks) | ~4ms for 1000 chunks (batch) |
| Phase | Time Complexity | Actual Performance |
|---|---|---|
| Query embedding | O(1) API call | ~800-1000ms |
| Vector search | O(log n) HNSW | ~1ms for 10k vectors |
| BM25 search | O(n tokens) | ~5ms for 50k tokens |
| Result fusion | O(k results) | <1ms |
Total search latency: ~800-1000ms (dominated by embedding API call)
| Component | Memory Profile |
|---|---|
| Vector index | ~3KB per chunk (F16 quantization) |
| SQLite | ~1KB per chunk metadata |
| BM25 index | ~2KB per unique token |
For a typical 500-file codebase (~5000 chunks): ~30MB total
Tool return formats are optimized to reduce token usage:
| Tool | Before Optimization | After Optimization | Token Savings |
|---|---|---|---|
codebase_search |
"Found X results for 'query': ..." | Raw result list | ~15-20 tokens |
codebase_peek |
"Found X locations for 'query': ..." | Raw result list | ~15-20 tokens |
find_similar |
"Found X similar code blocks: ..." | Raw result list | ~15-20 tokens |
call_graph |
"X calls Y function(s): ..." | Raw result list | ~10-15 tokens |
index_status |
"Index status: ..." | Raw data | ~5-10 tokens |
formatHealthCheck |
"Health check complete: ..." | Raw data | ~5-10 tokens |
Impact: Reduces LLM context size, improves reasoning loop efficiency, and lowers API costs.
| Data | Destination | Purpose |
|---|---|---|
| Code chunks | Embedding provider | Vector generation |
| Search queries | Embedding provider | Query embedding |
The vector index itself stays local. Only code/queries go to the embedding API.
For maximum privacy, use Ollama:
{ "embeddingProvider": "ollama" }All processing happens locally. Nothing leaves your machine.
- GitHub Copilot: Uses OpenCode's OAuth token
- OpenAI/Google: Reads from environment variables
- Ollama: Local REST, no credentials needed
No credentials are stored by the plugin.
- Add tree-sitter grammar to
native/Cargo.toml - Update
native/src/types.rs:Languageenum - Update
native/src/parser.rs:ts_language()match armis_comment_node()patternsis_semantic_node()patterns- Note: Recursion depth is limited to 1024 levels to prevent stack overflow
- Add tests in
native/src/parser.rs
- Add detection in
src/embeddings/detector.ts - Implement embed function in
src/embeddings/provider.ts - Add rate limit config in
src/indexer/index.ts
- Implement storage interface (see
native/src/db.rs) - Expose via NAPI in
native/src/lib.rs - Update
src/native/index.tswrapper - Update
src/indexer/index.tsto use new backend