| 1 | +# Architecture Overview |
| 2 | + |
| 3 | +This document explains the architecture of opencode-codebase-index, including data flow, component interactions, and key design decisions. |
| 4 | + |
| 5 | +## High-Level Architecture |
| 6 | + |
| 7 | +``` |
| 8 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 9 | +│ OpenCode Agent │ |
| 10 | +│ │ |
| 11 | +│ Tools: codebase_search, index_codebase, index_status, index_health_check │ |
| 12 | +│ Commands: /search, /find, /index, /status │ |
| 13 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 14 | + │ |
| 15 | + ▼ |
| 16 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 17 | +│ TypeScript Layer │ |
| 18 | +│ │ |
| 19 | +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ |
| 20 | +│ │ Indexer │ │ Embeddings │ │ Watcher │ │ Git │ │ |
| 21 | +│ │ │ │ Provider │ │ │ │ Detector │ │ |
| 22 | +│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │ |
| 23 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 24 | + │ |
| 25 | + ▼ |
| 26 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 27 | +│ Rust Native Module (NAPI) │ |
| 28 | +│ │ |
| 29 | +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ |
| 30 | +│ │ Tree-sitter │ │ usearch │ │ SQLite │ │ BM25 │ │ |
| 31 | +│ │ Parser │ │ Vectors │ │ Database │ │ Inverted Idx │ │ |
| 32 | +│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │ |
| 33 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 34 | + │ |
| 35 | + ▼ |
| 36 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 37 | +│ Storage Layer │ |
| 38 | +│ │ |
| 39 | +│ .opencode/index/ │ |
| 40 | +│ ├── codebase.db # SQLite: embeddings, chunks, branch catalog │ |
| 41 | +│ ├── vectors.usearch # Vector index (uSearch) │ |
| 42 | +│ ├── inverted-index.json # BM25 keyword index │ |
| 43 | +│ └── file-hashes.json # File change detection │ |
| 44 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 45 | +``` |
| 46 | + |
| 47 | +## Data Flow |
| 48 | + |
| 49 | +### Indexing Flow |
| 50 | + |
| 51 | +``` |
| 52 | +Source Files → Parse → Chunk → Embed → Store |
| 53 | + |
| 54 | +1. COLLECT: File discovery (respects .gitignore) |
| 55 | + └─ src/utils/files.ts: collectFiles() |
| 56 | + |
| 57 | +2. DELTA: Check what's changed |
| 58 | + └─ Compare file hashes (xxhash) against stored hashes |
| 59 | + └─ Only process new/modified files |
| 60 | +
|
| 61 | +3. PARSE: Tree-sitter language-aware parsing |
| 62 | + └─ native/src/parser.rs: parse_file() |
| 63 | + └─ Extracts: functions, classes, methods, interfaces |
| 64 | + └─ Includes: JSDoc/docstrings with their code |
| 65 | + |
| 66 | +4. CHUNK: Split large blocks with overlap |
| 67 | + └─ native/src/chunker.rs: semantic chunking |
| 68 | + └─ Preserves code structure boundaries |
| 69 | + └─ Adds overlap for context continuity |
| 70 | + |
| 71 | +5. EMBED: Convert to vectors via AI provider |
| 72 | + └─ src/embeddings/provider.ts |
| 73 | + └─ Deduped by content hash (same code = same embedding) |
| 74 | + |
| 75 | +6. STORE: Persist to disk |
| 76 | + └─ SQLite: embeddings (by hash), chunks, branch catalog |
| 77 | + └─ usearch: vector index for similarity search |
| 78 | + └─ BM25: inverted index for keyword search |
| 79 | +``` |
| 80 | + |
| 81 | +### Search Flow |
| 82 | + |
| 83 | +``` |
| 84 | +Query → Embed → Search → Rank → Return |
| 85 | + |
| 86 | +1. EMBED QUERY |
| 87 | + └─ Same embedding model as indexing |
| 88 | + └─ Single API call (~800-1000ms latency) |
| 89 | + |
| 90 | +2. PARALLEL SEARCH |
| 91 | + ├─ SEMANTIC: usearch cosine similarity |
| 92 | + │ └─ Returns top-K similar vectors |
| 93 | + └─ KEYWORD: BM25 inverted index |
| 94 | + └─ Returns top-K keyword matches |
| 95 | + |
| 96 | +3. HYBRID FUSION |
| 97 | + └─ Combines semantic + keyword scores |
| 98 | + └─ Weights controlled by hybridWeight config |
| 99 | + └─ Produces a single ranked candidate list |
| 100 | + |
| 101 | +4. BRANCH FILTER |
| 102 | + └─ Only returns chunks existing on current branch |
| 103 | + └─ Prevents stale results from other branches |
| 104 | + |
| 105 | +5. RETURN RESULTS |
| 106 | + └─ File path, line numbers, code snippet |
| 107 | + └─ Sorted by combined score |
| 108 | +``` |
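One way the HYBRID FUSION step could work is a weighted sum of normalised scores. The sketch below is an assumption, not the plugin's confirmed method: `Scored`, `normalise`, and `fuse` are hypothetical names, min-max normalisation is a swapped-in technique, and `hybridWeight` is assumed to be the share given to the keyword score (0 = semantic only, 1 = keyword only).

```typescript
interface Scored {
  id: string;    // chunk id
  score: number; // raw score from one retriever
}

// Min-max normalise so cosine similarities and BM25 scores share a 0..1 scale.
function normalise(results: Scored[]): Map<string, number> {
  const out = new Map<string, number>();
  if (results.length === 0) return out;
  const scores = results.map((r) => r.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  for (const r of results) {
    out.set(r.id, max === min ? 1 : (r.score - min) / (max - min));
  }
  return out;
}

// Assumption: hybridWeight weights the keyword side, (1 - hybridWeight) the semantic side.
function fuse(semantic: Scored[], keyword: Scored[], hybridWeight: number): Scored[] {
  const sem = normalise(semantic);
  const kw = normalise(keyword);
  const ids = new Set<string>(semantic.concat(keyword).map((r) => r.id));
  const fused: Scored[] = [];
  ids.forEach((id) => {
    const score = hybridWeight * (kw.get(id) ?? 0) + (1 - hybridWeight) * (sem.get(id) ?? 0);
    fused.push({ id, score });
  });
  return fused.sort((a, b) => b.score - a.score); // best first
}

const ranked = fuse(
  [{ id: "a", score: 0.92 }, { id: "b", score: 0.8 }, { id: "c", score: 0.75 }],
  [{ id: "b", score: 7.1 }, { id: "d", score: 3.0 }],
  0.5,
);
console.log(ranked.map((r) => r.id).slice(0, 2)); // chunk "b" ranks first, then "a"
```

Chunk `b` wins because both retrievers found it, even though `a` has the higher semantic score on its own.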
| 109 | + |
| 110 | +## Component Details |
| 111 | + |
| 112 | +### Indexer (`src/indexer/index.ts`) |
| 113 | + |
| 114 | +The central orchestrator. Responsibilities: |
| 115 | +- Manages full and incremental indexing |
| 116 | +- Coordinates parsing → embedding → storage |
| 117 | +- Handles rate limiting and retries |
| 118 | +- Tracks per-file hashes for delta detection |
| 119 | + |
| 120 | +Key methods: |
| 121 | +| Method | Purpose | |
| 122 | +|--------|---------| |
| 123 | +| `index()` | Main entry: orchestrates full indexing flow | |
| 124 | +| `searchSemantic()` | Pure vector similarity search | |
| 125 | +| `searchHybrid()` | Combines semantic + BM25 | |
| 126 | +| `cleanup()` | Garbage collection for orphaned data | |
| 127 | + |
| 128 | +### Embedding Provider (`src/embeddings/`) |
| 129 | + |
| 130 | +Abstracts different AI embedding APIs: |
| 131 | + |
| 132 | +| Provider | Implementation | Rate Limit Strategy | |
| 133 | +|----------|----------------|---------------------| |
| 134 | +| GitHub Copilot | OAuth + internal API | 1 concurrent, 4s delay | |
| 135 | +| OpenAI | Official API | 3 concurrent, 500ms delay | |
| 136 | +| Google | Gemini API | 5 concurrent, 200ms delay | |
| 137 | +| Ollama | Local REST | 5 concurrent, no delay | |
| 138 | + |
| 139 | +Detection order: GitHub Copilot → OpenAI → Google → Ollama |
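The table and detection order above can be expressed as plain data. This is a sketch under stated assumptions: the provider identifiers other than `ollama` are hypothetical, as are `RATE_LIMITS`, `detectProvider`, and `minWaitMs`, and the delay is assumed to apply once between consecutive rounds of concurrent requests.

```typescript
type ProviderName = "github-copilot" | "openai" | "google" | "ollama";

interface RateLimit {
  concurrency: number; // requests allowed in flight at once
  delayMs: number;     // pause between rounds of requests (assumed semantics)
}

const RATE_LIMITS: Record<ProviderName, RateLimit> = {
  "github-copilot": { concurrency: 1, delayMs: 4000 },
  openai: { concurrency: 3, delayMs: 500 },
  google: { concurrency: 5, delayMs: 200 },
  ollama: { concurrency: 5, delayMs: 0 },
};

const DETECTION_ORDER: ProviderName[] = ["github-copilot", "openai", "google", "ollama"];

// Pick the first provider in detection order that is actually available.
function detectProvider(available: ProviderName[]): ProviderName | undefined {
  return DETECTION_ORDER.find((name) => available.indexOf(name) !== -1);
}

// Lower bound on time spent purely waiting: one delay between consecutive rounds.
function minWaitMs(provider: ProviderName, requests: number): number {
  const { concurrency, delayMs } = RATE_LIMITS[provider];
  const rounds = Math.ceil(requests / concurrency);
  return Math.max(0, rounds - 1) * delayMs;
}

console.log(detectProvider(["ollama", "openai"])); // openai
console.log(minWaitMs("github-copilot", 100));     // 396000
console.log(minWaitMs("openai", 100));             // 16500
```

Under these assumptions, 100 embedding requests spend at least 396 seconds waiting with GitHub Copilot but only 16.5 seconds with OpenAI, which is why embedding dominates indexing time.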
| 140 | + |
| 141 | +### Native Module (`native/src/`) |
| 142 | + |
| 143 | +Rust components exposed via NAPI: |
| 144 | + |
| 145 | +| Component | Crate | Purpose | |
| 146 | +|-----------|-------|---------| |
| 147 | +| Parser | tree-sitter-* | Language-aware code parsing | |
| 148 | +| VectorStore | usearch | HNSW vector similarity search | |
| 149 | +| Database | rusqlite | Persistent storage with batch ops | |
| 150 | +| InvertedIndex | Custom | BM25 keyword search | |
| 151 | +| Hasher | xxhash-rust | Fast content hashing | |
| 152 | + |
| 153 | +### Watcher (`src/watcher/index.ts`) |
| 154 | + |
| 155 | +File system observer using chokidar: |
| 156 | +- Watches for file changes → triggers incremental index |
| 157 | +- Watches `.git/HEAD` → detects branch switches |
| 158 | +- Debounces rapid changes (500ms window) |
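The debounce behaviour can be sketched as a batcher that restarts a timer on every change and flushes once the window passes quietly. This is an illustration only: `createChangeBatcher` and `Timers` are hypothetical names, and the timer functions are injected so the logic can be exercised without real delays.

```typescript
// Minimal timer abstraction; H is whatever handle type the timer returns.
interface Timers<H> {
  set(fn: () => void, ms: number): H;
  clear(handle: H): void;
}

// Collects changed paths and flushes them once no change arrives for waitMs.
function createChangeBatcher<H>(
  onFlush: (paths: string[]) => void,
  waitMs: number,
  timers: Timers<H>,
): (path: string) => void {
  const pending = new Set<string>();
  let handle: H | undefined;

  return function notify(path: string): void {
    pending.add(path); // duplicates within one window collapse
    if (handle !== undefined) timers.clear(handle); // restart the quiet window
    handle = timers.set(() => {
      handle = undefined;
      const batch = Array.from(pending);
      pending.clear();
      onFlush(batch); // e.g. trigger one incremental index run
    }, waitMs);
  };
}
```

With real timers, `timers` would wrap `setTimeout` and `clearTimeout` and `waitMs` would be 500. A burst of saves to the same file then results in one incremental index run rather than one per event.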
| 159 | + |
| 160 | +## Design Decisions |
| 161 | + |
| 162 | +### Why Hybrid TypeScript + Rust? |
| 163 | + |
| 164 | +| Layer | Language | Rationale | |
| 165 | +|-------|----------|-----------| |
| 166 | +| Plugin interface | TypeScript | Native OpenCode integration, config parsing | |
| 167 | +| Core logic | TypeScript | Orchestration, API calls, easier iteration | |
| 168 | +| Hot paths | Rust | Performance: parsing, vectors, DB operations | |
| 169 | + |
| 170 | +The 80/20 rule: most code stays in TypeScript for flexibility; Rust is reserved for the speed-critical operations (parsing, vector search, database access). |
| 171 | + |
| 172 | +### Why usearch for Vectors? |
| 173 | + |
| 174 | +Alternatives considered: |
| 175 | +- **FAISS**: Heavier, complex build, overkill for our scale |
| 176 | +- **hnswlib**: Good, but usearch is faster and has F16 support |
| 177 | +- **In-memory arrays**: Brute-force scan, too slow for 10k+ vectors |
| 178 | + |
| 179 | +usearch advantages: |
| 180 | +- F16 quantization → 50% memory savings compared with F32 vectors |
| 181 | +- Fast HNSW algorithm |
| 182 | +- Simple C++ core, easy Rust bindings |
| 183 | +- Persistent on-disk index |
| 184 | + |
| 185 | +### Why SQLite for Storage? |
| 186 | + |
| 187 | +Alternatives considered: |
| 188 | +- **JSON files**: No transactions, slow for large data |
| 189 | +- **LevelDB/RocksDB**: Overkill, complex keys |
| 190 | +- **PostgreSQL**: External dependency, overkill |
| 191 | + |
| 192 | +SQLite advantages: |
| 193 | +- Single-file database |
| 194 | +- ACID transactions for batch inserts |
| 195 | +- Fast lookups by content hash |
| 196 | +- Built-in query capabilities |
| 197 | +- Widely supported in Rust |
| 198 | + |
| 199 | +### Why BM25 Hybrid Search? |
| 200 | + |
| 201 | +Pure semantic search has weaknesses: |
| 202 | +- Misses exact identifier matches |
| 203 | +- Can't find "the function named exactly X" |
| 204 | +- Embedding models have knowledge cutoffs, so newer or project-specific identifiers may embed poorly |
| 205 | + |
| 206 | +BM25 hybrid provides: |
| 207 | +- Exact keyword matching for precision |
| 208 | +- Fallback when semantic misses |
| 209 | +- Better results for technical queries |
| 210 | +- Configurable weighting (hybridWeight) |
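To illustrate why keyword scoring catches exact identifiers, here is a minimal sketch of the standard Okapi BM25 formula with the common defaults `k1 = 1.2` and `b = 0.75`. It is not the plugin's custom implementation: `tokenize` and `bm25Scores` are hypothetical names, and it scans every document instead of using an inverted index.

```typescript
// Lowercase and split on anything that is not a letter, digit, or underscore.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9_]+/).filter((t) => t.length > 0);
}

// Returns one BM25 score per document for the given query.
function bm25Scores(docs: string[], query: string, k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((sum, d) => sum + d.length, 0) / Math.max(1, docs.length);
  const terms = tokenize(query);

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((t) => t === term).length; // term frequency in this doc
      if (tf === 0) continue;
      const df = tokenized.filter((d) => d.indexOf(term) !== -1).length; // docs containing term
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}

const scores = bm25Scores(
  [
    "export function parseFile(path) { return read(path) }",
    "function renderButton(label) { return label }",
  ],
  "parseFile",
);
console.log(scores[0] > 0, scores[1] === 0); // true true
```

Only the chunk that contains the literal token `parsefile` receives a non-zero score, which is the exact-match behaviour that pure semantic search can miss.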
| 211 | + |
| 212 | +### Why Branch-Aware Indexing? |
| 213 | + |
| 214 | +Problem: Switching branches changes the code on disk, but re-embedding it on every switch is expensive. |
| 215 | + |
| 216 | +Solution: |
| 217 | +1. **Store embeddings by content hash** (not by file) |
| 218 | + - Same code = same embedding, regardless of branch |
| 219 | + - Deduplicated storage |
| 220 | + |
| 221 | +2. **Branch catalog tracks membership** |
| 222 | + - Lightweight: just chunk IDs per branch |
| 223 | + - Instant branch switch (no re-embedding) |
| 224 | + |
| 225 | +3. **Filter search by current branch** |
| 226 | + - Query only returns relevant results |
| 227 | + - No stale results from other branches |
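The branch catalog idea can be sketched as a map from branch name to the set of chunk IDs that exist on it. This is a hypothetical in-memory illustration (`BranchCatalog` and the chunk IDs are invented here); per the storage layout above, the real catalog lives in SQLite.

```typescript
class BranchCatalog {
  // branch name -> ids of chunks that exist on that branch
  private readonly members = new Map<string, Set<string>>();

  setBranch(branch: string, chunkIds: string[]): void {
    this.members.set(branch, new Set(chunkIds));
  }

  // Keep only search hits whose chunk exists on the given branch, preserving rank order.
  filter(branch: string, hitIds: string[]): string[] {
    const ids = this.members.get(branch);
    return ids ? hitIds.filter((id) => ids.has(id)) : [];
  }
}

const catalog = new BranchCatalog();
catalog.setBranch("main", ["chunk-a", "chunk-b"]);
catalog.setBranch("feature/login", ["chunk-a", "chunk-c"]); // chunk-a is shared, embedded once

const hits = ["chunk-c", "chunk-a", "chunk-b"]; // ranked hits from the shared index
const onMain = catalog.filter("main", hits);
const onFeature = catalog.filter("feature/login", hits);
console.log(onMain.join(", "));    // chunk-a, chunk-b
console.log(onFeature.join(", ")); // chunk-c, chunk-a
```

Switching branches changes only which set is consulted; no embedding is recomputed.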
| 228 | + |
| 229 | +### Why Content-Based Deduplication? |
| 230 | + |
| 231 | +Instead of storing embeddings per-file, we hash the content: |
| 232 | +- `hash(code) → embedding_id` |
| 233 | +- Same utility function across files? One embedding. |
| 234 | +- Copy-paste code? Already embedded. |
| 235 | + |
| 236 | +Benefits: |
| 237 | +- Reduces token costs (don't re-embed duplicates) |
| 238 | +- Smaller index size |
| 239 | +- Faster incremental indexing |
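A minimal sketch of the deduplication step, assuming chunks are plain strings and using FNV-1a in place of xxhash so the example has no dependencies. `Chunk`, `contentHash`, and `planEmbeddings` are hypothetical names.

```typescript
interface Chunk {
  file: string;
  content: string;
}

// FNV-1a (32-bit) stands in for xxhash here; any stable content hash would do.
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Returns the unique contents that still need an embedding, keyed by hash.
function planEmbeddings(chunks: Chunk[], alreadyEmbedded: Set<string>): Map<string, string> {
  const toEmbed = new Map<string, string>();
  for (const chunk of chunks) {
    const hash = contentHash(chunk.content);
    if (!alreadyEmbedded.has(hash) && !toEmbed.has(hash)) {
      toEmbed.set(hash, chunk.content);
    }
  }
  return toEmbed;
}

const helper = "export const clamp = (x, lo, hi) => Math.min(hi, Math.max(lo, x));";
const known = "export const noop = () => {};";
const plan = planEmbeddings(
  [
    { file: "src/a.ts", content: helper },
    { file: "src/b.ts", content: helper }, // copy-pasted duplicate
    { file: "src/c.ts", content: known },  // embedded during an earlier run
  ],
  new Set([contentHash(known)]),
);
console.log(plan.size); // 1
```

Three chunks produce a single embedding request: the duplicate collapses onto one hash, and the previously embedded chunk is skipped.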
| 240 | + |
| 241 | +## Performance Characteristics |
| 242 | + |
| 243 | +### Indexing Performance |
| 244 | + |
| 245 | +| Phase | Time Complexity | Actual Performance | |
| 246 | +|-------|-----------------|-------------------| |
| 247 | +| File collection | O(n files) | ~10ms for 1000 files | |
| 248 | +| Parsing | O(n files × file size) | ~7ms for 100 files | |
| 249 | +| Embedding | O(n chunks) × API latency | Bottleneck (rate limited) | |
| 250 | +| Storage | O(n chunks) | ~4ms for 1000 chunks (batch) | |
| 251 | + |
| 252 | +### Search Performance |
| 253 | + |
| 254 | +| Phase | Time Complexity | Actual Performance | |
| 255 | +|-------|-----------------|-------------------| |
| 256 | +| Query embedding | O(1) API call | ~800-1000ms | |
| 257 | +| Vector search | O(log n) HNSW | ~1ms for 10k vectors | |
| 258 | +| BM25 search | O(n tokens) | ~5ms for 50k tokens | |
| 259 | +| Result fusion | O(k results) | <1ms | |
| 260 | + |
| 261 | +**Total search latency**: ~800-1000ms (dominated by embedding API call) |
| 262 | + |
| 263 | +### Memory Usage |
| 264 | + |
| 265 | +| Component | Memory Profile | |
| 266 | +|-----------|----------------| |
| 267 | +| Vector index | ~3KB per chunk (F16 quantization) | |
| 268 | +| SQLite | ~1KB per chunk metadata | |
| 269 | +| BM25 index | ~2KB per unique token | |
| 270 | + |
| 271 | +For a typical 500-file codebase (~5000 chunks): ~30MB total |
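The ~30MB figure can be reproduced from the per-component numbers in the table. The estimator below is a rough sketch under assumptions not stated in this document: 1536-dimensional embeddings (which is what makes an F16 vector about 3KB) and roughly 5000 unique BM25 tokens. HNSW graph overhead is ignored, and `estimateBytes` is a hypothetical helper.

```typescript
interface IndexShape {
  chunks: number;
  dimensions: number;   // embedding dimensions (assumed 1536 below)
  uniqueTokens: number; // distinct BM25 tokens (assumed, not given above)
}

function estimateBytes({ chunks, dimensions, uniqueTokens }: IndexShape) {
  const vectors = chunks * dimensions * 2; // F16 = 2 bytes per dimension
  const sqlite = chunks * 1024;            // ~1KB metadata per chunk
  const bm25 = uniqueTokens * 2048;        // ~2KB per unique token
  return { vectors, sqlite, bm25, total: vectors + sqlite + bm25 };
}

const toMB = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(1);

const estimate = estimateBytes({ chunks: 5000, dimensions: 1536, uniqueTokens: 5000 });
console.log(toMB(estimate.vectors)); // 14.6
console.log(toMB(estimate.sqlite));  // 4.9
console.log(toMB(estimate.bm25));    // 9.8
console.log(toMB(estimate.total));   // 29.3
```

Vectors account for about half the total, so the embedding dimension is the parameter that moves the estimate most.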
| 272 | + |
| 273 | +## Security Considerations |
| 274 | + |
| 275 | +### What Gets Sent to Cloud |
| 276 | + |
| 277 | +| Data | Destination | Purpose | |
| 278 | +|------|-------------|---------| |
| 279 | +| Code chunks | Embedding provider | Vector generation | |
| 280 | +| Search queries | Embedding provider | Query embedding | |
| 281 | + |
| 282 | +The vector index itself stays local. Only code/queries go to the embedding API. |
| 283 | + |
| 284 | +### Privacy Options |
| 285 | + |
| 286 | +For maximum privacy, use Ollama: |
| 287 | +```json |
| 288 | +{ "embeddingProvider": "ollama" } |
| 289 | +``` |
| 290 | +All processing happens locally. Nothing leaves your machine. |
| 291 | + |
| 292 | +### Credential Handling |
| 293 | + |
| 294 | +- GitHub Copilot: Uses OpenCode's OAuth token |
| 295 | +- OpenAI/Google: Reads from environment variables |
| 296 | +- Ollama: Local REST, no credentials needed |
| 297 | + |
| 298 | +No credentials are stored by the plugin. |
| 299 | + |
| 300 | +## Extending the Architecture |
| 301 | + |
| 302 | +### Adding a New Language |
| 303 | + |
| 304 | +1. Add tree-sitter grammar to `native/Cargo.toml` |
| 305 | +2. Update `native/src/types.rs`: `Language` enum |
| 306 | +3. Update `native/src/parser.rs`: |
| 307 | + - `ts_language()` match arm |
| 308 | + - `is_comment_node()` patterns |
| 309 | + - `is_semantic_node()` patterns |
| 310 | +4. Add tests in `native/src/parser.rs` |
| 311 | + |
| 312 | +### Adding a New Embedding Provider |
| 313 | + |
| 314 | +1. Add detection in `src/embeddings/detector.ts` |
| 315 | +2. Implement embed function in `src/embeddings/provider.ts` |
| 316 | +3. Add rate limit config in `src/indexer/index.ts` |
| 317 | + |
| 318 | +### Adding a New Storage Backend |
| 319 | + |
| 320 | +1. Implement storage interface (see `native/src/db.rs`) |
| 321 | +2. Expose via NAPI in `native/src/lib.rs` |
| 322 | +3. Update `src/native/index.ts` wrapper |
| 323 | +4. Update `src/indexer/index.ts` to use new backend |