
Commit 4e0d380

docs: add ARCHITECTURE.md, TROUBLESHOOTING.md and update AGENTS.md
1 parent bc254fa commit 4e0d380

3 files changed (+675 −1 lines)

AGENTS.md — 2 additions, 1 deletion

```diff
@@ -47,7 +47,7 @@ src/
 
 native/src/
 ├── lib.rs      # NAPI exports: parse_file, VectorStore, Database, InvertedIndex
-├── parser.rs   # Tree-sitter parsing (TS, JS, Python, Rust, Go, JSON)
+├── parser.rs   # Tree-sitter parsing (12 languages: TS, JS, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON)
 ├── chunker.rs  # Semantic chunking with overlap
 ├── store.rs    # usearch vector store (F16 quantization)
 ├── db.rs       # SQLite: embeddings, chunks, branch catalog
@@ -197,6 +197,7 @@ afterEach(() => { fs.rmSync(tempDir, { recursive: true, force: true }); });
 | `watcher.test.ts` | File/git branch watching |
 | `auto-gc.test.ts` | Automatic garbage collection |
 | `git.test.ts` | Git branch detection |
+| `commands.test.ts` | Slash command loader, frontmatter parsing |
 
 ### Benchmarks
 ```bash
```

ARCHITECTURE.md — 323 additions, 0 deletions (new file)

# Architecture Overview

This document explains the architecture of opencode-codebase-index, including data flow, component interactions, and key design decisions.

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               OpenCode Agent                                │
│                                                                             │
│  Tools: codebase_search, index_codebase, index_status, index_health_check   │
│  Commands: /search, /find, /index, /status                                  │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                              TypeScript Layer                               │
│                                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │   Indexer    │  │  Embeddings  │  │   Watcher    │  │     Git      │     │
│  │              │  │   Provider   │  │              │  │   Detector   │     │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                          Rust Native Module (NAPI)                          │
│                                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Tree-sitter  │  │   usearch    │  │    SQLite    │  │     BM25     │     │
│  │    Parser    │  │   Vectors    │  │   Database   │  │ Inverted Idx │     │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                                Storage Layer                                │
│                                                                             │
│  .opencode/index/                                                           │
│  ├── codebase.db          # SQLite: embeddings, chunks, branch catalog      │
│  ├── vectors.usearch      # Vector index (uSearch)                          │
│  ├── inverted-index.json  # BM25 keyword index                              │
│  └── file-hashes.json     # File change detection                           │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Data Flow

### Indexing Flow

```
Source Files → Parse → Chunk → Embed → Store

1. COLLECT: File discovery (respects .gitignore)
   └─ src/utils/files.ts: collectFiles()

2. DELTA: Check what's changed
   └─ Compare file hashes (xxhash) against stored hashes
   └─ Only process new/modified files

3. PARSE: Tree-sitter language-aware parsing
   └─ native/src/parser.rs: parse_file()
   └─ Extracts: functions, classes, methods, interfaces
   └─ Includes: JSDoc/docstrings with their code

4. CHUNK: Split large blocks with overlap
   └─ native/src/chunker.rs: semantic chunking
   └─ Preserves code structure boundaries
   └─ Adds overlap for context continuity

5. EMBED: Convert to vectors via AI provider
   └─ src/embeddings/provider.ts
   └─ Deduped by content hash (same code = same embedding)

6. STORE: Persist to disk
   └─ SQLite: embeddings (by hash), chunks, branch catalog
   └─ usearch: vector index for similarity search
   └─ BM25: inverted index for keyword search
```
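The delta step can be sketched as a hash comparison; the type and function names below are illustrative stand-ins, not the plugin's actual API:

```typescript
// Minimal sketch of delta detection: only files whose content hash
// changed since the last run proceed to parse → chunk → embed → store.
// `HashMap`, `filesToReindex`, and the literal hashes are illustrative.

type HashMap = Record<string, string>; // file path → content hash

function filesToReindex(current: HashMap, stored: HashMap): string[] {
  // A file is reindexed if it is new or its hash differs from last run.
  return Object.keys(current).filter((path) => stored[path] !== current[path]);
}

// Example: one unchanged file, one modified file, one new file.
const stored: HashMap = { "src/a.ts": "h1", "src/b.ts": "h2" };
const current: HashMap = { "src/a.ts": "h1", "src/b.ts": "h9", "src/c.ts": "h3" };

const changed = filesToReindex(current, stored);
// changed is ["src/b.ts", "src/c.ts"]
```

Unchanged files skip every later stage, which is why incremental runs are cheap.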
### Search Flow

```
Query → Embed → Search → Rank → Return

1. EMBED QUERY
   └─ Same embedding model as indexing
   └─ Single API call (~800ms latency)

2. PARALLEL SEARCH
   ├─ SEMANTIC: usearch cosine similarity
   │  └─ Returns top-K similar vectors
   └─ KEYWORD: BM25 inverted index
      └─ Returns top-K keyword matches

3. HYBRID FUSION
   └─ Combines semantic + keyword scores
   └─ Weights controlled by hybridWeight config

4. BRANCH FILTER
   └─ Only returns chunks existing on current branch
   └─ Prevents stale results from other branches

5. RETURN RESULTS
   └─ File path, line numbers, code snippet
   └─ Sorted by combined score
```
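The fusion step can be sketched as a weighted merge of the two result lists; the shapes and the exact normalization are assumptions, not the indexer's real implementation:

```typescript
// Sketch of hybrid fusion: semantic hits weighted by hybridWeight,
// keyword hits by (1 - hybridWeight), summed per chunk, then sorted.
// `Hit` and `fuse` are hypothetical names for illustration.
interface Hit { chunkId: string; score: number }

function fuse(semantic: Hit[], keyword: Hit[], hybridWeight: number): Hit[] {
  const combined = new Map<string, number>();
  for (const h of semantic) {
    combined.set(h.chunkId, (combined.get(h.chunkId) ?? 0) + hybridWeight * h.score);
  }
  for (const h of keyword) {
    combined.set(h.chunkId, (combined.get(h.chunkId) ?? 0) + (1 - hybridWeight) * h.score);
  }
  return [...combined.entries()]
    .map(([chunkId, score]) => ({ chunkId, score }))
    .sort((a, b) => b.score - a.score);
}

const fused = fuse(
  [{ chunkId: "a", score: 0.9 }, { chunkId: "b", score: 0.4 }],
  [{ chunkId: "b", score: 0.8 }, { chunkId: "c", score: 0.6 }],
  0.7,
);
// "a" → 0.7·0.9 = 0.63; "b" → 0.7·0.4 + 0.3·0.8 = 0.52; "c" → 0.18
```

A chunk that appears in both lists ("b" here) accumulates both contributions, which is the point of hybrid search.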
## Component Details

### Indexer (`src/indexer/index.ts`)

The central orchestrator. Responsibilities:
- Manages full and incremental indexing
- Coordinates parsing → embedding → storage
- Handles rate limiting and retries
- Tracks per-file hashes for delta detection

Key methods:

| Method | Purpose |
|--------|---------|
| `index()` | Main entry: orchestrates full indexing flow |
| `searchSemantic()` | Pure vector similarity search |
| `searchHybrid()` | Combines semantic + BM25 |
| `cleanup()` | Garbage collection for orphaned data |
### Embedding Provider (`src/embeddings/`)

Abstracts different AI embedding APIs:

| Provider | Implementation | Rate Limit Strategy |
|----------|----------------|---------------------|
| GitHub Copilot | OAuth + internal API | 1 concurrent, 4s delay |
| OpenAI | Official API | 3 concurrent, 500ms delay |
| Google | Gemini API | 5 concurrent, 200ms delay |
| Ollama | Local REST | 5 concurrent, no delay |

Detection order: GitHub Copilot → OpenAI → Google → Ollama
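The table above can be expressed as a config map; this structure is illustrative, not the literal shape used in `src/indexer/index.ts`:

```typescript
// Rate-limit settings per provider, matching the table above.
// The record keys and `RateLimit` shape are assumptions for illustration.
interface RateLimit { concurrency: number; delayMs: number }

const RATE_LIMITS: Record<string, RateLimit> = {
  "github-copilot": { concurrency: 1, delayMs: 4000 },
  openai: { concurrency: 3, delayMs: 500 },
  google: { concurrency: 5, delayMs: 200 },
  ollama: { concurrency: 5, delayMs: 0 }, // local, no throttling needed
};
```

Keeping these as data rather than scattered constants makes it easy to add a provider without touching the indexing loop.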
### Native Module (`native/src/`)

Rust components exposed via NAPI:

| Component | Crate | Purpose |
|-----------|-------|---------|
| Parser | tree-sitter-* | Language-aware code parsing |
| VectorStore | usearch | HNSW vector similarity search |
| Database | rusqlite | Persistent storage with batch ops |
| InvertedIndex | Custom | BM25 keyword search |
| Hasher | xxhash-rust | Fast content hashing |
### Watcher (`src/watcher/index.ts`)

File system observer using chokidar:
- Watches for file changes → triggers incremental indexing
- Watches `.git/HEAD` → detects branch switches
- Debounces rapid changes (500ms window)
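The 500ms debounce window can be sketched as follows; `debounce` and `onFileChange` are illustrative names, and the real watcher wires this to chokidar events rather than direct calls:

```typescript
// Sketch of the debounce window: rapid change events collapse into a
// single reindex call once 500ms pass without a new event.
function debounce<T extends unknown[]>(fn: (...args: T) => void, waitMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    if (timer !== undefined) clearTimeout(timer); // reset the window
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

let reindexCount = 0;
const onFileChange = debounce(() => { reindexCount += 1; }, 500);

// Three rapid saves schedule only one reindex after the window closes.
onFileChange();
onFileChange();
onFileChange();
```

Without the debounce, an editor's save-on-format or a `git checkout` touching hundreds of files would trigger hundreds of incremental runs.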
## Design Decisions

### Why Hybrid TypeScript + Rust?

| Layer | Language | Rationale |
|-------|----------|-----------|
| Plugin interface | TypeScript | Native OpenCode integration, config parsing |
| Core logic | TypeScript | Orchestration, API calls, easier iteration |
| Hot paths | Rust | Performance: parsing, vectors, DB operations |

The 80/20 rule: TypeScript for flexibility, Rust for speed-critical operations.
### Why usearch for Vectors?

Alternatives considered:
- **FAISS**: Heavier, complex build, overkill for our scale
- **hnswlib**: Good, but usearch is faster and has F16 support
- **In-memory arrays**: Too slow for 10k+ vectors

usearch advantages:
- F16 quantization → 50% memory savings
- Fast HNSW algorithm
- Simple C++ core, easy Rust bindings
- Persistent on-disk index
### Why SQLite for Storage?

Alternatives considered:
- **JSON files**: No transactions, slow for large data
- **LevelDB/RocksDB**: Overkill, complex keys
- **PostgreSQL**: External dependency, overkill

SQLite advantages:
- Single-file database
- ACID transactions for batch inserts
- Fast lookups by content hash
- Built-in query capabilities
- Widely supported in Rust
### Why BM25 Hybrid Search?

Pure semantic search has weaknesses:
- Misses exact identifier matches
- Can't find "the function named exactly X"
- Embedding models have knowledge cutoffs

BM25 hybrid provides:
- Exact keyword matching for precision
- A fallback when semantic search misses
- Better results for technical queries
- Configurable weighting (`hybridWeight`)
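To make the keyword side concrete, here is a per-term BM25 score; the `k1`/`b` values are the common defaults from the literature, not necessarily what the plugin's custom index uses:

```typescript
// One term's BM25 contribution: rare terms (low document frequency)
// in short chunks score highest. Parameter defaults are the textbook
// k1 = 1.2, b = 0.75 — assumptions, not the plugin's tuned values.
function bm25Term(
  tf: number,      // term frequency in this chunk
  df: number,      // number of chunks containing the term
  nDocs: number,   // total chunks in the index
  docLen: number,  // tokens in this chunk
  avgLen: number,  // average tokens per chunk
  k1 = 1.2,
  b = 0.75,
): number {
  const idf = Math.log(1 + (nDocs - df + 0.5) / (df + 0.5));
  return (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLen / avgLen)));
}

// A rare identifier outscores a common keyword at the same frequency:
const rare = bm25Term(2, 3, 1000, 80, 120);    // appears in 3 of 1000 chunks
const common = bm25Term(2, 800, 1000, 80, 120); // appears in 800 of 1000
```

This is exactly why BM25 excels at "the function named exactly X": a unique identifier has tiny document frequency and dominates the ranking.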
### Why Branch-Aware Indexing?

Problem: Switching branches changes code, but embeddings are expensive.

Solution:

1. **Store embeddings by content hash** (not by file)
   - Same code = same embedding, regardless of branch
   - Deduplicated storage

2. **Branch catalog tracks membership**
   - Lightweight: just chunk IDs per branch
   - Instant branch switch (no re-embedding)

3. **Filter search by current branch**
   - Query only returns relevant results
   - No stale results from other branches
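The branch catalog and filter step can be sketched as a per-branch set of chunk IDs; the in-memory `Map` here stands in for the SQLite-backed catalog:

```typescript
// Sketch of the branch catalog: membership is just chunk IDs per
// branch, so switching branches only swaps which set filters results.
// The Map stands in for the real SQLite-backed catalog.
const branchCatalog = new Map<string, Set<string>>([
  ["main", new Set(["c1", "c2"])],
  ["feature/x", new Set(["c1", "c3"])], // c1 shared, c3 branch-only
]);

function filterByBranch(hitIds: string[], branch: string): string[] {
  const members = branchCatalog.get(branch) ?? new Set<string>();
  return hitIds.filter((id) => members.has(id));
}

// A raw search may surface chunks from any branch; the filter keeps
// only those present on the current one.
const onFeature = filterByBranch(["c1", "c2", "c3"], "feature/x");
// onFeature is ["c1", "c3"]
```

Note that "c1" is a member of both branches but stored once, which is the payoff of hashing by content rather than by file.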
### Why Content-Based Deduplication?

Instead of storing embeddings per file, we hash the content:
- `hash(code) → embedding_id`
- Same utility function across files? One embedding.
- Copy-paste code? Already embedded.

Benefits:
- Reduces token costs (duplicates are never re-embedded)
- Smaller index size
- Faster incremental indexing
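The dedup gate before the embedding call can be sketched as follows; `toyHash` is a stand-in for xxhash, and `chunksToEmbed` is an illustrative name:

```typescript
// Sketch of content-based dedup: a chunk is sent to the embedding API
// only if its hash is neither already stored nor already queued.
// `toyHash` is a trivial stand-in for xxhash, for illustration only.
function toyHash(text: string): string {
  let h = 0;
  for (const ch of text) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h.toString(16);
}

function chunksToEmbed(chunks: string[], knownHashes: Set<string>): string[] {
  const pending = new Map<string, string>(); // hash → chunk text
  for (const text of chunks) {
    const h = toyHash(text);
    if (!knownHashes.has(h) && !pending.has(h)) pending.set(h, text);
  }
  return [...pending.values()];
}

// Duplicate chunk text is embedded only once:
const todo = chunksToEmbed(["const x = 1;", "const x = 1;", "let y = 2;"], new Set());
// todo has 2 entries, not 3
```

The same check against `knownHashes` is what makes re-indexing after a branch switch nearly free: most hashes are already in the store.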
## Performance Characteristics

### Indexing Performance

| Phase | Time Complexity | Actual Performance |
|-------|-----------------|--------------------|
| File collection | O(n files) | ~10ms for 1000 files |
| Parsing | O(n files × file size) | ~7ms for 100 files |
| Embedding | O(n chunks) × API latency | Bottleneck (rate limited) |
| Storage | O(n chunks) | ~4ms for 1000 chunks (batch) |

### Search Performance

| Phase | Time Complexity | Actual Performance |
|-------|-----------------|--------------------|
| Query embedding | O(1) API call | ~800-1000ms |
| Vector search | O(log n) HNSW | ~1ms for 10k vectors |
| BM25 search | O(n tokens) | ~5ms for 50k tokens |
| Result fusion | O(k results) | <1ms |

**Total search latency**: ~800-1000ms (dominated by the embedding API call)

### Memory Usage

| Component | Memory Profile |
|-----------|----------------|
| Vector index | ~3KB per chunk (F16 quantization) |
| SQLite | ~1KB per chunk metadata |
| BM25 index | ~2KB per unique token |

For a typical 500-file codebase (~5000 chunks): ~30MB total
## Security Considerations

### What Gets Sent to the Cloud

| Data | Destination | Purpose |
|------|-------------|---------|
| Code chunks | Embedding provider | Vector generation |
| Search queries | Embedding provider | Query embedding |

The vector index itself stays local. Only code and queries go to the embedding API.

### Privacy Options

For maximum privacy, use Ollama:

```json
{ "embeddingProvider": "ollama" }
```

All processing happens locally. Nothing leaves your machine.

### Credential Handling

- GitHub Copilot: Uses OpenCode's OAuth token
- OpenAI/Google: Reads from environment variables
- Ollama: Local REST, no credentials needed

No credentials are stored by the plugin.
## Extending the Architecture

### Adding a New Language

1. Add the tree-sitter grammar to `native/Cargo.toml`
2. Update the `Language` enum in `native/src/types.rs`
3. Update `native/src/parser.rs`:
   - `ts_language()` match arm
   - `is_comment_node()` patterns
   - `is_semantic_node()` patterns
4. Add tests in `native/src/parser.rs`
### Adding a New Embedding Provider

1. Add detection in `src/embeddings/detector.ts`
2. Implement the embed function in `src/embeddings/provider.ts`
3. Add the rate limit config in `src/indexer/index.ts`
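A hypothetical provider shape mirroring the three steps above; the interface name and fields are assumptions, not the plugin's real types:

```typescript
// Illustrative provider shape: detection (step 1), embedding (step 2),
// and rate-limit config (step 3). Names and fields are hypothetical.
interface EmbeddingProvider {
  name: string;
  detect(): Promise<boolean>;                  // step 1: is it configured?
  embed(texts: string[]): Promise<number[][]>; // step 2: texts → vectors
  rateLimit: { concurrency: number; delayMs: number }; // step 3
}

const myProvider: EmbeddingProvider = {
  name: "my-api",
  detect: async () => Boolean(process.env.MY_API_KEY), // env-based detection
  embed: async (texts) => texts.map(() => [0, 0, 0]),  // placeholder vectors
  rateLimit: { concurrency: 2, delayMs: 250 },
};
```

A new provider must use the same model for indexing and querying; vectors from different models are not comparable.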
### Adding a New Storage Backend

1. Implement the storage interface (see `native/src/db.rs`)
2. Expose it via NAPI in `native/src/lib.rs`
3. Update the `src/native/index.ts` wrapper
4. Update `src/indexer/index.ts` to use the new backend