Commit 1c58f26

feat: add TOML and YAML parsing support with tree-sitter
1 parent 4e0d380 commit 1c58f26

5 files changed: +241 -2 lines changed

AGENTS.md

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ src/
 
 native/src/
 ├── lib.rs # NAPI exports: parse_file, VectorStore, Database, InvertedIndex
-├── parser.rs # Tree-sitter parsing (12 languages: TS, JS, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON)
+├── parser.rs # Tree-sitter parsing (14 languages: TS, JS, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON, TOML, YAML)
 ├── chunker.rs # Semantic chunking with overlap
 ├── store.rs # usearch vector store (F16 quantization)
 ├── db.rs # SQLite: embeddings, chunks, branch catalog

CHANGELOG.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.3.0] - 2025-01-16
+
+### Added
+- **Language support**: Java, C#, Ruby, Bash, C, and C++ parsing via tree-sitter
+- **CI improvements**: Rust caching, `cargo fmt --check`, `cargo clippy`, and `cargo test` in workflows
+- **/status command**: Check index health and provider info
+- **Batch operations**: High-performance bulk inserts for embeddings and chunks (~10-18x speedup)
+- **Auto garbage collection**: Configurable automatic cleanup of orphaned embeddings/chunks
+- **Documentation**: ARCHITECTURE.md, TROUBLESHOOTING.md, comprehensive AGENTS.md
+
+### Changed
+- Upgraded tree-sitter from 0.20 to 0.24 (new LANGUAGE constant API)
+- Optimized `embedBatch` for Google and Ollama providers with Promise.all
+- Enhanced skill documentation with filter examples
+
+### Fixed
+- Node version consistency in publish workflow (Node 24 → Node 22)
+- Clippy warnings in Rust code
+
+## [0.2.1] - 2025-01-10
+
+### Fixed
+- Rate limit handling and error messages
+- TypeScript errors in delta.ts
+
+## [0.2.0] - 2025-01-09
+
+### Added
+- **Branch-aware indexing**: Embeddings stored by content hash, branch catalog tracks membership
+- **SQLite storage**: Persistent storage for embeddings, chunks, and branch catalog
+- **Slash commands**: `/search`, `/find`, `/index`, `/status` registered via config hook
+- **Global config support**: `~/.config/opencode/codebase-index.json`
+- **Provider-specific rate limiting**: Ollama has no limits, GitHub Copilot has strict limits
+
+### Changed
+- Migrated from JSON file storage to SQLite database
+- Improved rate limit handling for GitHub Models API (15 req/min)
+
+## [0.1.11] - 2025-01-07
+
+### Added
+- Community standards: LICENSE, Code of Conduct, Contributing guide, Security policy, Issue templates
+
+### Fixed
+- Clippy warnings and TypeScript type errors
+
+## [0.1.10] - 2025-01-06
+
+### Added
+- **F16 quantization**: 50% memory reduction for vector storage
+- **Dead-letter queue**: Failed embedding batches are tracked for retry
+- **JSDoc/docstring extraction**: Comments included with semantic nodes
+- **Overlapping chunks**: Improved context continuity across chunk boundaries
+- **maxChunksPerFile config**: Control token costs for large files
+- **semanticOnly config**: Only index functions/classes, skip generic blocks
+
+### Changed
+- Moved inverted index from TypeScript to Rust native module (performance improvement)
+
+### Fixed
+- GitHub Models API for embeddings instead of Copilot API
+
+## [0.1.9] - 2025-01-05
+
+### Fixed
+- Use GitHub Models API for embeddings instead of Copilot API
+
+## [0.1.8] - 2025-01-04
+
+### Fixed
+- Only export default plugin to prevent OpenCode loader crash
+- Downgrade to zod v3 to match OpenCode SDK version
+
+## [0.1.3] - 2025-01-02
+
+### Changed
+- Use Node.js 24 for npm 11+ trusted publishing support
+- Externalize @opencode-ai/plugin to prevent runtime conflicts
+
+### Fixed
+- ESM output as main entry for Bun/OpenCode compatibility
+- Native binding loading in CJS context
+
+## [0.1.1] - 2025-01-01
+
+### Added
+- CI/CD workflows for testing and publishing
+- Comprehensive README with badges, diagrams, and examples
+
+### Fixed
+- NAPI configuration for OIDC trusted publishing
+
+## [0.1.0] - 2024-12-30
+
+### Added
+- **Initial release**
+- Semantic codebase indexing with tree-sitter parsing
+- Vector similarity search with usearch (HNSW algorithm)
+- Hybrid search combining semantic + BM25 keyword matching
+- Support for TypeScript, JavaScript, Python, Rust, Go, JSON
+- Multiple embedding providers: GitHub Copilot, OpenAI, Google, Ollama
+- Incremental indexing with file hash caching
+- File watcher for automatic re-indexing
+- OpenCode tools: `codebase_search`, `index_codebase`, `index_status`, `index_health_check`
+
+[0.3.0]: https://github.com/Helweg/opencode-codebase-index/compare/v0.2.1...v0.3.0
+[0.2.1]: https://github.com/Helweg/opencode-codebase-index/compare/v0.2.0...v0.2.1
+[0.2.0]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.11...v0.2.0
+[0.1.11]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.10...v0.1.11
+[0.1.10]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.9...v0.1.10
+[0.1.9]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.8...v0.1.9
+[0.1.8]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.3...v0.1.8
+[0.1.3]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.1...v0.1.3
+[0.1.1]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.0...v0.1.1
+[0.1.0]: https://github.com/Helweg/opencode-codebase-index/releases/tag/v0.1.0
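
Reviewer note: the 0.2.0 entry above describes embeddings stored by content hash, with a branch catalog tracking membership. The following is a minimal in-memory sketch of that idea, not the project's code. The real storage is SQLite according to the changelog; `BranchAwareIndex`, `content_hash`, and `add_chunk` are hypothetical names, and `DefaultHasher` stands in for whatever hash the project actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Hash a chunk's text. Assumption: any stable content hash would do here.
fn content_hash(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical in-memory stand-in for the SQLite tables.
#[derive(Default)]
struct BranchAwareIndex {
    /// One embedding per distinct chunk content, shared by all branches.
    embeddings: HashMap<u64, Vec<f32>>,
    /// Branch name -> set of content hashes present on that branch.
    catalog: HashMap<String, HashSet<u64>>,
}

impl BranchAwareIndex {
    /// Record `text` as present on `branch`. The (expensive) `embed` call
    /// runs only when this content has not been seen on any branch.
    /// Returns true if a new embedding was computed.
    fn add_chunk(&mut self, branch: &str, text: &str, embed: impl Fn(&str) -> Vec<f32>) -> bool {
        let key = content_hash(text);
        self.catalog
            .entry(branch.to_string())
            .or_default()
            .insert(key);
        if self.embeddings.contains_key(&key) {
            return false;
        }
        self.embeddings.insert(key, embed(text));
        true
    }
}
```

In this sketch, adding the same chunk text on a second branch only updates the catalog and reuses the existing embedding, which is the deduplication the changelog entry refers to.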

README.md

Lines changed: 1 addition & 1 deletion
@@ -118,7 +118,7 @@ graph TD
 
 1. **Parsing**: We use `tree-sitter` to intelligently parse your code into meaningful blocks (functions, classes, interfaces). JSDoc comments and docstrings are automatically included with their associated code.
 
-**Supported Languages**: TypeScript, JavaScript, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON
+**Supported Languages**: TypeScript, JavaScript, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON, TOML, YAML
 2. **Chunking**: Large blocks are split with overlapping windows to preserve context across chunk boundaries.
 3. **Embedding**: These blocks are converted into vector representations using your configured AI provider.
 4. **Storage**: Embeddings are stored in SQLite (deduplicated by content hash) and vectors in `usearch` with F16 quantization for 50% memory savings. A branch catalog tracks which chunks exist on each branch.
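
Reviewer note: the chunking step in the README context above splits large blocks with overlapping windows. A minimal sketch of that technique over lines follows. It is an illustration only: `overlapping_windows` and its `window`/`overlap` parameters are hypothetical, and the project's `chunker.rs` may measure size differently (for example by tokens or bytes).

```rust
/// Split `lines` into windows of at most `window` lines, where each window
/// repeats the last `overlap` lines of the previous one.
/// Hypothetical helper for illustration; not taken from chunker.rs.
fn overlapping_windows<'a>(
    lines: &[&'a str],
    window: usize,
    overlap: usize,
) -> Vec<Vec<&'a str>> {
    assert!(window > 0, "window must be non-zero");
    assert!(overlap < window, "overlap must be smaller than window");

    // Each new window starts `step` lines after the previous one.
    let step = window - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < lines.len() {
        let end = (start + window).min(lines.len());
        chunks.push(lines[start..end].to_vec());
        // Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if end == lines.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

With seven lines, `window = 4`, and `overlap = 2`, this yields three chunks covering lines 1-4, 3-6, and 5-7, so every chunk boundary is also seen in context by its neighbour.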

native/Cargo.toml

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,9 @@ tree-sitter-ruby = "0.23"
 tree-sitter-bash = "0.23"
 tree-sitter-c = "0.23"
 tree-sitter-cpp = "0.23"
+tree-sitter-toml-ng = "0.7"
+tree-sitter-yaml = "0.7"
+tree-sitter-language = "0.1"
 
 usearch = "2.23"
 

native/src/parser.rs

Lines changed: 114 additions & 0 deletions
@@ -39,6 +39,8 @@ pub fn parse_file_internal(file_path: &str, content: &str) -> Result<Vec<CodeChu
         Language::Bash => tree_sitter_bash::LANGUAGE.into(),
         Language::C => tree_sitter_c::LANGUAGE.into(),
         Language::Cpp => tree_sitter_cpp::LANGUAGE.into(),
+        Language::Toml => tree_sitter_toml_ng::LANGUAGE.into(),
+        Language::Yaml => tree_sitter_yaml::LANGUAGE.into(),
         _ => return Ok(chunk_by_lines(content, &language)),
     };
 
@@ -211,6 +213,12 @@ fn is_comment_node(node_type: &str, language: &Language) -> bool {
         Language::C | Language::Cpp => {
             matches!(node_type, "comment")
         }
+        Language::Toml => {
+            matches!(node_type, "comment")
+        }
+        Language::Yaml => {
+            matches!(node_type, "comment")
+        }
         _ => false,
     }
 }
@@ -309,6 +317,12 @@ fn is_semantic_node(node_type: &str, language: &Language) -> bool {
                     | "template_declaration"
             )
         }
+        Language::Toml => {
+            matches!(node_type, "table" | "table_array_element")
+        }
+        Language::Yaml => {
+            matches!(node_type, "block_mapping_pair" | "block_sequence")
+        }
         _ => false,
     }
 }
@@ -771,4 +785,104 @@ T max(T a, T b) {
             "Should find class_specifier or namespace_definition"
         );
     }
+
+    #[test]
+    fn test_parse_toml() {
+        let content = r#"
+# This is a TOML configuration file
+
+[package]
+name = "my-project"
+version = "1.0.0"
+edition = "2021"
+
+[dependencies]
+serde = { version = "1.0", features = ["derive"] }
+tokio = "1.0"
+
+[[bin]]
+name = "my-app"
+path = "src/main.rs"
+
+[profile.release]
+lto = true
+opt-level = 3
+"#;
+
+        let chunks = parse_file_internal("Cargo.toml", content).unwrap();
+        assert!(!chunks.is_empty(), "Should have chunks for TOML");
+
+        let has_table = chunks.iter().any(|c| c.chunk_type == "table");
+        let has_table_array = chunks.iter().any(|c| c.chunk_type == "table_array_element");
+        assert!(
+            has_table || has_table_array,
+            "Should find table or table_array_element"
+        );
+    }
+
+    #[test]
+    fn test_parse_yaml() {
+        let content = r#"
+# Kubernetes deployment config
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: my-app
+  labels:
+    app: my-app
+spec:
+  replicas: 3
+  selector:
+    matchLabels:
+      app: my-app
+  template:
+    metadata:
+      labels:
+        app: my-app
+    spec:
+      containers:
+        - name: my-app
+          image: my-app:latest
+          ports:
+            - containerPort: 8080
+"#;
+
+        let chunks = parse_file_internal("deployment.yaml", content).unwrap();
+        assert!(!chunks.is_empty(), "Should have chunks for YAML");
+    }
+
+    #[test]
+    fn test_parse_markdown_fallback() {
+        let content = r#"
+# My Project
+
+This is a **markdown** file with various content.
+
+## Installation
+
+```bash
+npm install my-project
+```
+
+## Usage
+
+Here's how to use the library:
+
+```typescript
+import { myFunction } from 'my-project';
+myFunction();
+```
+
+## Contributing
+
+Please read CONTRIBUTING.md for details.
+"#;
+
+        let chunks = parse_file_internal("README.md", content).unwrap();
+        // Markdown falls back to line-based chunking
+        assert!(!chunks.is_empty(), "Should have chunks for Markdown");
+        // Should be block type since we use line-based chunking
+        let has_block = chunks.iter().any(|c| c.chunk_type == "block");
+        assert!(has_block, "Markdown should use block chunking");
+    }
 }
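
Reviewer note: the hunks above add TOML and YAML arms to the grammar selection and to `is_semantic_node`, and leave every other language on the `chunk_by_lines` fallback. The dependency-free sketch below isolates just that classification logic so it can be exercised without tree-sitter. The node-type strings are copied from the diff. The trimmed-down `Language` enum and the extension mapping in `language_from_path` are assumptions for illustration; the diff does not show how the project detects the language.

```rust
use std::path::Path;

/// Reduced stand-in for the project's `Language` enum (assumption: the real
/// enum has many more variants; only the ones touched by this commit appear).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Toml,
    Yaml,
    /// Anything without a grammar; the real parser falls back to line chunks.
    Unknown,
}

/// Hypothetical extension-based detection; not shown in the diff.
fn language_from_path(path: &str) -> Language {
    match Path::new(path).extension().and_then(|ext| ext.to_str()) {
        Some("toml") => Language::Toml,
        Some("yaml") | Some("yml") => Language::Yaml,
        _ => Language::Unknown,
    }
}

/// Mirrors the match arms added to `is_semantic_node` in this commit.
fn is_semantic_node(node_type: &str, language: Language) -> bool {
    match language {
        Language::Toml => matches!(node_type, "table" | "table_array_element"),
        Language::Yaml => matches!(node_type, "block_mapping_pair" | "block_sequence"),
        Language::Unknown => false,
    }
}
```

Under these assumptions, `Cargo.toml` maps to `Language::Toml`, where a `table` node counts as semantic, while `README.md` maps to `Language::Unknown`, which matches the Markdown test's expectation of line-based `block` chunks.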
