feat: add TOML and YAML parsing support with tree-sitter

Helweg · Helweg · commit 1c58f2656a1b · 2026-01-17T21:49:13.000+01:00
diff --git a/AGENTS.md b/AGENTS.md
@@ -47,7 +47,7 @@ src/
 
 native/src/
 ├── lib.rs                # NAPI exports: parse_file, VectorStore, Database, InvertedIndex
-├── parser.rs             # Tree-sitter parsing (12 languages: TS, JS, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON)
+├── parser.rs             # Tree-sitter parsing (14 languages: TS, JS, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON, TOML, YAML)
 ├── chunker.rs            # Semantic chunking with overlap
 ├── store.rs              # usearch vector store (F16 quantization)
 ├── db.rs                 # SQLite: embeddings, chunks, branch catalog
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,122 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.3.0] - 2025-01-16
+
+### Added
+- **Language support**: Java, C#, Ruby, Bash, C, and C++ parsing via tree-sitter
+- **CI improvements**: Rust caching, `cargo fmt --check`, `cargo clippy`, and `cargo test` in workflows
+- **/status command**: Check index health and provider info
+- **Batch operations**: High-performance bulk inserts for embeddings and chunks (~10-18x speedup)
+- **Auto garbage collection**: Configurable automatic cleanup of orphaned embeddings/chunks
+- **Documentation**: ARCHITECTURE.md, TROUBLESHOOTING.md, comprehensive AGENTS.md
+
+### Changed
+- Upgraded tree-sitter from 0.20 to 0.24 (new LANGUAGE constant API)
+- Optimized `embedBatch` for Google and Ollama providers with `Promise.all`
+- Enhanced skill documentation with filter examples
+
+### Fixed
+- Node version consistency in publish workflow (Node 24 → Node 22)
+- Clippy warnings in Rust code
+
+## [0.2.1] - 2025-01-10
+
+### Fixed
+- Rate limit handling and error messages
+- TypeScript errors in delta.ts
+
+## [0.2.0] - 2025-01-09
+
+### Added
+- **Branch-aware indexing**: Embeddings stored by content hash, branch catalog tracks membership
+- **SQLite storage**: Persistent storage for embeddings, chunks, and branch catalog
+- **Slash commands**: `/search`, `/find`, `/index`, `/status` registered via config hook
+- **Global config support**: `~/.config/opencode/codebase-index.json`
+- **Provider-specific rate limiting**: Ollama has no limits; GitHub Copilot has strict limits
+
+### Changed
+- Migrated from JSON file storage to SQLite database
+- Improved rate limit handling for GitHub Models API (15 req/min)
+
+## [0.1.11] - 2025-01-07
+
+### Added
+- Community standards: LICENSE, Code of Conduct, Contributing guide, Security policy, Issue templates
+
+### Fixed
+- Clippy warnings and TypeScript type errors
+
+## [0.1.10] - 2025-01-06
+
+### Added
+- **F16 quantization**: 50% memory reduction for vector storage
+- **Dead-letter queue**: Failed embedding batches are tracked for retry
+- **JSDoc/docstring extraction**: Comments included with semantic nodes
+- **Overlapping chunks**: Improved context continuity across chunk boundaries
+- **maxChunksPerFile config**: Control token costs for large files
+- **semanticOnly config**: Only index functions/classes, skip generic blocks
+
+### Changed
+- Moved inverted index from TypeScript to Rust native module (performance improvement)
+
+### Fixed
+- Use GitHub Models API for embeddings instead of Copilot API
+
+## [0.1.9] - 2025-01-05
+
+### Fixed
+- Use GitHub Models API for embeddings instead of Copilot API
+
+## [0.1.8] - 2025-01-04
+
+### Fixed
+- Only export default plugin to prevent OpenCode loader crash
+- Downgrade to zod v3 to match OpenCode SDK version
+
+## [0.1.3] - 2025-01-02
+
+### Changed
+- Use Node.js 24 for npm 11+ trusted publishing support
+- Externalize @opencode-ai/plugin to prevent runtime conflicts
+
+### Fixed
+- ESM output as main entry for Bun/OpenCode compatibility
+- Native binding loading in CJS context
+
+## [0.1.1] - 2025-01-01
+
+### Added
+- CI/CD workflows for testing and publishing
+- Comprehensive README with badges, diagrams, and examples
+
+### Fixed
+- NAPI configuration for OIDC trusted publishing
+
+## [0.1.0] - 2024-12-30
+
+### Added
+- **Initial release**
+- Semantic codebase indexing with tree-sitter parsing
+- Vector similarity search with usearch (HNSW algorithm)
+- Hybrid search combining semantic + BM25 keyword matching
+- Support for TypeScript, JavaScript, Python, Rust, Go, JSON
+- Multiple embedding providers: GitHub Copilot, OpenAI, Google, Ollama
+- Incremental indexing with file hash caching
+- File watcher for automatic re-indexing
+- OpenCode tools: `codebase_search`, `index_codebase`, `index_status`, `index_health_check`
+
+[0.3.0]: https://github.com/Helweg/opencode-codebase-index/compare/v0.2.1...v0.3.0
+[0.2.1]: https://github.com/Helweg/opencode-codebase-index/compare/v0.2.0...v0.2.1
+[0.2.0]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.11...v0.2.0
+[0.1.11]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.10...v0.1.11
+[0.1.10]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.9...v0.1.10
+[0.1.9]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.8...v0.1.9
+[0.1.8]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.3...v0.1.8
+[0.1.3]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.1...v0.1.3
+[0.1.1]: https://github.com/Helweg/opencode-codebase-index/compare/v0.1.0...v0.1.1
+[0.1.0]: https://github.com/Helweg/opencode-codebase-index/releases/tag/v0.1.0
diff --git a/README.md b/README.md
@@ -118,7 +118,7 @@ graph TD
 
 1. **Parsing**: We use `tree-sitter` to intelligently parse your code into meaningful blocks (functions, classes, interfaces). JSDoc comments and docstrings are automatically included with their associated code.
 
-**Supported Languages**: TypeScript, JavaScript, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON
+**Supported Languages**: TypeScript, JavaScript, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON, TOML, YAML
 2. **Chunking**: Large blocks are split with overlapping windows to preserve context across chunk boundaries.
 3. **Embedding**: These blocks are converted into vector representations using your configured AI provider.
 4. **Storage**: Embeddings are stored in SQLite (deduplicated by content hash) and vectors in `usearch` with F16 quantization for 50% memory savings. A branch catalog tracks which chunks exist on each branch.
diff --git a/native/Cargo.toml b/native/Cargo.toml
@@ -25,6 +25,9 @@ tree-sitter-ruby = "0.23"
 tree-sitter-bash = "0.23"
 tree-sitter-c = "0.23"
 tree-sitter-cpp = "0.23"
+tree-sitter-toml-ng = "0.7"
+tree-sitter-yaml = "0.7"
+tree-sitter-language = "0.1"
 
 usearch = "2.23"
 
diff --git a/native/src/parser.rs b/native/src/parser.rs
@@ -39,6 +39,8 @@ pub fn parse_file_internal(file_path: &str, content: &str) -> Result<Vec<CodeChu
         Language::Bash => tree_sitter_bash::LANGUAGE.into(),
         Language::C => tree_sitter_c::LANGUAGE.into(),
         Language::Cpp => tree_sitter_cpp::LANGUAGE.into(),
+        Language::Toml => tree_sitter_toml_ng::LANGUAGE.into(),
+        Language::Yaml => tree_sitter_yaml::LANGUAGE.into(),
         _ => return Ok(chunk_by_lines(content, &language)),
     };
 
@@ -211,6 +213,12 @@ fn is_comment_node(node_type: &str, language: &Language) -> bool {
         Language::C | Language::Cpp => {
             matches!(node_type, "comment")
         }
+        Language::Toml => {
+            matches!(node_type, "comment")
+        }
+        Language::Yaml => {
+            matches!(node_type, "comment")
+        }
         _ => false,
     }
 }
@@ -309,6 +317,12 @@ fn is_semantic_node(node_type: &str, language: &Language) -> bool {
                     | "template_declaration"
             )
         }
+        Language::Toml => {
+            matches!(node_type, "table" | "table_array_element")
+        }
+        Language::Yaml => {
+            matches!(node_type, "block_mapping_pair" | "block_sequence")
+        }
         _ => false,
     }
 }
@@ -771,4 +785,104 @@ T max(T a, T b) {
             "Should find class_specifier or namespace_definition"
         );
     }
+
+    #[test]
+    fn test_parse_toml() {
+        let content = r#"
+# This is a TOML configuration file
+
+[package]
+name = "my-project"
+version = "1.0.0"
+edition = "2021"
+
+[dependencies]
+serde = { version = "1.0", features = ["derive"] }
+tokio = "1.0"
+
+[[bin]]
+name = "my-app"
+path = "src/main.rs"
+
+[profile.release]
+lto = true
+opt-level = 3
+"#;
+
+        let chunks = parse_file_internal("Cargo.toml", content).unwrap();
+        assert!(!chunks.is_empty(), "Should have chunks for TOML");
+
+        let has_table = chunks.iter().any(|c| c.chunk_type == "table");
+        let has_table_array = chunks.iter().any(|c| c.chunk_type == "table_array_element");
+        assert!(
+            has_table || has_table_array,
+            "Should find table or table_array_element"
+        );
+    }
+
+    #[test]
+    fn test_parse_yaml() {
+        let content = r#"
+# Kubernetes deployment config
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: my-app
+  labels:
+    app: my-app
+spec:
+  replicas: 3
+  selector:
+    matchLabels:
+      app: my-app
+  template:
+    metadata:
+      labels:
+        app: my-app
+    spec:
+      containers:
+        - name: my-app
+          image: my-app:latest
+          ports:
+            - containerPort: 8080
+"#;
+
+        let chunks = parse_file_internal("deployment.yaml", content).unwrap();
+        assert!(!chunks.is_empty(), "Should have chunks for YAML");
+    }
+
+    #[test]
+    fn test_parse_markdown_fallback() {
+        let content = r#"
+# My Project
+
+This is a **markdown** file with various content.
+
+## Installation
+
+```bash
+npm install my-project
+```
+
+## Usage
+
+Here's how to use the library:
+
+```typescript
+import { myFunction } from 'my-project';
+myFunction();
+```
+
+## Contributing
+
+Please read CONTRIBUTING.md for details.
+"#;
+
+        let chunks = parse_file_internal("README.md", content).unwrap();
+        // Markdown falls back to line-based chunking
+        assert!(!chunks.is_empty(), "Should have chunks for Markdown");
+        // Should be block type since we use line-based chunking
+        let has_block = chunks.iter().any(|c| c.chunk_type == "block");
+        assert!(has_block, "Markdown should use block chunking");
+    }
 }