This document outlines the development plan for a Tree-sitter based code compressor. The goal is to extract key structural information from source code, such as imports, package definitions, function/method/class signatures, and comments, while omitting detailed implementation bodies.
- Identify language from file extension.
- Parse source code into an AST using Tree-sitter.
- Implement basic Tree-sitter query for Go (imports, functions, methods, types).
Status: Completed.
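The language identification step above can be sketched as a simple extension lookup. This is a minimal illustration, not the project's actual code: the `DetectLanguage` function, the `extToLang` table, and the choice of extensions are all assumptions; the grammar names follow the language list later in this document.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extToLang maps lower-case file extensions to grammar names.
// Hypothetical table: the entries are illustrative, not taken from the project.
var extToLang = map[string]string{
	".go":  "golang",
	".py":  "python",
	".js":  "javascript",
	".mjs": "javascript",
	".ts":  "typescript",
}

// DetectLanguage returns the grammar name for path and whether it is known.
func DetectLanguage(path string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(path))
	lang, ok := extToLang[ext]
	return lang, ok
}

func main() {
	for _, p := range []string{"cmd/main.go", "tool.PY", "README"} {
		lang, ok := DetectLanguage(p)
		fmt.Printf("%s -> %q (known: %v)\n", p, lang, ok)
	}
}
```

Returning a boolean alongside the name lets the caller decide whether an unknown extension is an error or simply a file to skip.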
- Define `CodeChunk` struct (Note: the `OriginalLine` field is a placeholder; actual line mapping is not yet implemented, see Phase 6).
- Implement `GenericCompressor` with `Compress` method.
- Execute queries and process captures from matches.
- Implement logic to strip function/method bodies and retain signatures:
  - Go
  - Python
  - JavaScript (for named functions, classes, methods, including exported and default exported named variants)
- Handle different types of captures (imports, packages, type definitions, comments).
- Sort and combine extracted code chunks.
Status: Largely completed. Core compression logic is functional for Go, Python, and key JavaScript constructs. All `internal/compressor` tests are passing.
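The body-stripping and sort-and-combine steps above can be sketched with plain byte offsets. This is a minimal sketch under stated assumptions: only the `CodeChunk` and `OriginalLine` names come from this plan; the other fields, `StripBody`, and `Combine` are hypothetical. In the real compressor the offsets would come from Tree-sitter captures for the declaration and its body rather than from string searches.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// CodeChunk is a hypothetical shape for one extracted piece of source.
// Only the OriginalLine field name comes from the plan; the rest is assumed.
type CodeChunk struct {
	Content      string
	StartByte    int
	OriginalLine int // placeholder, as noted in the plan
}

// StripBody keeps the text from the start of a declaration up to the start
// of its body (the signature) and trims trailing whitespace.
func StripBody(src string, declStart, bodyStart int) string {
	return strings.TrimRight(src[declStart:bodyStart], " \t\n")
}

// Combine orders chunks by their position in the source and joins them
// with newlines.
func Combine(chunks []CodeChunk) string {
	sort.SliceStable(chunks, func(i, j int) bool {
		return chunks[i].StartByte < chunks[j].StartByte
	})
	parts := make([]string, 0, len(chunks))
	for _, c := range chunks {
		parts = append(parts, c.Content)
	}
	return strings.Join(parts, "\n")
}

func main() {
	src := "package demo\n\nfunc Add(a, b int) int {\n\treturn a + b\n}\n"
	// Stand-ins for capture offsets that a query match would provide.
	declStart := strings.Index(src, "func")
	bodyStart := strings.Index(src, "{")

	chunks := []CodeChunk{
		{Content: StripBody(src, declStart, bodyStart), StartByte: declStart},
		{Content: "package demo", StartByte: 0},
	}
	fmt.Println(Combine(chunks))
}
```

Sorting by start offset keeps the compressed output in source order even when captures arrive grouped by kind (for example, all imports before all functions).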
- Go:
  - Query for package, imports, type definitions, function declarations, method declarations, comments.
- Python:
  - Query for imports, function definitions, class definitions, comments.
- JavaScript:
  - Query for imports, comments, method definitions.
  - Query and processing logic for:
    - Exported named functions/classes/generators (e.g., `export function foo() {}`).
    - Default exported named functions/classes/generators (e.g., `export default function foo() {}`).
    - Standalone named functions/classes/generators.
  - Outstanding/Needs Refinement for JavaScript:
    - Arrow Functions: Implement body stripping for arrow functions assigned to variables (e.g., `const myArrow = () => { /* body */ };`). The test file `example.js` includes `const myArrowFunc = (a, b) => { /* Arrow function body */ ... }`, which is currently not processed for body stripping.
    - Anonymous Default Exports: Clarify and potentially implement body stripping for anonymous default exported functions and classes (e.g., `export default function() { /* body */ }`). Currently, these are captured by `@export.other` and kept whole. The test file `example.js` includes `export default function() { console.log("Anon default func"); }` and `export default class { constructor() { this.x = 1;} }`, implying these might need stripping.
    - Review other common JS constructs (e.g., object methods not part of class syntax, IIFEs) if they need specific handling.
- Other Languages:
  - Plan for and implement queries for other languages as needed (e.g., TypeScript, Java, C++).
Status: Go and Python are well-covered by current tests. JavaScript has significantly improved and handles many common cases, with tests passing for implemented features. Specific outstanding items for JavaScript are noted above.
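For orientation, per-language queries along the lines described above might be organised as a lookup table of Tree-sitter query strings. This is a sketch, not the project's actual queries: the capture names (`@function`, `@body`, `@arrow`, and so on), the `queries` map, and `QueryFor` are hypothetical, and the node type names reflect my understanding of the tree-sitter-go, tree-sitter-python, and tree-sitter-javascript grammars, so they should be checked against each grammar before use. The JavaScript entry shows one possible pattern for the outstanding arrow-function case.

```go
package main

import "fmt"

// queries maps a grammar name to a Tree-sitter query in S-expression syntax.
// Capture names are hypothetical; node types should be verified against
// the grammar version actually in use.
var queries = map[string]string{
	"golang": `
(package_clause) @package
(import_declaration) @import
(type_declaration) @type
(function_declaration body: (block) @body) @function
(method_declaration body: (block) @body) @method
(comment) @comment
`,
	"python": `
(import_statement) @import
(import_from_statement) @import
(function_definition body: (block) @body) @function
(class_definition) @class
(comment) @comment
`,
	// Candidate pattern for arrow functions assigned to variables,
	// e.g. const myArrow = () => { /* body */ };
	"javascript": `
(lexical_declaration
  (variable_declarator
    name: (identifier) @name
    value: (arrow_function body: (statement_block) @body))) @arrow
`,
}

// QueryFor returns the query text registered for lang, if any.
func QueryFor(lang string) (string, bool) {
	q, ok := queries[lang]
	return q, ok
}

func main() {
	for _, lang := range []string{"golang", "python", "javascript", "ruby"} {
		_, ok := QueryFor(lang)
		fmt.Printf("%-10s query defined: %v\n", lang, ok)
	}
}
```

Capturing the body as its own node (`@body`) alongside the whole declaration is what makes signature-only output possible: the text between the declaration's start and the body's start is the part to keep.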
- Write initial unit tests for Go, Python, JavaScript compressor logic. (Current tests for `internal/compressor` are passing.)
- Expand test coverage with more complex real-world examples and edge cases for all supported languages, particularly for JavaScript features listed as outstanding in Phase 3.
- Test interactions between different JavaScript export/import syntaxes.
Status: Basic unit tests are in place and passing. Further testing is required, especially for JavaScript refinements.
- Integrate the `GenericCompressor` into `main.go`.
- Add CLI flags for language selection, input file/directory, output options.
- Handle file system traversal for multiple files.
Status: Not started.
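Since this phase has not started, the following is only one possible shape for the CLI, built on the standard `flag` and `path/filepath` packages. The flag names (`-lang`, `-input`, `-output`), the `Options` struct, and both helper functions are assumptions for illustration; wiring in `GenericCompressor` is left out.

```go
package main

import (
	"flag"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// Options holds hypothetical CLI settings; the flag names are illustrative.
type Options struct {
	Lang   string // grammar name; empty means detect from extension
	Input  string // file or directory to compress
	Output string // output file; empty means stdout
}

// ParseOptions parses args (without the program name) into Options.
func ParseOptions(args []string) (Options, error) {
	var o Options
	fl := flag.NewFlagSet("compress", flag.ContinueOnError)
	fl.StringVar(&o.Lang, "lang", "", "language (default: detect from extension)")
	fl.StringVar(&o.Input, "input", ".", "input file or directory")
	fl.StringVar(&o.Output, "output", "", "output file (default: stdout)")
	err := fl.Parse(args)
	return o, err
}

// CollectFiles returns every non-directory entry under root.
// If root is a single file, the result contains just that file.
func CollectFiles(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	opts, err := ParseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	files, err := CollectFiles(opts.Input)
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
	for _, f := range files {
		fmt.Println(f) // the compressor would be invoked per file here
	}
}
```

Using a dedicated `FlagSet` rather than the package-level flags keeps option parsing testable without touching `os.Args`.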
- Line Number Mapping: Fully implement `OriginalLine` in `CodeChunk` to map compressed chunks back to their original line numbers.
- Contextual Compression: Explore options for keeping more context if needed (e.g., call sites of a function, specific variable assignments).
- Configuration: Allow users to customize what to extract/omit via configuration files or advanced CLI options.
- Performance Optimization: Profile and optimize parsing and query execution for large codebases.
Status: Not started.
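As an illustration of the line number mapping item, one way to turn a chunk's byte offset into a 1-based line number is to precompute where each line starts and binary-search that table. The `LineIndex` type and its methods are hypothetical names, not part of the existing code. Tree-sitter nodes also carry row/column positions, so the real implementation may be able to read the line from the node directly instead.

```go
package main

import (
	"fmt"
	"sort"
)

// LineIndex records the byte offset at which each line of a source starts.
type LineIndex struct {
	starts []int
}

// NewLineIndex scans src once and remembers every line start.
func NewLineIndex(src []byte) *LineIndex {
	starts := []int{0} // line 1 always starts at offset 0
	for i, b := range src {
		if b == '\n' {
			starts = append(starts, i+1)
		}
	}
	return &LineIndex{starts: starts}
}

// LineOf returns the 1-based line number containing byte offset off.
func (li *LineIndex) LineOf(off int) int {
	// The number of line starts at or before off is the 1-based line number;
	// sort.Search finds the first start strictly greater than off.
	return sort.Search(len(li.starts), func(i int) bool {
		return li.starts[i] > off
	})
}

func main() {
	src := []byte("package demo\n\nfunc Add(a, b int) int {\n\treturn a + b\n}\n")
	idx := NewLineIndex(src)
	for _, off := range []int{0, 14, 40} {
		fmt.Printf("offset %d is on line %d\n", off, idx.LineOf(off))
	}
}
```

Building the index once per file keeps each lookup logarithmic in the number of lines, which matters if many chunks are mapped back per file.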
Languages supported by smacker/go-tree-sitter:
bash
c
cpp
csharp
css
cue
dockerfile
elixir
elm
golang
groovy
hcl
html
java
javascript
kotlin
lua
markdown
ocaml
php
protobuf
python
ruby
rust
scala
sql
svelte
swift
toml
typescript
yaml
The most important languages to support initially (if any work is needed to enable them at all) are:
- Go
- Python
- JavaScript
- TypeScript
- bash
- rust
- swift
- toml
- yaml
- css
- c
- html
- sql