FileSynchronizer tracks binary files — merkle DAG bypasses supportedExtensions filter #286

@giovanecesar

Summary

FileSynchronizer.generateFileHashes() hashes all non-ignored files regardless of extension. It does not receive or use supportedExtensions, so the merkle DAG tracks binary files (PDFs, PNGs, tar.gz, etc.). During reindexByChange(), changed binary files are passed to the embedding provider as utf-8 text.

Root cause

Two independent file-traversal systems with different filters:

| Component | Filters by extension? | Result |
| --- | --- | --- |
| Context.getCodeFiles() | Yes: checks supportedExtensions.includes(ext) | Correct: only indexes supported files |
| FileSynchronizer.generateFileHashes() | No: only checks ignorePatterns | Bug: hashes PDFs, images, archives |

The FileSynchronizer constructor accepts ignorePatterns but has no concept of supportedExtensions. Since DEFAULT_IGNORE_PATTERNS doesn't block .pdf, .png, .tar.gz, etc., those files get tracked.

Impact

  1. Binary content sent to embedding provider: reindexByChange() detects "changes" to binary files, reads them as utf-8 (readFile(filePath, 'utf-8')), and sends the garbled content to the embedding API. This wastes tokens and can pollute the vector space.
  2. Wasted I/O on every sync cycle: the merkle DAG hashes binary files on every checkForChanges() call, even though they'll never be indexed.
  3. Privacy risk: sensitive binary files (e.g., confidential PDFs) get hashed and their paths stored in ~/.context/merkle/ snapshots.
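
To illustrate the first impact, here is a minimal sketch of what lossy utf-8 decoding does to binary bytes. It uses the standard TextDecoder as a stand-in for readFile(filePath, 'utf-8'); that substitution is an assumption for the sake of a self-contained example, and the helper names decodeLossy and countReplacementChars are hypothetical. The input is the well-known 8-byte PNG file signature.

```typescript
// PNG file signature. The first byte (0x89) is not a valid UTF-8 lead byte.
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Non-fatal decoding (the default) substitutes U+FFFD for invalid bytes
// instead of throwing, so the caller never learns the input was binary.
function decodeLossy(bytes: Uint8Array): string {
    return new TextDecoder("utf-8").decode(bytes);
}

// Count how many replacement characters the decoder had to insert.
function countReplacementChars(text: string): number {
    return text.split("\uFFFD").length - 1;
}

const decoded = decodeLossy(pngSignature);
console.log(JSON.stringify(decoded));        // escaped view of the decoded string
console.log(countReplacementChars(decoded)); // prints 1
```

Any replacement character in the decoded text means bytes were lost, so the string handed to the embedding API no longer represents the file.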

Reproduction

  1. Create or use a codebase containing PDF files not listed in .gitignore
  2. Index the codebase with index_codebase
  3. Inspect the merkle snapshot:
    jq -r '.fileHashes[][0]' ~/.context/merkle/<hash>.json | grep -E '\.(pdf|png|jpg|zip|tar|gz)$'
  4. Observe binary files are tracked despite not being in supportedExtensions
  5. Modify a tracked binary file and trigger a sync — embedding provider receives garbled utf-8 content
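
Step 3 can also be done programmatically. The sketch below mirrors the jq filter in TypeScript. The MerkleSnapshot shape (a fileHashes array of [path, hash] pairs) is inferred only from that jq expression, and findTrackedBinaries and the sample data are hypothetical.

```typescript
// Assumed snapshot shape, inferred from the jq filter `.fileHashes[][0]`.
interface MerkleSnapshot {
    fileHashes: [string, string][];
}

// Same extension list as the grep pattern in the reproduction steps.
const BINARY_EXTENSION = /\.(pdf|png|jpg|zip|tar|gz)$/;

// Return the tracked paths whose extension marks them as binary.
function findTrackedBinaries(snapshot: MerkleSnapshot): string[] {
    return snapshot.fileHashes
        .map(([filePath]) => filePath)
        .filter((filePath) => BINARY_EXTENSION.test(filePath));
}

// Hypothetical sample standing in for a parsed snapshot file.
const sample: MerkleSnapshot = {
    fileHashes: [
        ["src/index.ts", "a1"],
        ["docs/spec.pdf", "b2"],
        ["assets/logo.png", "c3"],
    ],
};

console.log(findTrackedBinaries(sample));
```

Against a real snapshot, the sample object would be replaced by the result of JSON.parse on the snapshot file's contents. A non-empty result confirms the bug.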

Suggested fix

Pass supportedExtensions to FileSynchronizer and filter in generateFileHashes():

// In FileSynchronizer constructor
constructor(rootDir: string, ignorePatterns: string[] = [], supportedExtensions: string[] = []) {
    // ...
    this.supportedExtensions = supportedExtensions;
}

// In generateFileHashes(), after isFile() check
} else if (stat.isFile()) {
    const ext = path.extname(entry.name);
    if (this.supportedExtensions.length > 0 && !this.supportedExtensions.includes(ext)) {
        continue; // Skip unsupported extensions
    }
    // ... existing hash logic
}

Update callers in Context.indexCodebase() and Context.reindexByChange() to pass this.supportedExtensions when constructing FileSynchronizer.
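
The extension check from the suggested fix can be isolated and tested on its own. This is a sketch, not the project's code: shouldTrack is a hypothetical helper that reproduces only the condition proposed above, including the fallback of tracking everything when no extensions are configured.

```typescript
import * as path from "path";

// Hypothetical helper mirroring the proposed check in generateFileHashes().
// An empty supportedExtensions list keeps the current behavior (track all files).
function shouldTrack(fileName: string, supportedExtensions: string[]): boolean {
    if (supportedExtensions.length === 0) {
        return true;
    }
    return supportedExtensions.includes(path.extname(fileName));
}

// Hypothetical configuration and directory listing.
const supported = [".ts", ".py"];
const entries = ["main.ts", "util.py", "spec.pdf", "logo.png", "backup.tar.gz"];

console.log(entries.filter((name) => shouldTrack(name, supported)));
```

Note that path.extname() returns only the last extension, so backup.tar.gz is compared as .gz. The comparison is also case-sensitive as written. Whether extensions should be lowercased first depends on how Context.getCodeFiles() treats them, which this issue does not specify.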

Environment

  • @zilliz/claude-context-mcp@latest
  • Local Milvus
  • Local embedding provider
  • macOS
