Summary
FileSynchronizer.generateFileHashes() hashes all non-ignored files regardless of extension. It does not receive or use supportedExtensions, so the merkle DAG tracks binary files (PDFs, PNGs, tar.gz, etc.). During reindexByChange(), changed binary files are passed to the embedding provider as utf-8 text.
Root cause
Two independent file-traversal systems with different filters:
| Component | Filters by extension? | Result |
| --- | --- | --- |
| Context.getCodeFiles() | Yes: checks supportedExtensions.includes(ext) | Correct: only indexes supported files |
| FileSynchronizer.generateFileHashes() | No: only checks ignorePatterns | Bug: hashes PDFs, images, archives |
FileSynchronizer constructor accepts ignorePatterns but has no concept of supportedExtensions. Since DEFAULT_IGNORE_PATTERNS doesn't block .pdf, .png, .tar.gz, etc., they get tracked.
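The mismatch can be illustrated with a small, dependency-free sketch. The two functions below are hypothetical stand-ins, not the project's code: indexedFiles() models the getCodeFiles() behaviour and hashedFiles() models the generateFileHashes() behaviour. The extension list, the simplified segment-based ignore check, and the local extname helper are all assumptions made for the example.

```typescript
// Hypothetical values, chosen only for illustration.
const supportedExtensions = [".ts", ".py", ".md"];
const ignorePatterns = ["node_modules", ".git"]; // simplified: exact match on a path segment

// Dependency-free approximation of path.extname for simple relative paths.
function extname(file: string): string {
  const base = file.slice(file.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot) : "";
}

function isIgnored(file: string): boolean {
  return file.split("/").some((segment) => ignorePatterns.includes(segment));
}

// Models getCodeFiles(): ignore patterns plus the extension allowlist.
function indexedFiles(files: string[]): string[] {
  return files.filter((f) => !isIgnored(f) && supportedExtensions.includes(extname(f)));
}

// Models generateFileHashes(): ignore patterns only.
function hashedFiles(files: string[]): string[] {
  return files.filter((f) => !isIgnored(f));
}

const files = ["src/app.ts", "docs/spec.pdf", "assets/logo.png", "node_modules/x/index.ts"];
const indexed = indexedFiles(files);

// Files that end up in the merkle DAG but would never be indexed.
const trackedButNeverIndexed = hashedFiles(files).filter((f) => !indexed.includes(f));
console.log(trackedButNeverIndexed); // [ 'docs/spec.pdf', 'assets/logo.png' ]
```

Any file in the second set is a candidate for the failure described under Impact.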
Impact
- Binary content sent to embedding provider: reindexByChange() detects "changes" to binary files, reads them as utf-8 (readFile(filePath, 'utf-8')), and sends garbled content to the embedding API. This wastes tokens and may pollute the vector space with meaningless embeddings.
- Wasted I/O on every sync cycle: the merkle DAG hashes binary files on every checkForChanges() call, even though they will never be indexed.
- Privacy risk: sensitive binary files (e.g., confidential PDFs) get hashed and their paths stored in ~/.context/merkle/ snapshots.
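To show why utf-8 decoding of binary data produces garbled text, here is a minimal sketch that decodes the eight-byte PNG signature. It uses the standard TextDecoder rather than readFile, on the assumption that both replace invalid byte sequences with U+FFFD when decoding non-fatally; it is an illustration, not code from the project.

```typescript
// The fixed eight-byte signature that starts every PNG file.
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Non-fatal utf-8 decoding substitutes U+FFFD for bytes that are not valid utf-8.
const decoded = new TextDecoder("utf-8").decode(pngSignature);

// 0x89 is a stray continuation byte, so the first character becomes U+FFFD.
const firstIsReplacement = decoded.charCodeAt(0) === 0xfffd;
console.log(firstIsReplacement, JSON.stringify(decoded.slice(1, 4)));
```

The original byte is lost in the decoded string, so text built this way carries no recoverable meaning for an embedding model.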
Reproduction
- Create or use a codebase containing PDF files not listed in .gitignore
- Index the codebase with index_codebase
- Inspect the merkle snapshot:
  jq -r '.fileHashes[][0]' ~/.context/merkle/<hash>.json | grep -E '\.(pdf|png|jpg|zip|tar|gz)$'
- Observe that binary files are tracked despite not being in supportedExtensions
- Modify a tracked binary file and trigger a sync; the embedding provider receives garbled utf-8 content
Suggested fix
Pass supportedExtensions to FileSynchronizer and filter in generateFileHashes():
// In FileSynchronizer: declare the field and accept it in the constructor
private supportedExtensions: string[];

constructor(rootDir: string, ignorePatterns: string[] = [], supportedExtensions: string[] = []) {
    // ...
    this.supportedExtensions = supportedExtensions;
}

// In generateFileHashes(), after the isFile() check
} else if (stat.isFile()) {
    const ext = path.extname(entry.name);
    // An empty list keeps the current behaviour (no extension filtering)
    if (this.supportedExtensions.length > 0 && !this.supportedExtensions.includes(ext)) {
        continue; // Skip unsupported extensions
    }
    // ... existing hash logic
}
Update callers in Context.indexCodebase() and Context.reindexByChange() to pass this.supportedExtensions when constructing FileSynchronizer.
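The filter logic can be exercised in isolation with a rough sketch like the one below. ExtensionFilter and shouldTrack() are hypothetical names used only for this example; the real change would live inside FileSynchronizer as described above.

```typescript
import * as path from "node:path";

// Hypothetical, trimmed-down stand-in for FileSynchronizer; only the filter is modelled.
class ExtensionFilter {
  private supportedExtensions: string[];

  constructor(supportedExtensions: string[] = []) {
    this.supportedExtensions = supportedExtensions;
  }

  // Returns true when a file should be hashed and tracked.
  shouldTrack(fileName: string): boolean {
    // An empty list preserves the old behaviour: track everything that is not ignored.
    if (this.supportedExtensions.length === 0) return true;
    return this.supportedExtensions.includes(path.extname(fileName));
  }
}

const filter = new ExtensionFilter([".ts", ".py"]);
const candidates = ["index.ts", "report.pdf", "backup.tar.gz", "Makefile"];
const tracked = candidates.filter((name) => filter.shouldTrack(name));
console.log(tracked); // [ 'index.ts' ]
```

Two things to keep in mind with this approach: path.extname() returns only the last suffix (".gz" for backup.tar.gz), and extensionless files such as Makefile yield an empty string, so they are skipped whenever the list is non-empty.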
Environment
- @zilliz/claude-context-mcp@latest
- Local Milvus
- Local embedding provider
- macOS