fix: detect mtime changes in _get_collection to prevent stale HNSW index #663
jphein wants to merge 2 commits into MemPalace:develop
Pull request overview
This PR addresses stale ChromaDB in-memory HNSW indexes in the MCP server when chroma.sqlite3 is modified by external processes (CLI mining, scripts) by detecting on-disk changes and forcing a reconnect, and adds an explicit MCP maintenance tool to manually flush caches.
Changes:
- Add inode + mtime tracking for `chroma.sqlite3` to invalidate the cached collection when the underlying DB changes.
- Add a new MCP tool `mempalace_reconnect` to force cache invalidation and reconnect.
- Update the module header docs to list the new maintenance tool.
        current_inode = st.st_ino
        current_mtime = st.st_mtime
    except OSError:
        current_inode = 0
        current_mtime = 0.0

    inode_changed = current_inode and current_inode != _palace_db_inode
    mtime_changed = current_mtime and abs(current_mtime - _palace_db_mtime) > 0.01

    if inode_changed or mtime_changed:
If chroma.sqlite3 disappears (or is temporarily missing during a rebuild), this code sets current_inode/current_mtime to 0 and then does not invalidate the existing _collection_cache because both inode_changed and mtime_changed become false. That can leave the server holding a cached collection pointing at a DB file that no longer exists (or has been replaced later). Consider treating an os.stat() failure as a change when a cache exists (e.g., invalidate when _palace_db_mtime/_palace_db_inode were previously non-zero, or track existence explicitly).
    -     current_inode = st.st_ino
    -     current_mtime = st.st_mtime
    - except OSError:
    -     current_inode = 0
    -     current_mtime = 0.0
    - inode_changed = current_inode and current_inode != _palace_db_inode
    - mtime_changed = current_mtime and abs(current_mtime - _palace_db_mtime) > 0.01
    - if inode_changed or mtime_changed:
    +     stat_ok = True
    +     current_inode = st.st_ino
    +     current_mtime = st.st_mtime
    + except OSError:
    +     stat_ok = False
    +     current_inode = 0
    +     current_mtime = 0.0
    + inode_changed = current_inode and current_inode != _palace_db_inode
    + mtime_changed = current_mtime and abs(current_mtime - _palace_db_mtime) > 0.01
    + db_missing_changed = (
    +     not stat_ok
    +     and (
    +         _collection_cache is not None
    +         or _palace_db_inode != 0
    +         or _palace_db_mtime != 0.0
    +     )
    + )
    + if inode_changed or mtime_changed or db_missing_changed:
Fixed — added explicit guard: when chroma.sqlite3 disappears and a cached collection exists, cache is cleared and inode/mtime globals reset to 0.
    _palace_db_inode = current_inode
    _palace_db_mtime = current_mtime
_palace_db_inode/_palace_db_mtime are updated from the pre-connection stat values. When create=True and chroma.sqlite3 doesn’t exist yet, current_inode/current_mtime will be 0, so the cache snapshot is wrong and can force an unnecessary reconnect on the next call. Also, if the DB changes between the stat and the successful reconnect, the stored snapshot may not match the actual opened DB. Restat after successfully creating/opening the collection (or otherwise capture the post-connect inode/mtime) before updating the globals.
    - _palace_db_inode = current_inode
    - _palace_db_mtime = current_mtime
    + try:
    +     st = os.stat(db_path)
    +     _palace_db_inode = st.st_ino
    +     _palace_db_mtime = st.st_mtime
    + except OSError:
    +     _palace_db_inode = 0
    +     _palace_db_mtime = 0.0
Documented — added comment explaining the benign race. Worst case is one extra reconnect on next call, which self-corrects.
    _palace_db_mtime = 0.0
    try:
        col = _get_collection()
        count = col.count() if col else 0
tool_reconnect() currently reports success even when no collection could be opened (col is None), returning drawers=0. This makes it hard for callers to distinguish “reconnected” from “no palace / failed to connect”. Consider returning _no_palace() or success=False when _get_collection() returns None, and only reporting success after a real connection.
    -     count = col.count() if col else 0
    +     if not col:
    +         return _no_palace()
    +     count = col.count()
Fixed — tool_reconnect() now returns {"success": false, "message": "No palace found after reconnect"} when _get_collection() returns None.
    "mempalace_reconnect": {
        "description": (
            "Force reconnect to the palace database. Use after external scripts or CLI commands"
            " modified the palace directly, which can leave the in-memory HNSW index stale."
        ),
        "input_schema": {
            "type": "object",
            "properties": {},
        },
        "handler": tool_reconnect,
    },
New behavior (mtime/inode-based invalidation + mempalace_reconnect tool) isn’t covered by the existing MCP server tests. Since tests already exercise tools/list and tools/call dispatch, it would be good to add regression tests that (1) touch/modify chroma.sqlite3 and assert _get_collection() returns a new instance and (2) verify mempalace_reconnect appears in tools/list and clears the cache.
Fixed — added 5 tests in TestCacheInvalidation: mtime change, inode change, missing DB, reconnect failure, and reconnect success.
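A regression test for this kind of invalidation needs a deterministic way to provoke each trigger. The sketch below shows one possible pair of helpers; `bump_mtime` and `replace_file` are hypothetical names, not functions from the PR's `TestCacheInvalidation` suite, and the 5-second offset is an arbitrary value chosen to sit well above the 0.01 s tolerance.

```python
import os
import tempfile


def bump_mtime(path: str, seconds: float = 5.0) -> None:
    """Push the file's mtime forward without touching its contents or inode."""
    st = os.stat(path)
    os.utime(path, (st.st_atime, st.st_mtime + seconds))


def replace_file(path: str, data: bytes) -> None:
    """Swap in a freshly written file, so the path points at a new inode."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    # The new file exists alongside the old one until this call,
    # so the two cannot share an inode number.
    os.replace(tmp_path, path)
```

Setting the timestamp explicitly with `os.utime` avoids sleeping in tests and does not depend on how coarse the filesystem clock is.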
… tests Addresses Copilot review feedback on MemPalace#663. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_get_collection() cached the ChromaDB collection wrapper but only invalidated on process restart. When external tools (CLI mining, scripts) wrote to the palace database in-place, the in-memory HNSW index became stale: vector searches returned stale results until the MCP server process restarted.

Fix: check both inode and mtime of chroma.sqlite3 on every _get_collection() call. Inode changes catch full rebuilds (repair/nuke); mtime changes catch in-place writes. Epsilon comparison (0.01s) avoids spurious reconnects from filesystem timestamp granularity.

Also adds mempalace_reconnect MCP tool for manual cache flush, useful after metadata-only updates (col.update()) that may not reliably change mtime.
Force-pushed: f8adfea to 59e0449
…757)

When external tools write to the palace database (CLI mining, scripts), the MCP server's cached ChromaDB collection becomes stale: its HNSW index doesn't know about new vectors. Develop already invalidates on inode changes (catches rebuilds) but not on mtime changes (misses in-place writes).

This PR:
- Adds st_mtime tracking alongside st_ino in _get_client; invalidates the cached client on either change.
- Adds the mempalace_reconnect MCP tool for explicit cache flush.

Original author: @jphein (#663). Original approval: @Ari4ka.

Skips test_missing_db_invalidates_cache on Windows (ChromaDB holds chroma.sqlite3 open).
Landed on `develop` as #757 (merge `e200ce2`) — cherry-picked your two commits (authorship preserved) plus one CI-fix commit of mine:
Closing this one; thanks for the thorough write-up comparing against #625. Heads up: per RFC 001 (#743) §2.6, this freshness logic will migrate into `ChromaBackend.get_collection()` / `ChromaBackend.close_palace()` during the §10 cleanup. I'll add #757 to the §11 in-flight table so it's tracked during that migration.
…ed var)

Post-v3.2.0 merge cleanup:
- mcp_server.py: remove duplicate tool_reconnect (keep ours with broader cache clear + add upstream's None guard)
- mcp_server.py: remove duplicate mempalace_reconnect TOOLS entry
- miner.py: remove unused effective_min variable
- CLAUDE.md: update version to 3.2.0, mark MemPalace#663 as closed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Problem
When external tools write to the palace database (CLI mining, scripts), the
MCP server's cached ChromaDB collection becomes stale — its in-memory HNSW
index doesn't know about new vectors. Searches return incomplete results until
the MCP server process restarts.
The previous `_get_collection()` only invalidated the cache on process restart.
It had no mechanism to detect that the on-disk database changed.
Solution
1. mtime-based stale index detection
Track both the inode and mtime of `chroma.sqlite3` in module-level globals.
On every `_get_collection()` call, stat the file and compare:
- Inode change: catches full rebuilds (repair/nuke) that replace the file entirely
- mtime change: catches in-place writes from processes that append to the existing database without replacing it
Epsilon comparison (`abs(current - cached) > 0.01`) avoids spurious
reconnects from filesystem timestamp rounding.
FAT/exFAT filesystems return `st_ino == 0`; the `current_inode != 0` guard
safely skips inode detection on those filesystems.
2. `mempalace_reconnect` MCP tool
Explicit cache flush for cases where automatic detection is insufficient,
particularly metadata-only `col.update()` calls, which may not reliably
change mtime. Also useful in tests and scripts that need a guaranteed-fresh
connection.
Relation to #625
PR #625 (yukinoli) addresses the same stale-index problem with a different
technique: SQL `COUNT(*)` on the embeddings table to detect new rows since
last cache time.
This PR uses filesystem stat instead. The two approaches have different
trade-offs:
`os.stat()`: ~1µs

For most workloads the mtime approach has lower overhead and broader
coverage. The `mempalace_reconnect` tool handles the remaining edge cases
(metadata-only changes that don't touch mtime reliably).
Both approaches are valid; the maintainers can choose whichever fits the
project's direction.
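One way to see the coverage difference is a toy SQLite database. The sketch below assumes a made-up two-column `embeddings` table, not ChromaDB's real schema, and is not the code from #625; it only shows that a row-count probe notices inserts but not in-place updates.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def count_rows(db_path: str, table: str = "embeddings") -> int:
    """COUNT(*)-style probe: opens the database and counts rows."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def file_mtime(db_path: str) -> float:
    """Stat-style probe: one os.stat() call, no database connection."""
    return os.stat(db_path).st_mtime


def run_sql(db_path: str, sql: str, params: tuple = ()) -> None:
    """Run one statement and commit it."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success
            conn.execute(sql, params)


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "toy.sqlite3")
run_sql(db_path, "CREATE TABLE embeddings (id INTEGER PRIMARY KEY, note TEXT)")
run_sql(db_path, "INSERT INTO embeddings (note) VALUES (?)", ("first",))
count_before = count_rows(db_path)

# An in-place edit leaves the row count unchanged.
run_sql(db_path, "UPDATE embeddings SET note = ? WHERE id = 1", ("edited",))
count_after_update = count_rows(db_path)

# A new row is what the count probe is able to notice.
run_sql(db_path, "INSERT INTO embeddings (note) VALUES (?)", ("second",))
count_after_insert = count_rows(db_path)
```

Here `count_before` and `count_after_update` are equal, so a count-based check would treat the edited database as unchanged, while `count_after_insert` is one higher.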
Changes
`mempalace/mcp_server.py`
- `_palace_db_inode`, `_palace_db_mtime` module-level globals
- `_get_collection()`: stat `chroma.sqlite3` and invalidate cache on inode or mtime change; update stored values after successful reconnect
- `tool_reconnect()`: clear all caches and force a fresh `_get_collection()`
- `TOOLS["mempalace_reconnect"]`: expose as MCP tool
Testing
- `ruff check` passes (line length 100)
- … (`python -m pytest tests/ -x -q`)
- … without server restart