Skip to content

fix: detect mtime changes in _get_collection to prevent stale HNSW index#663

Closed
jphein wants to merge 2 commits intoMemPalace:developfrom
jphein:fix/mcp-mtime-reconnect-v2
Closed

fix: detect mtime changes in _get_collection to prevent stale HNSW index#663
jphein wants to merge 2 commits intoMemPalace:developfrom
jphein:fix/mcp-mtime-reconnect-v2

Conversation

@jphein
Copy link
Copy Markdown
Collaborator

@jphein jphein commented Apr 12, 2026

Problem

When external tools write to the palace database (CLI mining, scripts), the
MCP server's cached ChromaDB collection becomes stale — its in-memory HNSW
index doesn't know about new vectors. Searches return incomplete results until
the MCP server process restarts.

The previous _get_collection() only invalidated the cache on process restart.
It had no mechanism to detect that the on-disk database changed.

Solution

1. mtime-based stale index detection

Track both the inode and mtime of chroma.sqlite3 in module-level globals.
On every _get_collection() call, stat the file and compare:

  • Inode change — catches full palace rebuilds (repair/nuke/purge), which
    replace the file entirely
  • mtime change — catches in-place writes from CLI mining, scripts, or other
    processes that append to the existing database without replacing it

Epsilon comparison (abs(current - cached) > 0.01) avoids spurious
reconnects from filesystem timestamp rounding.

FAT/exFAT filesystems return st_ino == 0 — the current_inode != 0 guard
safely skips inode detection on those filesystems.

2. mempalace_reconnect MCP tool

Explicit cache flush for cases where automatic detection is insufficient —
particularly metadata-only col.update() calls, which may not reliably
change mtime. Also useful in tests and scripts that need a guaranteed-fresh
connection.

Relation to #625

PR #625 (yukinoli) addresses the same stale-index problem with a different
technique: SQL COUNT(*) on the embeddings table to detect new rows since
last cache time.

This PR uses filesystem stat instead. The two approaches have different
trade-offs:

This PR (mtime) #625 (SQL COUNT)
Detects new embeddings Yes (mtime changes) Yes (count changes)
Detects metadata-only updates Yes (mtime changes) No (count unchanged)
Detects deletes Yes (mtime changes) Only if count drops
Detects external script writes Yes Only if they add rows
Overhead per call os.stat() — ~1µs SQL query — ~1ms
Works on all ChromaDB versions Yes Depends on schema

For most workloads the mtime approach has lower overhead and broader
coverage. The mempalace_reconnect tool handles the remaining edge cases
(metadata-only changes that don't touch mtime reliably).

Both approaches are valid; the maintainers can choose whichever fits the
project's direction.

Changes

  • mempalace/mcp_server.py
    • _palace_db_inode, _palace_db_mtime module-level globals
    • _get_collection(): stat chroma.sqlite3 and invalidate cache on
      inode or mtime change; update stored values after successful reconnect
    • tool_reconnect(): clear all caches and force a fresh _get_collection()
    • TOOLS["mempalace_reconnect"]: expose as MCP tool

Testing

  • ruff check passes (line length 100)
  • Existing test suite passes (python -m pytest tests/ -x -q)
  • Manually verified: mine via CLI → MCP search immediately returns new results
    without server restart

Copilot AI review requested due to automatic review settings April 12, 2026 02:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses stale ChromaDB in-memory HNSW indexes in the MCP server when chroma.sqlite3 is modified by external processes (CLI mining, scripts) by detecting on-disk changes and forcing a reconnect, and adds an explicit MCP maintenance tool to manually flush caches.

Changes:

  • Add inode + mtime tracking for chroma.sqlite3 to invalidate the cached collection when the underlying DB changes.
  • Add a new MCP tool mempalace_reconnect to force cache invalidation and reconnect.
  • Update the module header docs to list the new maintenance tool.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +124 to +133
current_inode = st.st_ino
current_mtime = st.st_mtime
except OSError:
current_inode = 0
current_mtime = 0.0

inode_changed = current_inode and current_inode != _palace_db_inode
mtime_changed = current_mtime and abs(current_mtime - _palace_db_mtime) > 0.01

if inode_changed or mtime_changed:
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If chroma.sqlite3 disappears (or is temporarily missing during a rebuild), this code sets current_inode/current_mtime to 0 and then does not invalidate the existing _collection_cache because both inode_changed and mtime_changed become false. That can leave the server holding a cached collection pointing at a DB file that no longer exists (or has been replaced later). Consider treating an os.stat() failure as a change when a cache exists (e.g., invalidate when _palace_db_mtime/_palace_db_inode were previously non-zero, or track existence explicitly).

Suggested change
current_inode = st.st_ino
current_mtime = st.st_mtime
except OSError:
current_inode = 0
current_mtime = 0.0
inode_changed = current_inode and current_inode != _palace_db_inode
mtime_changed = current_mtime and abs(current_mtime - _palace_db_mtime) > 0.01
if inode_changed or mtime_changed:
stat_ok = True
current_inode = st.st_ino
current_mtime = st.st_mtime
except OSError:
stat_ok = False
current_inode = 0
current_mtime = 0.0
inode_changed = current_inode and current_inode != _palace_db_inode
mtime_changed = current_mtime and abs(current_mtime - _palace_db_mtime) > 0.01
db_missing_changed = (
not stat_ok
and (
_collection_cache is not None
or _palace_db_inode != 0
or _palace_db_mtime != 0.0
)
)
if inode_changed or mtime_changed or db_missing_changed:

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added explicit guard: when chroma.sqlite3 disappears and a cached collection exists, cache is cleared and inode/mtime globals reset to 0.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +143 to +144
_palace_db_inode = current_inode
_palace_db_mtime = current_mtime
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_palace_db_inode/_palace_db_mtime are updated from the pre-connection stat values. When create=True and chroma.sqlite3 doesn’t exist yet, current_inode/current_mtime will be 0, so the cache snapshot is wrong and can force an unnecessary reconnect on the next call. Also, if the DB changes between the stat and the successful reconnect, the stored snapshot may not match the actual opened DB. Restat after successfully creating/opening the collection (or otherwise capture the post-connect inode/mtime) before updating the globals.

Suggested change
_palace_db_inode = current_inode
_palace_db_mtime = current_mtime
try:
st = os.stat(db_path)
_palace_db_inode = st.st_ino
_palace_db_mtime = st.st_mtime
except OSError:
_palace_db_inode = 0
_palace_db_mtime = 0.0

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented — added comment explaining the benign race. Worst case is one extra reconnect on next call, which self-corrects.

Comment thread mempalace/mcp_server.py Outdated
_palace_db_mtime = 0.0
try:
col = _get_collection()
count = col.count() if col else 0
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tool_reconnect() currently reports success even when no collection could be opened (col is None), returning drawers=0. This makes it hard for callers to distinguish “reconnected” from “no palace / failed to connect”. Consider returning _no_palace() or success=False when _get_collection() returns None, and only reporting success after a real connection.

Suggested change
count = col.count() if col else 0
if not col:
return _no_palace()
count = col.count()

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — tool_reconnect() now returns {"success": false, "message": "No palace found after reconnect"} when _get_collection() returns None.

Comment thread mempalace/mcp_server.py
Comment on lines +962 to +972
"mempalace_reconnect": {
"description": (
"Force reconnect to the palace database. Use after external scripts or CLI commands"
" modified the palace directly, which can leave the in-memory HNSW index stale."
),
"input_schema": {
"type": "object",
"properties": {},
},
"handler": tool_reconnect,
},
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behavior (mtime/inode-based invalidation + mempalace_reconnect tool) isn’t covered by the existing MCP server tests. Since tests already exercise tools/list and tools/call dispatch, it would be good to add regression tests that (1) touch/modify chroma.sqlite3 and assert _get_collection() returns a new instance and (2) verify mempalace_reconnect appears in tools/list and clears the cache.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added 5 tests in TestCacheInvalidation: mtime change, inode change, missing DB, reconnect failure, and reconnect success.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
… tests

Addresses Copilot review feedback on MemPalace#663.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jphein jphein requested a review from igorls as a code owner April 12, 2026 03:17
jphein and others added 2 commits April 12, 2026 14:49
_get_collection() cached the ChromaDB collection wrapper but only
invalidated on process restart. When external tools (CLI mining,
scripts) wrote to the palace database in-place, the in-memory HNSW
index became stale — vector searches returned stale results until the
MCP server process restarted.

Fix: check both inode and mtime of chroma.sqlite3 on every
_get_collection() call. Inode changes catch full rebuilds (repair/nuke);
mtime changes catch in-place writes. Epsilon comparison (0.01s) avoids
spurious reconnects from filesystem timestamp granularity.

Also adds mempalace_reconnect MCP tool for manual cache flush — useful
after metadata-only updates (col.update()) that may not reliably change
mtime.
… tests

Addresses Copilot review feedback on MemPalace#663.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jphein jphein force-pushed the fix/mcp-mtime-reconnect-v2 branch from f8adfea to 59e0449 Compare April 12, 2026 21:53
igorls pushed a commit that referenced this pull request Apr 13, 2026
… tests

Addresses Copilot review feedback on #663.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
igorls added a commit that referenced this pull request Apr 13, 2026
…757)

When external tools write to the palace database (CLI mining, scripts), the MCP server's cached ChromaDB collection becomes stale — its HNSW index doesn't know about new vectors. Develop already invalidates on inode changes (catches rebuilds) but not on mtime changes (misses in-place writes).

This PR:
- Adds st_mtime tracking alongside st_ino in _get_client; invalidates the cached client on either change.
- Adds the mempalace_reconnect MCP tool for explicit cache flush.

Original author: @jphein (#663). Original approval: @Ari4ka.
Skips test_missing_db_invalidates_cache on Windows (ChromaDB holds chroma.sqlite3 open).
@igorls igorls closed this in #757 Apr 13, 2026
@igorls
Copy link
Copy Markdown
Member

igorls commented Apr 13, 2026

Landed on `develop` as #757 (merge `e200ce2`) — cherry-picked your two commits (authorship preserved) plus one CI-fix commit of mine:

  • Dropped an unused `result =` in `test_missing_db_invalidates_cache` (ruff F841).
  • `@pytest.mark.skipif(sys.platform == "win32", ...)` on that same test — on Windows, ChromaDB keeps `chroma.sqlite3` open through the cached client, so `os.remove()` raises `PermissionError` before the cache-invalidation path runs. Linux/macOS still exercise it.
  • Ruff-formatted the multi-global declaration in `_get_client`.

Closing this one; thanks for the thorough write-up comparing against #625.

Heads up — per RFC 001 (#743) §2.6, this freshness logic will migrate into `ChromaBackend.get_collection()` / `ChromaBackend.close_palace()` during the §10 cleanup. I'll add #757 to the §11 in-flight table so it's tracked during that migration.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 13, 2026
…ed var)

Post-v3.2.0 merge cleanup:
- mcp_server.py: remove duplicate tool_reconnect (keep ours with broader cache clear + add upstream's None guard)
- mcp_server.py: remove duplicate mempalace_reconnect TOOLS entry
- miner.py: remove unused effective_min variable
- CLAUDE.md: update version to 3.2.0, mark MemPalace#663 as closed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jphein jphein deleted the fix/mcp-mtime-reconnect-v2 branch April 19, 2026 03:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants