Skip to content

fix(mcp): retry _get_collection once on transient failure (#1286)#1377

Merged
igorls merged 1 commit intodevelopfrom
fix/get-collection-retry-on-exception
May 6, 2026
Merged

fix(mcp): retry _get_collection once on transient failure (#1286)#1377
igorls merged 1 commit intodevelopfrom
fix/get-collection-retry-on-exception

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented May 6, 2026

Summary

Surgical extraction of the _get_collection retry+log idea from #1286. Wraps the body in for attempt in range(2); on attempt 0 failure, log via logger.exception(...) and clear _client_cache / _collection_cache / _metadata_cache so the next iteration forces _get_client() to rebuild. That path now re-runs quarantine_stale_hnsw (#1322), so the second attempt heals the common stale-handle case automatically.

Why this PR exists separately

#1286 carries the same idea (~25 LOC of mcp_server.py) plus a 5500-line fork-sync (FORK_CHANGELOG.md, docs/superpowers/**, deploy scripts, version bump that breaks check-versions, a convo_miner.py rewrite that regresses test_convo_mining and test_mine_convos_rebuilds_stale_drawers_after_schema_bump on all three OSes). I extracted just the retry shape so we can land it ahead of the v3.3.5 deadline; #1286 will be closed with a comment offering to land the genuinely-useful follow-ups (session-recovery collection split, chroma.py None-meta coercion, closet-boost ranking refactor) as separate PRs.

Credit to @jphein via Co-authored-by: on the commit.

Behavior

  • Healthy palace, warm cache: zero overhead (returns on first attempt).
  • Healthy palace, cold cache: zero overhead (returns on first attempt).
  • Transient failure (stale handle, partial flush race, etc.): one retry after cache clear, succeeds.
  • Permanent failure: returns None (matches the prior contract).
  • Every failure is logged via logger.exception(...) — previously silent.

Test plan

  • New unit tests in tests/test_mcp_server.py::TestCacheInvalidation:
    • test_get_collection_retries_once_on_exception — first attempt raises, second succeeds; result is returned, not None.
    • test_get_collection_returns_none_after_two_failures — both attempts fail; assert exactly 2 attempts and None returned (no infinite loop).
  • uvx ruff@0.4 check . and ruff format --check . clean.
  • Full suite locally: 1554 passed, 1 skipped in 55s (pytest tests/ --ignore=tests/benchmarks).

Out of scope (deferred to follow-ups)

If @jphein can rebase those onto current develop, happy to review them as separate PRs.

A transient chromadb exception inside `_get_collection` was swallowed by
the bare `except Exception: return None`, leaving every subsequent tool
call hitting the same poisoned cache silently. The fix wraps the body
in a `for attempt in range(2)` loop: on attempt 0 failure, log via
`logger.exception(...)` and clear `_client_cache` / `_collection_cache`
/ `_metadata_cache` so the next iteration forces `_get_client()` to
rebuild from scratch — that path now re-runs `quarantine_stale_hnsw`
(per #1322), so the second attempt heals the common stale-handle case
automatically. If both attempts fail, return `None` (matches the prior
contract for permanent failures).

Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`:
- `test_get_collection_retries_once_on_exception` — first attempt raises
  via a monkeypatched `_get_client`, second attempt succeeds; assert the
  caller gets the collection back, not None.
- `test_get_collection_returns_none_after_two_failures` — both attempts
  fail, assert we exhaust the loop and return None (no infinite retry).

Surgical extraction from PR #1286, which carried the same fix idea
(plus a fork-sync bundle that couldn't be merged); credit to the
original author below.

Co-authored-by: Jeffrey Hein <jp@jphein.com>
Copilot AI review requested due to automatic review settings May 6, 2026 07:52
@igorls igorls requested a review from milla-jovovich as a code owner May 6, 2026 07:52
@igorls igorls added this to the v3.3.5 milestone May 6, 2026
@igorls igorls requested a review from bensig as a code owner May 6, 2026 07:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the MCP server’s resilience when opening the ChromaDB collection by logging failures and retrying _get_collection() once after clearing cached client/collection/metadata state, addressing common “stale handle” transient failures without requiring a process restart.

Changes:

  • Wrap mcp_server._get_collection() in a 2-attempt loop, logging exceptions and clearing caches before the single retry.
  • Preserve the existing API contract by still returning None after two failures.
  • Add unit tests covering the “fails once then succeeds” retry behavior and the “fails twice then returns None” behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
mempalace/mcp_server.py Add exception logging + one retry with cache invalidation to self-heal transient collection-open failures.
tests/test_mcp_server.py Add regression tests validating retry-once semantics and no-infinite-loop behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@igorls igorls merged commit f0d2360 into develop May 6, 2026
10 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request May 6, 2026
When the vector index returns fewer than n_results (sparse HNSW
post-repair, MemPalace#951 filter-planner failure, drift), search_memories now:

1. Computes an authoritative scope count via paginated col.get(),
   surfaced as `available_in_scope` in the response. Caps each query
   below MemPalace#950's SQL-variable limit.

2. Tops up the hits list with BM25-ranked sqlite candidates tagged
   `matched_via: "sqlite_bm25_fallback"` when the vector path is
   under-delivering. Skips candidates with BM25 score 0 so the
   fallback never pads with unrelated content.

3. Returns `warnings: [...]` describing when fallback fired and when
   the scope contains more drawers than the vector path can rank
   (gated on a `vector_underdelivered` flag captured before fallback
   runs, so the warning surfaces even when BM25 papered over the gap).

CLI search() delegates to search_memories() so terminal output and
MCP responses share the same retrieval, fallback, and warning
semantics. Preserves the palace path in printed errors.

Closes the silent 0-hit failure mode where data was in sqlite but
the vector path returned nothing — visible to the user via warnings
and `available_in_scope`, fixable via `mempalace repair`.

Tests: 29/29 pass on rebased branch (Python 3.9 floor honored via
Optional[int]). Mock setup updated to set count.return_value so the
new "more in scope" warning path doesn't fail on MagicMock comparison.

Squashed rebase against current upstream/develop (post-MemPalace#1377). Was
filed as 5-commit history; squashed for cleaner review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants