fix(searcher): tolerate None documents in BM25 reranker#1198
fix(searcher): tolerate None documents in BM25 reranker#1198igorls merged 2 commits intoMemPalace:developfrom
Conversation
`_tokenize` calls `text.lower()` unconditionally; when ChromaDB returns a
drawer with `documents` containing `None`, the hybrid-rerank path raises
`AttributeError: 'NoneType' object has no attribute 'lower'`.
Observed in production daemon log (2026-04-24 21:07:05) during a search
that triggered `_hybrid_rank → _bm25_scores → _tokenize`:
File "mempalace/searcher.py", line 81, in _bm25_scores
tokenized = [_tokenize(d) for d in documents]
File "mempalace/searcher.py", line 52, in _tokenize
return _TOKEN_RE.findall(text.lower())
AttributeError: 'NoneType' object has no attribute 'lower'
Closes the gap left by the upstream None-metadata audit (MemPalace#999), which
covered metadata loops but not BM25 helpers. Returns `[]` for falsy input
so a None doc gets score 0.0 while the rest of the corpus reranks normally.
Three regression tests in TestBM25NoneSafety lock the behavior and reference
the production trace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the hybrid BM25 re-ranker against None documents returned by ChromaDB so searches don’t crash mid-rerank in production.
Changes:
- Add a
None/empty guard insearcher._tokenizeso missing documents produce no tokens (and thus BM25 score 0.0). - Add regression tests covering
_tokenize(None),_tokenize(""), and_bm25_scoreswith a mixed corpus containing aNone.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
mempalace/searcher.py |
Makes _tokenize tolerate falsy input to prevent AttributeError during BM25 scoring. |
tests/test_searcher.py |
Adds regression tests to ensure BM25 helpers don’t crash when Chroma returns None documents. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _tokenize(text: str) -> list: | ||
| """Lowercase + strip to alphanumeric tokens of length ≥ 2.""" | ||
| """Lowercase + strip to alphanumeric tokens of length ≥ 2. | ||
|
|
||
| Tolerates ``None`` documents — Chroma can return ``None`` in the | ||
| ``documents`` field for drawers without text content, which would |
There was a problem hiding this comment.
_tokenize now intentionally accepts None (and tests cover it), but the type annotation still says text: str. Please update the signature/return typing to reflect the supported inputs (e.g., str | None) and consider narrowing the guard to the intended cases (None/empty string) so unexpected non-string falsy values don’t get silently treated as empty text.
| from mempalace.searcher import _bm25_scores | ||
| scores = _bm25_scores("postgres migration", ["postgres migration done", None, "kafka rebalance"]) | ||
| assert len(scores) == 3 | ||
| assert scores[1] == 0.0 | ||
| assert scores[0] > 0.0 |
There was a problem hiding this comment.
This regression test claims “finite scores for the rest”, but it only asserts on scores[0] and the None doc. Consider also asserting that the other score(s) are finite (and/or the expected value, e.g. 0.0 for the non-matching doc) to make the test match its stated guarantee and catch potential NaN/inf regressions.
After today's work: - Merged upstream/develop (29 commits), absorbing MemPalace#976 (HNSW + mine_global_lock + PreCompact attempt-cap), MemPalace#1179 (CLI BM25), MemPalace#1180/MemPalace#1182/MemPalace#1183/MemPalace#1185 features. - Filed PR MemPalace#1198 (`_tokenize` None guard) upstream. - Closed PR MemPalace#1171 (adapter-level flock) — superseded by MemPalace#976 + daemon-strict. - palace-daemon migration commits (c09582c + 0e97b19) make the daemon the single canonical writer; in-tree adapter flock no longer carries its weight. README: - Lead paragraph: 2026-04-21 → 2026-04-25 sync date, 165,632 → 151,420 drawer count, daemon-on-disks topology, listed absorbed upstream PRs. - Fork change queue: dropped MemPalace#1171 row, added MemPalace#1198 row. - "Multi-client coordination" section: rewrote — palace-daemon is now the fork's primary answer (was "deferred"), MemPalace#1171 retired, narrative shifted to defense-in-depth around MemPalace#976 + daemon. - Two-layer memory model: storage path is now palace-daemon HTTP, not ~/.mempalace/palace. - Ecosystem palace-daemon row: marks integration as complete (commits cited). - Open upstream PRs table: refreshed to MemPalace org URLs, added MemPalace#1198 row, removed MemPalace#1171, added 2026-04-25 develop sync to merged list, added MemPalace#1171 to closed list with rationale. - Test count 1096 → 1334. CLAUDE.md: - Row 20 (MemPalace#1198): "Upstream PR pending" → "Filed as MemPalace#1198 on 2026-04-24". - Row MemPalace#1171 in PR table: open → closed (with MemPalace#976 supersession). - Added MemPalace#1198 row to PR table. - Top-of-section count: "14 merged, 7 open, 9 closed" → "14 merged, 7 open (including MemPalace#1198), 10 closed (added MemPalace#1171)" + sync date 2026-04-25. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI lint job runs `ruff format --check`; the new tests in TestBM25NoneSafety needed the standard "blank line after import-inside-function" + line-length wrap. No logic change — formatter pass only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new scripts plus legitimate fixes they surfaced. scripts/preflight.sh — runs the same checks CI does (ruff check, ruff format --check, pytest). Prevents the class of "I ran ruff format locally but CI runs ruff check too" bugs we hit on PR MemPalace#1024 and PR MemPalace#1198 today. scripts/rebase-on-develop.sh — rebases a fork PR branch onto upstream/develop, then auto-restores fork-only collateral (.claude-plugin/hooks/*.sh, .claude-plugin/.mcp.json) to upstream's state. The collateral cleanup was the recurring step I had to remember manually for each of MemPalace#1005/MemPalace#1024/MemPalace#1177; this codifies it. Supports --finish for the "after I resolved conflicts manually" continuation path. Never auto-pushes — prints the push command for confirmation. Lint debt the new preflight caught on main: - tests/test_palace_graph.py: F811 — `invalidate_graph_cache` was imported at line 9, then re-imported inside the chromadb-mock patch.dict block at line 44. Removed the duplicate; the line-9 import is sufficient since the autouse fixture at line 12 already uses it. - mempalace/backends/chroma.py: ruff format wanted a one-line `collection.modify(configuration=...)` call instead of the wrapped multi-line form. - mempalace/hooks_cli.py: ruff format wanted blank line after import in the daemon-URL try block, and dict-literal style for the json.dumps call. CI on jphein/mempalace runs both ruff check and ruff format --check; without these fixes the next push to main would have shown red. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bensig
left a comment
There was a problem hiding this comment.
Approve, no asks. Defensive single-line fix with a real production trace driving it.
What landed
`_tokenize` now returns `[]` for falsy input instead of raising `AttributeError` on `None.lower()`. The trace in the test docstring is genuine (daemon log 2026-04-24 21:07:05) — Chroma was returning `None` for drawers without text content during a hybrid-rerank pass, and the BM25 path crashed mid-search.
Test coverage
The new `TestBM25NoneSafety` class covers both the `None` and empty-string paths and includes the failure trace in the docstring so a future reader knows exactly what symptom this prevents. `pytest tests/test_searcher.py` — 27 passed on this branch.
Minor
Could also have hardened `_bm25_scores` upstream of `_tokenize` to filter `None` documents before iterating. The chosen approach (defend at the leaf function) is more defensive and catches any other call site that might pass `None` — better choice for a small surface like this.
Ship it.
…#1173 status CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec) that landed earlier today. PR table updated to reflect bensig's 2026-04-26 approvals on four of our open PRs: - MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit shape, then force-pushed two safety commits (cold-start gate + integrity sniff-test) after a production cold-start incident destroyed three healthy 253MB segments. Now mergeable=CONFLICTING against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs rebase + re-review. - MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge. YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG regenerated to match. promises.md (in claude-config, not this repo) got a long log entry covering today's full output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @bensig — ready for merge whenever convenient. Tests stable on the fork's daily-driver palace + smoke pass on the develop branch. |
…rry-pick of MemPalace#1094) Fork main was carrying the per-site `meta = meta or {}` guards from MemPalace#999 in eight read paths but didn't have the boundary coercion that closes the issue once for all callers. Cherry-picked MemPalace#1094 (open upstream, jp-authored) so the typed QueryResult/GetResult contract — both declare `metadatas: list[dict]`, never `list[Optional[dict]]` — is honored at the chromadb adapter boundary. Three same-family fork-ahead PRs we filed since (MemPalace#1198 _tokenize None-doc, MemPalace#1201 palace_graph None metadata, MemPalace#1199 review) all pointed at gaps that would have been impossible if this pattern had been in place. Future None-metadata bugs of the same shape are now catchable at one site instead of N. Per-site guards are kept as belt-and-suspenders for paths that might bypass the typed wrappers (e.g. legacy `import chromadb` direct calls). A future cleanup PR can remove them once we're confident no caller reaches chromadb directly. 6 new tests in test_backends.py (MemPalace#1094 originals); composes cleanly with fork main's _segment_appears_healthy + _quarantined_paths from earlier today. Full suite 1383/1383. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211 MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four PRs that landed today (21:22-21:41 UTC): MemPalace#1173 quarantine_stale_hnsw on make_client + cold-start gate + integrity sniff-test (PROTO/STOP byte check, no deserialization) MemPalace#1177 .blob_seq_ids_migrated marker guard, closes MemPalace#1090 MemPalace#1198 _tokenize None-document guard in BM25 reranker MemPalace#1201 palace_graph.build_graph skips None metadata Conflict resolution: * mempalace/backends/chroma.py — took upstream as base (it has the igorls-review pickle-protocol docstring, thread-safety paragraph, and Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas in query()/get() since MemPalace#1094 is still open and not yet in develop. * mempalace/mcp_server.py — took upstream's clean form. Dropped the fork-only `palace_path=` kwarg from four ChromaCollection() call sites: the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so the kwarg has no remaining consumer. ChromaCollection.__init__ in upstream/develop is back to (self, collection); calling it with palace_path= raised TypeError → silently swallowed by the broad except in _get_collection() → returned None → tool_status() returned _no_palace(). 41 mcp_server tests went from failing-with-KeyError to passing. * mempalace/cli.py — dropped fork-only `workers=args.workers` from the cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the `--workers` argparse arg landed in 5cd14bd but miner.mine() never accepted a workers param, so production `mempalace mine` TypeError'd on every invocation. Removed the broken plumbing; tests/test_cli.py updated to match. * CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml). * CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's added a "Design Principles / Contributing" charter, which lives in README.md on the fork. * tests/test_backends.py — took upstream's ruff-formatted line widths. docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate, hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated. scripts/check-docs.sh: 4/4 clean. Test suite: 1460/1460. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upstream merged MemPalace#1173, MemPalace#1177, MemPalace#1198, MemPalace#1201 into develop on 2026-04-26 (picked up by the 2026-04-27 develop sync). Strike out their fork-ahead rows in CLAUDE.md and add to the "Merged into upstream" section. Update PR table accordingly: 18 merged (was 14), 7 open (was 8). Bump version note: upstream shipped v3.3.3 on 2026-04-24 (carries our MemPalace#659/MemPalace#1021); v3.3.4 prep branch in flight. README: - "tracks upstream/develop through the 2026-04-27 sync" (was 2026-04-26) - 17 fork-ahead changes (was 22) - 1,510 tests pass on main (was 1,395 — upstream brought new suites) - "Open upstream PRs" 7 entries (was 11) - Drop merged rows from "Fork-ahead — open or pending" table; keep PreCompact recovery-marker row (still fork-only) scripts/check-docs.sh: clean (90 PR refs match upstream state, 13 fork hashes resolve, FORK_CHANGELOG renders clean from YAML). docs/fork-changes.yaml: no edit needed — already had merged_date on the 4 entries from the 2026-04-26 commit `5fd15db`.
…lace#1198) Merged upstream None guard into our CJK-aware tokenizer.
Sync with upstream/develop via cherry-pick of 9 critical bugfixes: - HNSW index bloat prevention (MemPalace#1191) - SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289) - blob-seq marker fast-path (MemPalace#1177) - palace_graph None-metadata guard (MemPalace#1201) - palace_graph security tunnels (MemPalace#1168) - hyphenated wing name normalization (MemPalace#1197) - searcher _tokenize None guard (MemPalace#1198) - mcp_server diary topic sanitize (MemPalace#936) Tests: 1459 default + 113 benchmark + 6/7 stress all pass. (random-kill stress test was failing on dev pre-merge; not regressed.)
Problem
searcher._tokenizecallstext.lower()unconditionally. When ChromaDB returns a drawer withdocumentscontainingNone, the hybrid-rerank path raisesAttributeErrormid-search, dropping the entire query response.Observed in production on 2026-04-24 21:07:05:
This closes the gap left by #999's None-metadata audit, which fixed metadata read loops but did not extend to BM25 reranker helpers.
Fix
_tokenizeshort-circuits to[]whentextis falsy (Noneor""). ANonedoc gets BM25 score 0.0; the rest of the corpus reranks normally.Tests
Three regression tests in
TestBM25NoneSafety:test_tokenize_handles_none— direct unit guardtest_tokenize_handles_empty_string— covers the empty-string case at the same boundarytest_bm25_scores_does_not_crash_on_none_documents— load-bearing: mixed corpus with aNonemid-list, asserts score 0.0 for that doc and finite scores for the rest. This is the exact production failure shape.All existing searcher and hybrid_search tests still pass (
pytest tests/test_searcher.py tests/test_hybrid_search.py→ 30/30 against develop after cherry-pick).🤖 Generated with Claude Code