feat(searcher): wire i18n stop words into BM25 tokenizer (#973)#977
feat(searcher): wire i18n stop words into BM25 tokenizer (#973)#977mvalentsev wants to merge 7 commits intoMemPalace:developfrom
Conversation
|
I think there’s a cache-key bug in Right now the function is Concrete failure mode:
The MCP path calls I’d suggest either:
I’d block on this, because it makes the new behavior depend on whichever search happened first in the process. |
_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.
|
Sorry about the cache-key bug. Used Claude for the PR wiring and got the usual AI-shaped mess; you caught what I missed. c1ad86a splits out a cached New regression test flips |
c1ad86a to
ead5125
Compare
_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.
|
Hi, The stop-word filter in _tokenize() cannot remove single-character stop words because the tokenizer regex only emits tokens of length ≥2, so many shipped locale stop words (e.g., Japanese particles) are dead entries and filtering is ineffective for those locales. Severity: remediation recommended | Category: correctness How to fix: Align tokenizer and stopwords Agent prompt to fix - you can give this to your LLM of choice:
Qodo code review - free for open-source. |
|
Hi, search_memories() documents that omitting lang uses MempalaceConfig().lang (including entity_languages fallback), but the implementation uses lang_explicit via _resolve_stop_words(None), so stop-word filtering will not activate in cases the docstring claims it will. Severity: remediation recommended | Category: maintainability How to fix: Update docstring to lang_explicit Agent prompt to fix - you can give this to your LLM of choice:
Found by Qodo code review |
…tring
_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.
Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.
Reported by qodo-ai-reviewer on MemPalace#977.
|
Addressed both review comments in 39dbfef: dropped single-char CJK entries from ja/zh-CN stop_words so the list matches what \w{2,} tokenizer can emit, and synced the |
|
After #945 and #1001 merged, this branch now shows
Could you rebase onto I'll merge once CI is green on the rebased branch. |
_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.
…tring
_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.
Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.
Reported by qodo-ai-reviewer on MemPalace#977.
39dbfef to
95f5eb2
Compare
|
Rebased onto develop. Conflicts were in tests/test_i18n.py (non-overlapping additions: kept both blocks); zh-CN.json and ja.json applied cleanly through 3-way merge (entity section from #945 preserved, stop_words trim reapplied). All six CI jobs green. Ready when you are. |
95f5eb2 to
a542c94
Compare
_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.
…tring
_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.
Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.
Reported by qodo-ai-reviewer on MemPalace#977.
a542c94 to
d6e3608
Compare
_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.
…tring
_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.
Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.
Reported by qodo-ai-reviewer on MemPalace#977.
d6e3608 to
fe60e7c
Compare
_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.
…tring
_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.
Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.
Reported by qodo-ai-reviewer on MemPalace#977.
fe60e7c to
766f6b5
Compare
_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.
…tring
_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.
Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.
Reported by qodo-ai-reviewer on MemPalace#977.
|
Heads-up: rebased on develop tip just now and noticed that #1306 (hybrid candidate union, merged 2026-05-02) added several new BM25 scoring sites that this PR did not cover -- the original wiring landed before #1306 existed. Without the plumbing, paths silently bypass locale stop_words even when
Four new tests in
Edit 2026-05-03 evening: two follow-ups landed since the above note.
|
Tokenizer now filters locale-specific stop words when a language is configured, using the regex.stop_words lists already shipped in each mempalace/i18n/<lang>.json. Adds MempalaceConfig.lang (env, config.json, entity_languages[0], fallback to en) and a cached _resolve_stop_words helper so the search hot path stays O(1) after the first lookup. Default stop_words is an empty frozenset, preserving the pre-change behaviour when nothing is configured. CJK languages without whitespace still tokenize to a single mega-token; segmentation is deferred to a follow-up.
Split the language resolver so the opt-in signal is separable from the display-side fallback: - MempalaceConfig.lang_explicit returns the locale only when the user set MEMPALACE_LANG / MEMPAL_LANG or config.json["lang"]. It does not inherit from entity_languages. - MempalaceConfig.lang keeps its existing behaviour (env, file, entity_languages[0], "en") for localized output. - _resolve_stop_words(None) now reads lang_explicit and returns an empty frozenset when no language is explicitly configured. Palaces that never set a language get byte-for-byte pre-PR tokenization. Explicit config still activates the filter. Tests cover lang_explicit resolution branches and _resolve_stop_words behaviour on missing config, exception, and explicit lang.
_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.
…tring
_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.
Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.
Reported by qodo-ai-reviewer on MemPalace#977.
MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25 scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.
The MCP `search_memories` path resolves `_resolve_stop_words(lang)` and threads it into `_hybrid_rank` so MEMPALACE_LANG filters BM25 scoring. The `mempalace search ...` CLI handler calls the same `_hybrid_rank` but with no `stop_words` argument, so locale stop-word filtering only applied to the MCP surface. CLI users with MEMPALACE_LANG set silently got unfiltered scoring. Resolve once at the top of the CLI handler with `_resolve_stop_words(None)` (the helper picks up MempalaceConfig().lang_explicit from env/config) and pass through to `_hybrid_rank`. The empty-frozenset default of `_resolve_stop_words` keeps unconfigured palaces byte-for-byte identical.
- _resolve_stop_words(None) now reads MEMPALACE_LANG / MEMPAL_LANG env
vars before constructing MempalaceConfig. The previous path called
MempalaceConfig() per search, which json.load's config.json from disk
on every hot-path query. Env-var fast path skips the file read in the
common case of explicit-locale palaces.
- Canonicalize lang before lru_cache lookup. Variants like 'en' / 'EN'
/ 'En' previously consumed three separate cache slots pointing at
the same set; with maxsize=16 a tenant rotating through
capitalizations could thrash. Splits the cache onto a new
_stopwords_for_canonical(canonical_lang) keyed by the canonicalized
form; _stopwords_for_lang stays as the public wrapper.
- Add an autouse fixture to test_searcher.py that clears the lru_cache
and strips MEMPALACE_LANG / MEMPAL_LANG before each test. The five
prior tests each did a manual cache_clear() at the top, which made
it easy for a future test to forget the call and silently inherit
state from whatever ran before it.
Two new tests pin the new behavior:
test_resolve_stop_words_uses_env_var_before_config (MempalaceConfig
must not be constructed when env is set)
test_resolve_stop_words_canonicalizes_cache_key ('en' / 'EN' / 'En'
hit one cache slot)
c61ff57 to
4ec4faf
Compare
Summary
Wires locale-specific BM25 stop words from
mempalace/i18n/<lang>.jsoninto the searcher tokenizer as an opt-in feature. Previously_tokenize()stripped text to\w{2,}without any language awareness, so every locale'sregex.stop_wordslist sat unused.Addresses the infrastructure gap reported in #973.
Changes
mempalace/i18n/__init__.pyget_stopwords(lang=None) -> set[str]. Whenlangis given, loads that locale's JSON directly and does not touch the module-level_strings/_current_lang. When omitted, reads the currently loaded locale viaget_regex(). Parses the space-separatedregex.stop_wordsstring into a lowercased set.mempalace/config.pyMempalaceConfig.lang_explicitproperty. Returns the locale string only when the user setMEMPALACE_LANG/MEMPAL_LANGorconfig.json["lang"]. ReturnsNoneotherwise. This is the opt-in signal that gates the search-side filter.MempalaceConfig.langkeeps its existing shape:lang_explicitfirst, then first entry ofentity_languages, then"en". Used for display-side output that cannot handleNone.mempalace/searcher.py_tokenize(text, stop_words=frozenset())takes an optional filter set. When empty (the default), behaviour is byte-for-byte identical to the pre-PR tokenizer._bm25_scoresand_hybrid_rankgain astop_wordsparameter, threaded through to the internal_tokenizecalls._resolve_stop_words(lang)helper (uncached) readsMempalaceConfig().lang_explicitwhenlangisNone, so a mid-process env/config change takes effect on the next search. Returns an empty frozenset if no language is explicitly configured so palaces that never set one keep pre-PR scoring. Failure to construct the config logs at DEBUG and falls back to the empty set. The per-locale parse is cached inside_stopwords_for_lang(lang: str)so the hot path still avoids re-readingmempalace/i18n/<lang>.jsonon every call.search_memories(..., lang=None)reads the resolved set up front and passes it through the drawer-grep enrichment and the hybrid re-rank.Backwards compatibility
The filter is off by default. A palace that has never set
MEMPALACE_LANG,MEMPAL_LANG, orconfig.json["lang"]gets the same BM25 scoring as before this PR:_resolve_stop_words(None)returns an empty frozenset,_tokenizeshort-circuits to the pre-PR path. Existing English palaces see no ranking change without explicit action.To opt in, a user sets one of:
MEMPALACE_LANG=en(orru,fr,de,es,pt-br,it,id,hi)config.jsonfield"lang": "en"(etc.)search_memories(..., lang="en")programmaticallysearch()(the CLI print variant) is not affected since it does not run BM25 re-ranking.Scope limitation: CJK languages
_TOKEN_RE = \w{2,}produces a single mega-token for Japanese, Chinese, or Korean text that has no whitespace ("プロジェクトをしました"tokenizes to["プロジェクトをしました"]). A stop-words filter on character-agnostic tokens cannot help these locales. They need a real segmenter (MeCab, jieba, or konlpy). That segmentation work is deferred to a follow-up; this PR wires the infrastructure so the filter is ready for the locales it can help today (en, ru, fr, de, es, pt-br, it, id, hi), and the CJK case becomes a single tokenizer swap in a later change.Test plan
tests/test_i18n.py,tests/test_config.py,tests/test_searcher.pycovering:get_stopwordslang override vs default, global-state non-mutation, unknown-locale empty return, every shipped locale has a non-empty set,cfg.langandcfg.lang_explicitenv/file/entity_languages/default branches (including the opt-in signal vs display-side separation),_tokenizestop-word filtering,_bm25_scoresscore divergence with and without filter, all-stopwords query and all-stopwords docs edge cases,_resolve_stop_wordsreturns empty on missing explicit lang, returns empty on config exception, applies filter when explicit lang is set,_stopwords_for_langcache hit, and a regression test that flipsFakeCfg.lang_explicitbetweenNoneand"ja"mid-call to prove theNone-arg path reflects config changes.ruff check .clean;ruff format --check .clean againsttool.ruffline-length=100/target-version=py39.Updates
ead5125): split the cached parse into_stopwords_for_lang(lang: str)after @igorls flagged a cache-key bug. The original_resolve_stop_words(lang)was@lru_cache'd by theOptional[str]input, pinning theNone-arg result for the lifetime of the process even whenMEMPALACE_LANG/config.json["lang"]changed. Outer resolver now runs on every call; inner per-locale parse stays cached.39dbfef): align localestop_wordswith the tokenizer._TOKEN_RE = \w{2,}filters tokens below 2 characters before the stop-word check, so single-character particles inja.json(は が を に で と ...) andzh-CN.json(的 了 在 是 我 有 ...) could never actually fire. Removed the phantom entries so the tokenizer contract and the stop-word list agree. Runtime behaviour unchanged.