feat(searcher): wire i18n stop words into BM25 tokenizer (#973) #977

Open
mvalentsev wants to merge 7 commits into MemPalace:develop from mvalentsev:feat/searcher-i18n-stopwords

Conversation

@mvalentsev
Contributor

@mvalentsev mvalentsev commented Apr 17, 2026

Summary

Wires locale-specific BM25 stop words from mempalace/i18n/<lang>.json into the searcher tokenizer as an opt-in feature. Previously _tokenize() stripped text to \w{2,} without any language awareness, so every locale's regex.stop_words list sat unused.

Addresses the infrastructure gap reported in #973.

Changes

mempalace/i18n/__init__.py

  • New get_stopwords(lang=None) -> set[str]. When lang is given, loads that locale's JSON directly and does not touch the module-level _strings / _current_lang. When omitted, reads the currently loaded locale via get_regex(). Parses the space-separated regex.stop_words string into a lowercased set.

mempalace/config.py

  • New MempalaceConfig.lang_explicit property. Returns the locale string only when the user set MEMPALACE_LANG / MEMPAL_LANG or config.json["lang"]. Returns None otherwise. This is the opt-in signal that gates the search-side filter.
  • MempalaceConfig.lang keeps its existing shape: lang_explicit first, then first entry of entity_languages, then "en". Used for display-side output that cannot handle None.
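
The split between the opt-in signal and the display-side fallback could look roughly like the sketch below. `SketchConfig` is a hypothetical stand-in, not the real `MempalaceConfig`; the injectable `env` mapping and the relative order of the two env vars are assumptions made so the example is self-contained.

```python
import os
from typing import Optional


class SketchConfig:
    """Hypothetical stand-in for MempalaceConfig; only lang logic is modelled."""

    def __init__(self, file_config: Optional[dict] = None,
                 env: Optional[dict] = None):
        self._file = file_config or {}
        # Assumption: env is injectable here purely to keep the sketch testable.
        self._env = os.environ if env is None else env

    @property
    def lang_explicit(self) -> Optional[str]:
        """Locale only when the user set it via env or config.json["lang"]."""
        for name in ("MEMPALACE_LANG", "MEMPAL_LANG"):
            value = self._env.get(name)
            if value:
                return value
        return self._file.get("lang") or None

    @property
    def lang(self) -> str:
        """Display-side locale: never None, falls back to entity_languages, then "en"."""
        explicit = self.lang_explicit
        if explicit:
            return explicit
        languages = self._file.get("entity_languages") or []
        return languages[0] if languages else "en"
```

With `{"entity_languages": ["de"]}` and no env vars, `lang` is `"de"` while `lang_explicit` stays `None`, so the search-side filter remains off.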

mempalace/searcher.py

  • _tokenize(text, stop_words=frozenset()) takes an optional filter set. When empty (the default), behaviour is byte-for-byte identical to the pre-PR tokenizer.
  • _bm25_scores and _hybrid_rank gain a stop_words parameter, threaded through to the internal _tokenize calls.
  • New _resolve_stop_words(lang) helper (uncached) reads MempalaceConfig().lang_explicit when lang is None, so a mid-process env/config change takes effect on the next search. Returns an empty frozenset if no language is explicitly configured so palaces that never set one keep pre-PR scoring. Failure to construct the config logs at DEBUG and falls back to the empty set. The per-locale parse is cached inside _stopwords_for_lang(lang: str) so the hot path still avoids re-reading mempalace/i18n/<lang>.json on every call.
  • search_memories(..., lang=None) reads the resolved set up front and passes it through the drawer-grep enrichment and the hybrid re-rank.
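
To make the effect of the filter on scoring concrete, here is a minimal sketch of a stop-word-aware tokenizer feeding a textbook Okapi BM25. It is not the PR's `_bm25_scores`; the lowercasing step, the `k1` / `b` defaults, and the exact IDF formula are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

_TOKEN_RE = re.compile(r"\w{2,}")


def tokenize(text, stop_words=frozenset()):
    """Emit \\w{2,} tokens; an empty stop_words set leaves the output unfiltered."""
    tokens = _TOKEN_RE.findall(text.lower())  # assumption: tokens are lowercased
    if not stop_words:
        return tokens
    return [tok for tok in tokens if tok not in stop_words]


def bm25_scores(query, docs, stop_words=frozenset(), k1=1.5, b=0.75):
    """Score each doc against query with Okapi BM25 over filtered tokens."""
    query_terms = list(dict.fromkeys(tokenize(query, stop_words)))
    doc_tokens = [tokenize(doc, stop_words) for doc in docs]
    total_len = sum(len(toks) for toks in doc_tokens)
    # All-stop-word queries or documents leave nothing to score.
    if not query_terms or total_len == 0:
        return [0.0] * len(docs)
    avg_len = total_len / len(docs)
    doc_freq = Counter(term for toks in doc_tokens for term in set(toks))
    scores = []
    for toks in doc_tokens:
        counts = Counter(toks)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)
            norm = tf + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

Filtering changes both the query terms and the document lengths used for normalization, which is why scores diverge once a stop-word set is supplied.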

Backwards compatibility

The filter is off by default. A palace that has never set MEMPALACE_LANG, MEMPAL_LANG, or config.json["lang"] gets the same BM25 scoring as before this PR: _resolve_stop_words(None) returns an empty frozenset, _tokenize short-circuits to the pre-PR path. Existing English palaces see no ranking change without explicit action.

To opt in, a user sets one of:

  • MEMPALACE_LANG=en (or ru, fr, de, es, pt-br, it, id, hi)
  • config.json field "lang": "en" (etc.)
  • search_memories(..., lang="en") programmatically

search() (the CLI print variant) is not affected since it does not run BM25 re-ranking.

Scope limitation: CJK languages

_TOKEN_RE = \w{2,} produces a single mega-token for Japanese, Chinese, or Korean text that has no whitespace ("プロジェクトをしました" tokenizes to ["プロジェクトをしました"]). A stop-words filter on character-agnostic tokens cannot help these locales. They need a real segmenter (MeCab, jieba, or konlpy). That segmentation work is deferred to a follow-up; this PR wires the infrastructure so the filter is ready for the locales it can help today (en, ru, fr, de, es, pt-br, it, id, hi), and the CJK case becomes a single tokenizer swap in a later change.
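
The mega-token behaviour is easy to reproduce with the standard `re` module alone. The snippet below is a small demonstration, not code from the PR; the English sample sentence is made up for contrast.

```python
import re

_TOKEN_RE = re.compile(r"\w{2,}")

japanese = "プロジェクトをしました"       # no whitespace between words
english = "we finished the project"  # whitespace-delimited

# Every kana character matches \w, so the whole run is one token.
ja_tokens = _TOKEN_RE.findall(japanese)
en_tokens = _TOKEN_RE.findall(english)

# A stop-word set containing the particle cannot match the mega-token.
ja_filtered = [tok for tok in ja_tokens if tok not in {"を"}]
```

Here `ja_tokens` holds the full sentence as a single entry and `ja_filtered` is identical to it, while `en_tokens` splits into four words.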

Test plan

  • 19 new unit tests across tests/test_i18n.py, tests/test_config.py, tests/test_searcher.py covering: get_stopwords lang override vs default, global-state non-mutation, unknown-locale empty return, every shipped locale has a non-empty set, cfg.lang and cfg.lang_explicit env/file/entity_languages/default branches (including the opt-in signal vs display-side separation), _tokenize stop-word filtering, _bm25_scores score divergence with and without filter, all-stopwords query and all-stopwords docs edge cases, _resolve_stop_words returns empty on missing explicit lang, returns empty on config exception, applies filter when explicit lang is set, _stopwords_for_lang cache hit, and a regression test that flips FakeCfg.lang_explicit between None and "ja" mid-call to prove the None-arg path reflects config changes.
  • Full suite: 981 passed, 3 deselected (one env-flaky subprocess test unrelated to this PR).
  • ruff check . clean; ruff format --check . clean against tool.ruff line-length=100 / target-version=py39.

Updates

  • 2026-04-18 (ead5125): split the cached parse into _stopwords_for_lang(lang: str) after @igorls flagged a cache-key bug. The original _resolve_stop_words(lang) was @lru_cache'd by the Optional[str] input, pinning the None-arg result for the lifetime of the process even when MEMPALACE_LANG / config.json["lang"] changed. Outer resolver now runs on every call; inner per-locale parse stays cached.
  • 2026-04-19 (39dbfef): align locale stop_words with the tokenizer. _TOKEN_RE = \w{2,} filters tokens below 2 characters before the stop-word check, so single-character particles in ja.json (は が を に で と ...) and zh-CN.json (的 了 在 是 我 有 ...) could never actually fire. Removed the phantom entries so the tokenizer contract and the stop-word list agree. Runtime behaviour unchanged.
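
One way to detect such phantom entries is to check each stop word against the tokenizer regex itself. The helper below is a hypothetical sketch (`split_reachable` is not part of the PR), and the sample entries are illustrative rather than copied from ja.json.

```python
import re

_TOKEN_RE = re.compile(r"\w{2,}")


def split_reachable(stop_words):
    """Partition stop words into those the tokenizer can emit and dead entries."""
    reachable, dead = set(), set()
    for word in stop_words:
        # fullmatch: the entry must itself be a valid \w{2,} token.
        if _TOKEN_RE.fullmatch(word):
            reachable.add(word)
        else:
            dead.add(word)
    return reachable, dead
```

For the illustrative set `{"は", "が", "です", "ます"}`, the two single-character particles land in `dead` and the two-character entries in `reachable`.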

@igorls
Member

igorls commented Apr 18, 2026

I think there’s a cache-key bug in _resolve_stop_words(None).

Right now the function is @lru_cache'd by the raw lang argument, but when lang is None the result actually depends on mutable config/env state via MempalaceConfig().lang_explicit. That means the first call with None pins the result for the lifetime of the process.

Concrete failure mode:

  1. First search runs with no explicit language configured.
  2. _resolve_stop_words(None) caches frozenset().
  3. User later sets MEMPALACE_LANG=fr or updates config.json["lang"].
  4. Future search_memories(..., lang=None) calls still reuse the old cached empty set.

The MCP path calls search_memories(...) without a lang override, so this is reachable in normal usage.

I’d suggest either:

  • caching only explicit lang values and skipping cache when lang is None, or
  • resolving lang_explicit first and caching by that resolved string instead of the raw input arg.

I’d block on this, because it makes the new behavior depend on whichever search happened first in the process.

mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 18, 2026
_resolve_stop_words used @lru_cache keyed by the Optional[str] input
arg. When called with lang=None, the result depended on mutable state
in MempalaceConfig().lang_explicit, but the cache pinned the first
outcome for the lifetime of the process. Users who set MEMPALACE_LANG
or config.json["lang"] after the first search kept getting the stale
empty set.

Split cached parsing into _stopwords_for_lang(str) keyed by the
canonical locale code. _resolve_stop_words(Optional[str]) now runs
config resolution on every call; the per-locale parse stays cached.

Adds a regression test that toggles FakeCfg between two lang values
and asserts the second call reflects the change.

Reported by @igorls on MemPalace#977.
@mvalentsev
Contributor Author

Sorry about the cache-key bug. Used Claude for the PR wiring and got the usual AI-shaped mess; you caught what I missed.

c1ad86a splits out a cached _stopwords_for_lang(lang: str) helper and keeps _resolve_stop_words(lang: Optional[str]) uncached, so config resolution runs on every call with lang=None and the per-locale parse stays cached. Behavior is now order-invariant: the first search no longer pins the result, later env/config changes take effect immediately.

New regression test flips FakeCfg.lang_explicit between None and "ja" mid-call and checks the second result reflects the change. Used to fail, passes now.

@mvalentsev mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from c1ad86a to ead5125 Compare April 18, 2026 21:13
@Qodo-Free-For-OSS

Hi, The stop-word filter in _tokenize() cannot remove single-character stop words because the tokenizer regex only emits tokens of length ≥2, so many shipped locale stop words (e.g., Japanese particles) are dead entries and filtering is ineffective for those locales.

Severity: remediation recommended | Category: correctness

How to fix: Align tokenizer and stopwords

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

The new stop-word feature is ineffective for locales whose regex.stop_words contains single-character entries because _tokenize() only emits \w{2,} tokens.

Issue Context

  • Tokenization currently uses _TOKEN_RE = re.compile(r"\w{2,}").
  • Shipped locale stop-word lists (e.g. ja, zh-CN) include many 1-character particles.

Fix Focus Areas

Choose one consistent approach:

  • Update tokenization (opt-in) so single-character tokens can be filtered for locales that declare them.
  • Or, update locale JSON regex.stop_words to only include tokens the tokenizer can actually emit.

Target code:

  • mempalace/searcher.py[34-62]
  • mempalace/i18n/ja.json[38-42]
  • mempalace/i18n/zh-CN.json[38-42]

Qodo code review - free for open-source.

@Qodo-Free-For-OSS

Hi, search_memories() documents that omitting lang uses MempalaceConfig().lang (including entity_languages fallback), but the implementation uses lang_explicit via _resolve_stop_words(None), so stop-word filtering will not activate in cases the docstring claims it will.

Severity: remediation recommended | Category: maintainability

How to fix: Update docstring to lang_explicit

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

search_memories(..., lang=None) is documented as using MempalaceConfig().lang, but the stop-word logic uses MempalaceConfig().lang_explicit (opt-in).

Issue Context

This mismatch can cause callers to believe stop-word filtering will activate via entity_languages fallback when it will not.

Fix Focus Areas

  • Update the lang: parameter docstring to match the actual resolution behavior (explicit-only unless lang param is provided).

Target code:

  • mempalace/searcher.py[355-383]
  • mempalace/searcher.py[77-99]

Found by Qodo code review

mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 19, 2026
…tring

_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.

Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.

Reported by qodo-ai-reviewer on MemPalace#977.
@mvalentsev
Contributor Author

Addressed both review comments in 39dbfef: dropped single-char CJK entries from ja/zh-CN stop_words so the list matches what \w{2,} tokenizer can emit, and synced the lang= docstring with the opt-in resolution path. CJK morphological segmentation stays a follow-up.

@igorls
Member

igorls commented Apr 21, 2026

After #945 and #1001 merged, this branch now shows CONFLICTING / DIRTY. Likely overlap points:

Could you rebase onto develop and resolve? No behavior changes needed — just reconcile the file-level overlaps. After rebase, the opt-in filter also newly applies to de / es / fr / zh-CN / zh-TW stopwords (since those locales now have populated entity data), which is the desired outcome.

I'll merge once CI is green on the rebased branch.

@mvalentsev mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from 39dbfef to 95f5eb2 Compare April 21, 2026 06:27
@mvalentsev
Contributor Author

Rebased onto develop. Conflicts were in tests/test_i18n.py (non-overlapping additions: kept both blocks); zh-CN.json and ja.json applied cleanly through 3-way merge (entity section from #945 preserved, stop_words trim reapplied). All six CI jobs green. Ready when you are.

@igorls igorls added enhancement New feature or request area/search Search and retrieval area/i18n Multilingual, Unicode, non-English embeddings labels Apr 24, 2026
@mvalentsev mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from 95f5eb2 to a542c94 Compare April 25, 2026 13:33
@mvalentsev mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from a542c94 to d6e3608 Compare April 26, 2026 21:53
@mvalentsev mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from d6e3608 to fe60e7c Compare April 27, 2026 06:23
@mvalentsev mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from fe60e7c to 766f6b5 Compare May 3, 2026 12:42
@mvalentsev
Contributor Author

mvalentsev commented May 3, 2026

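Heads-up: rebased on develop tip just now and noticed that #1306 (hybrid candidate union, merged 2026-05-02) added several new BM25 scoring sites that this PR did not cover -- the original wiring landed before #1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when MEMPALACE_LANG is set:

  • vector_disabled=True (the #1222 fallback): BM25-only scoring runs without filtering.
  • candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale.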

554a216 hoists stop_words = _resolve_stop_words(lang) once before the vector_disabled branch and threads it through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. 3686154 does the same for the CLI search() function. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left alone -- the trigram FTS5 index already mismatches the \w{2,} regex, so changing that selection layer is out of this PR's scope.

Four new tests in tests/test_searcher.py pin the propagation:

  • test_bm25_only_via_sqlite_forwards_stop_words_to_bm25_scores
  • test_apply_candidate_strategy_forwards_stop_words_to_merger
  • test_search_memories_vector_disabled_uses_resolved_stop_words
  • test_search_cli_threads_resolved_stop_words_to_hybrid_rank
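
The hoist-and-thread pattern these tests pin can be sketched with simple recording stubs. Everything below is hypothetical scaffolding (the function names only echo the real ones, and the bodies are spies rather than rankers); it shows the shape of the propagation check, not the PR's test code.

```python
calls = []  # records which scoring path ran and with which stop_words


def resolve_stop_words(lang):
    # Hypothetical resolver: only "en" has stop words in this sketch.
    return frozenset({"the"}) if lang == "en" else frozenset()


def bm25_only(query, docs, stop_words=frozenset()):
    calls.append(("bm25_only", stop_words))
    return docs


def hybrid_rank(query, docs, stop_words=frozenset()):
    calls.append(("hybrid_rank", stop_words))
    return docs


def search(query, docs, lang=None, vector_disabled=False):
    # Resolved once, before branching, so every path sees the same set.
    stop_words = resolve_stop_words(lang)
    if vector_disabled:
        return bm25_only(query, docs, stop_words=stop_words)
    return hybrid_rank(query, docs, stop_words=stop_words)
```

A test can then call `search(..., lang="en", vector_disabled=True)` and assert that the last recorded call carries the resolved set rather than the empty default.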

Edit 2026-05-03 evening: two follow-ups landed since the above note.

c61ff57 split the lru_cache onto _stopwords_for_canonical(canonical_lang) and made the outer _stopwords_for_lang(lang) canonicalize via _canonical_lang(lang) or lang.lower() before the cache lookup. That collapses "en" / "EN" / "En" / "en-US" into one cache slot instead of four; without canonicalization a tenant rotating through capitalizations could exhaust the maxsize=16 cache on the same locale. New tests/test_searcher.py::test_resolve_stop_words_canonicalizes_cache_key pins it.
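
A minimal sketch of the canonicalize-then-cache idea follows. The `canonical_lang` rules (exact match on the lowercased code, then primary subtag) and the `_WORDS` table are assumptions for illustration; the real `_canonical_lang` may resolve locales differently.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical locale data; the keys double as the set of known locales.
_WORDS = {"en": "the and", "fr": "le la", "pt-br": "de que"}


def canonical_lang(lang: str) -> Optional[str]:
    """Map a user-supplied code to a known locale, or None if unknown."""
    code = lang.strip().lower()
    if code in _WORDS:
        return code
    primary = code.split("-")[0]  # e.g. "en-us" -> "en"
    return primary if primary in _WORDS else None


@lru_cache(maxsize=16)
def stopwords_for_canonical(canonical: str) -> frozenset:
    """Cached parse keyed by the canonical form only."""
    return frozenset(_WORDS.get(canonical, "").split())


def stopwords_for_lang(lang: str) -> frozenset:
    """Wrapper: canonicalize first so spelling variants share one cache slot."""
    return stopwords_for_canonical(canonical_lang(lang) or lang.lower())
```

After looking up `"en"`, `"EN"`, `"En"`, and `"en-US"` on a fresh cache, `stopwords_for_canonical.cache_info()` reports a single entry with one miss and three hits.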

d9b9970 moved the env-var fast path (MEMPALACE_LANG / MEMPAL_LANG) into _resolve_stop_words(None) itself: when either is set, the resolver returns directly without constructing MempalaceConfig, which would otherwise read and parse config.json from disk on the hot search path. It also adds an autouse fixture in the test module that calls _stopwords_for_canonical.cache_clear() and strips both env vars between tests, so the suite no longer needs five manual cache_clear() calls scattered across tests. The new test_resolve_stop_words_uses_env_var_before_config test pins that MempalaceConfig() is not constructed when the env var is set (TripwireCfg sentinel).
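
The env-var fast path and the tripwire idea can be sketched as follows. `resolve_lang`, the injectable `env` mapping, and `TripwireConfig` are hypothetical names for this illustration; only the two env var names and the "do not construct the config" contract come from the note above.

```python
import os
from typing import Optional


class TripwireConfig:
    """Sentinel config: constructing it means the fast path was skipped."""

    def __init__(self):
        raise AssertionError("config must not be constructed when env is set")


def resolve_lang(env=None, config_factory=TripwireConfig) -> Optional[str]:
    """Return the explicit language, consulting env vars before any config."""
    env = os.environ if env is None else env
    for name in ("MEMPALACE_LANG", "MEMPAL_LANG"):
        value = env.get(name)
        if value:
            return value  # fast path: no config object, no disk read
    # Slow path: only now is the (possibly file-backed) config built.
    return config_factory().lang_explicit
```

With `{"MEMPALACE_LANG": "fr"}` the function returns `"fr"` without touching the factory; with an empty mapping and the tripwire in place it raises, which is what makes the sentinel useful in a test.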

mvalentsev added 7 commits May 6, 2026 10:34
Tokenizer now filters locale-specific stop words when a language is
configured, using the regex.stop_words lists already shipped in each
mempalace/i18n/<lang>.json. Adds MempalaceConfig.lang (env, config.json,
entity_languages[0], fallback to en) and a cached _resolve_stop_words
helper so the search hot path stays O(1) after the first lookup. Default
stop_words is an empty frozenset, preserving the pre-change behaviour
when nothing is configured.

CJK languages without whitespace still tokenize to a single mega-token;
segmentation is deferred to a follow-up.

Split the language resolver so the opt-in signal is separable from the
display-side fallback:

- MempalaceConfig.lang_explicit returns the locale only when the user
  set MEMPALACE_LANG / MEMPAL_LANG or config.json["lang"]. It does not
  inherit from entity_languages.
- MempalaceConfig.lang keeps its existing behaviour (env, file,
  entity_languages[0], "en") for localized output.
- _resolve_stop_words(None) now reads lang_explicit and returns an empty
  frozenset when no language is explicitly configured. Palaces that
  never set a language get byte-for-byte pre-PR tokenization. Explicit
  config still activates the filter.

Tests cover lang_explicit resolution branches and _resolve_stop_words
behaviour on missing config, exception, and explicit lang.

_resolve_stop_words used @lru_cache keyed by the Optional[str] input
arg. When called with lang=None, the result depended on mutable state
in MempalaceConfig().lang_explicit, but the cache pinned the first
outcome for the lifetime of the process. Users who set MEMPALACE_LANG
or config.json["lang"] after the first search kept getting the stale
empty set.

Split cached parsing into _stopwords_for_lang(str) keyed by the
canonical locale code. _resolve_stop_words(Optional[str]) now runs
config resolution on every call; the per-locale parse stays cached.

Adds a regression test that toggles FakeCfg between two lang values
and asserts the second call reflects the change.

Reported by @igorls on MemPalace#977.

…tring

_tokenize() emits only \w{2,} tokens, so single-character entries in
ja.json and zh-CN.json regex.stop_words could never match. Removed
single-char kana from ja and single-char hanzi from zh-CN; remaining
≥2-char entries are what _tokenize actually produces.

Also updated the search_memories(lang=) docstring to reflect the opt-in
lang_explicit resolution implemented in fb1a133; prior text described
the pre-opt-in MempalaceConfig().lang fallback chain.

Reported by qodo-ai-reviewer on MemPalace#977.

MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25
scoring site inside _bm25_only_via_sqlite that this PR did not cover --
the original wiring landed before MemPalace#1306 existed. Without the plumbing,
two paths silently bypass locale stop_words even when the palace has
MEMPALACE_LANG set:

- vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without
  filtering.
- candidate_strategy="union": BM25 candidates merged into the rerank
  pool come from a tokenizer that ignored the configured locale, so
  the merged-in entries fight the lang-aware _hybrid_rank rerank.

Resolution moves once: stop_words = _resolve_stop_words(lang) is
hoisted before the vector_disabled branch and threaded through
_apply_candidate_strategy, _merge_bm25_union_candidates, and
_bm25_only_via_sqlite into _bm25_scores.

The FTS5 candidate-selection _tokenize at the top of
_bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index
is built with a trigram tokenizer that already mismatches our
\w{2,} regex, so dropping further tokens there changes selection
semantics in ways outside this PR's scope.

Three tests pin the propagation: a sqlite-backed test for
_bm25_only_via_sqlite -> _bm25_scores, a unit spy on
_apply_candidate_strategy -> merger, and an end-to-end on
search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

The MCP `search_memories` path resolves `_resolve_stop_words(lang)` and
threads it into `_hybrid_rank` so MEMPALACE_LANG filters BM25 scoring.
The `mempalace search ...` CLI handler calls the same `_hybrid_rank`
but with no `stop_words` argument, so locale stop-word filtering only
applied to the MCP surface. CLI users with MEMPALACE_LANG set silently
got unfiltered scoring.

Resolve once at the top of the CLI handler with `_resolve_stop_words(None)`
(the helper picks up MempalaceConfig().lang_explicit from env/config) and
pass through to `_hybrid_rank`. The empty-frozenset default of
`_resolve_stop_words` keeps unconfigured palaces byte-for-byte identical.

- _resolve_stop_words(None) now reads MEMPALACE_LANG / MEMPAL_LANG env
  vars before constructing MempalaceConfig. The previous path called
  MempalaceConfig() per search, which json.load's config.json from disk
  on every hot-path query. Env-var fast path skips the file read in the
  common case of explicit-locale palaces.

- Canonicalize lang before lru_cache lookup. Variants like 'en' / 'EN'
  / 'En' previously consumed three separate cache slots pointing at
  the same set; with maxsize=16 a tenant rotating through
  capitalizations could thrash. Splits the cache onto a new
  _stopwords_for_canonical(canonical_lang) keyed by the canonicalized
  form; _stopwords_for_lang stays as the public wrapper.

- Add an autouse fixture to test_searcher.py that clears the lru_cache
  and strips MEMPALACE_LANG / MEMPAL_LANG before each test. The five
  prior tests each did a manual cache_clear() at the top, which made
  it easy for a future test to forget the call and silently inherit
  state from whatever ran before it.

Two new tests pin the new behavior:
  test_resolve_stop_words_uses_env_var_before_config (MempalaceConfig
  must not be constructed when env is set)
  test_resolve_stop_words_canonicalizes_cache_key ('en' / 'EN' / 'En'
  hit one cache slot)

@mvalentsev mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from c61ff57 to 4ec4faf Compare May 6, 2026 05:37

3 participants