feat(searcher): wire i18n stop words into BM25 tokenizer (#973) by mvalentsev · Pull Request #977 · MemPalace/mempalace

mvalentsev · 2026-04-17T19:09:22Z

Summary

Wires locale-specific BM25 stop words from mempalace/i18n/<lang>.json into the searcher tokenizer as an opt-in feature. Previously _tokenize() stripped text to \w{2,} without any language awareness, so every locale's regex.stop_words list sat unused.

Addresses the infrastructure gap reported in #973.

Changes

mempalace/i18n/__init__.py

New get_stopwords(lang=None) -> set[str]. When lang is given, loads that locale's JSON directly and does not touch the module-level _strings / _current_lang. When omitted, reads the currently loaded locale via get_regex(). Parses the space-separated regex.stop_words string into a lowercased set.
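A minimal sketch of the parsing step described above, assuming a locale file shaped like `{"regex": {"stop_words": "..."}}`. The sample payload and the `parse_stop_words` name are hypothetical; the real `get_stopwords` also handles locale loading, which is not modelled here.

```python
import json

# Hypothetical locale payload. The real files live in mempalace/i18n/<lang>.json
# and their exact contents are not shown in this PR description.
SAMPLE_LOCALE_JSON = '{"regex": {"stop_words": "The and OF  the"}}'


def parse_stop_words(locale_json: str) -> set:
    """Split a space-separated regex.stop_words string into a lowercased set."""
    data = json.loads(locale_json)
    raw = data.get("regex", {}).get("stop_words", "")
    # str.split() with no argument collapses runs of whitespace and drops empties.
    return {word.lower() for word in raw.split()}


print(sorted(parse_stop_words(SAMPLE_LOCALE_JSON)))  # ['and', 'of', 'the']
```

A locale without a `regex.stop_words` entry yields an empty set in this sketch, which matches the "unknown-locale empty return" behaviour listed in the test plan.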

mempalace/config.py

New MempalaceConfig.lang_explicit property. Returns the locale string only when the user set MEMPALACE_LANG / MEMPAL_LANG or config.json["lang"]. Returns None otherwise. This is the opt-in signal that gates the search-side filter.
MempalaceConfig.lang keeps its existing shape: lang_explicit first, then first entry of entity_languages, then "en". Used for display-side output that cannot handle None.
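The split between the opt-in signal and the display-side fallback could look roughly like the sketch below. `SketchConfig` is a hypothetical stand-in for `MempalaceConfig`; the precedence between the two env vars, and the injected `env` mapping, are assumptions made for illustration.

```python
import os
from typing import Optional


class SketchConfig:
    """Hypothetical stand-in for MempalaceConfig; only the resolution order is modelled."""

    def __init__(self, file_config: dict, env: Optional[dict] = None):
        self._file = file_config
        # Injecting env keeps the sketch testable; os.environ is the default source.
        self._env = os.environ if env is None else env

    @property
    def lang_explicit(self) -> Optional[str]:
        # Only values the user set on purpose count as an opt-in.
        for var in ("MEMPALACE_LANG", "MEMPAL_LANG"):  # order is an assumption
            value = self._env.get(var)
            if value:
                return value
        return self._file.get("lang") or None

    @property
    def lang(self) -> str:
        # Display-side value: never None.
        explicit = self.lang_explicit
        if explicit:
            return explicit
        entity_languages = self._file.get("entity_languages") or []
        return entity_languages[0] if entity_languages else "en"


cfg = SketchConfig({"entity_languages": ["de"]}, env={})
print(cfg.lang_explicit, cfg.lang)  # None de
```

In this sketch a palace with only `entity_languages` set still displays in German but does not opt in to stop-word filtering.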

mempalace/searcher.py

_tokenize(text, stop_words=frozenset()) takes an optional filter set. When empty (the default), behaviour is byte-for-byte identical to the pre-PR tokenizer.
_bm25_scores and _hybrid_rank gain a stop_words parameter, threaded through to the internal _tokenize calls.
New _resolve_stop_words(lang) helper (uncached) reads MempalaceConfig().lang_explicit when lang is None, so a mid-process env/config change takes effect on the next search. Returns an empty frozenset if no language is explicitly configured so palaces that never set one keep pre-PR scoring. Failure to construct the config logs at DEBUG and falls back to the empty set. The per-locale parse is cached inside _stopwords_for_lang(lang: str) so the hot path still avoids re-reading mempalace/i18n/<lang>.json on every call.
search_memories(..., lang=None) reads the resolved set up front and passes it through the drawer-grep enrichment and the hybrid re-rank.
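For readers who want the tokenizer contract in one place, here is a rough sketch. The `\w{2,}` pattern is from the PR; lowercasing the input and the function name `tokenize` are assumptions, since the real `_tokenize` body is not shown here.

```python
import re

_TOKEN_RE = re.compile(r"\w{2,}")


def tokenize(text: str, stop_words: frozenset = frozenset()) -> list:
    """Return word tokens of length >= 2, optionally dropping stop words."""
    tokens = _TOKEN_RE.findall(text.lower())  # lowercasing is an assumption
    if not stop_words:
        # Empty filter: return the unfiltered tokens, i.e. the pre-PR behaviour.
        return tokens
    return [t for t in tokens if t not in stop_words]


print(tokenize("The cat and a dog"))
# ['the', 'cat', 'and', 'dog']
print(tokenize("The cat and a dog", frozenset({"the", "and"})))
# ['cat', 'dog']
```

Note that "a" is dropped in both calls because the regex never emits single-character tokens, independent of any stop-word set.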

Backwards compatibility

The filter is off by default. A palace that has never set MEMPALACE_LANG, MEMPAL_LANG, or config.json["lang"] gets the same BM25 scoring as before this PR: _resolve_stop_words(None) returns an empty frozenset, _tokenize short-circuits to the pre-PR path. Existing English palaces see no ranking change without explicit action.

To opt in, a user sets one of:

MEMPALACE_LANG=en (or ru, fr, de, es, pt-br, it, id, hi)
config.json field "lang": "en" (etc.)
search_memories(..., lang="en") programmatically

search() (the CLI print variant) is not affected since it does not run BM25 re-ranking.

Scope limitation: CJK languages

_TOKEN_RE = \w{2,} produces a single mega-token for Japanese, Chinese, or Korean text that has no whitespace ("プロジェクトをしました" tokenizes to ["プロジェクトをしました"]). A stop-words filter on character-agnostic tokens cannot help these locales. They need a real segmenter (MeCab, jieba, or konlpy). That segmentation work is deferred to a follow-up; this PR wires the infrastructure so the filter is ready for the locales it can help today (en, ru, fr, de, es, pt-br, it, id, hi), and the CJK case becomes a single tokenizer swap in a later change.
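The mega-token behaviour is easy to reproduce with the standard `re` module alone. The stop-word set below is a hypothetical example, not the shipped `ja.json` list.

```python
import re

_TOKEN_RE = re.compile(r"\w{2,}")

english = _TOKEN_RE.findall("we did the project")
japanese = _TOKEN_RE.findall("プロジェクトをしました")

print(english)        # ['we', 'did', 'the', 'project']
print(len(japanese))  # 1

# Hypothetical stop-word set: it contains the particle, but the particle never
# appears as a standalone token, so filtering changes nothing.
stop_words = frozenset({"を"})
filtered = [t for t in japanese if t not in stop_words]
print(filtered == japanese)  # True
```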

Test plan

19 new unit tests across tests/test_i18n.py, tests/test_config.py, tests/test_searcher.py covering: get_stopwords lang override vs default, global-state non-mutation, unknown-locale empty return, every shipped locale has a non-empty set, cfg.lang and cfg.lang_explicit env/file/entity_languages/default branches (including the opt-in signal vs display-side separation), _tokenize stop-word filtering, _bm25_scores score divergence with and without filter, all-stopwords query and all-stopwords docs edge cases, _resolve_stop_words returns empty on missing explicit lang, returns empty on config exception, applies filter when explicit lang is set, _stopwords_for_lang cache hit, and a regression test that flips FakeCfg.lang_explicit between None and "ja" mid-call to prove the None-arg path reflects config changes.
Full suite: 981 passed, 3 deselected (one env-flaky subprocess test unrelated to this PR).
ruff check . clean; ruff format --check . clean against tool.ruff line-length=100 / target-version=py39.
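The score-divergence and all-stopwords cases in the test plan can be illustrated with a small, self-contained BM25 sketch. This is not the project's `_bm25_scores`: the Okapi-style formula, the `k1` / `b` defaults, and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

_TOKEN_RE = re.compile(r"\w{2,}")


def tokenize(text, stop_words=frozenset()):
    return [t for t in _TOKEN_RE.findall(text.lower()) if t not in stop_words]


def bm25_scores(query, docs, stop_words=frozenset(), k1=1.5, b=0.75):
    """Score each doc against the query with an Okapi-style BM25 (illustrative)."""
    q_terms = set(tokenize(query, stop_words))
    tokenized = [tokenize(d, stop_words) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    if not q_terms or avgdl == 0:
        # All-stopwords query or all-stopwords corpus: nothing left to score.
        return [0.0] * n
    # Document frequency, counted only for terms that appear in the query.
    df = Counter(term for toks in tokenized for term in set(toks) if term in q_terms)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in q_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["the cat sat on the mat", "the the the the report"]
plain = bm25_scores("the cat", docs)
filtered = bm25_scores("the cat", docs, stop_words=frozenset({"the", "on"}))
print(plain[1] > 0, filtered[1] == 0.0)  # True True
```

Without the filter, the second document scores on "the" alone; with the filter it drops to zero and only the document that actually mentions "cat" keeps a positive score.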

Updates

2026-04-18 (ead5125): split the cached parse into _stopwords_for_lang(lang: str) after @igorls flagged a cache-key bug. The original _resolve_stop_words(lang) was @lru_cache'd by the Optional[str] input, pinning the None-arg result for the lifetime of the process even when MEMPALACE_LANG / config.json["lang"] changed. Outer resolver now runs on every call; inner per-locale parse stays cached.
2026-04-19 (39dbfef): align locale stop_words with the tokenizer. _TOKEN_RE = \w{2,} filters tokens below 2 characters before the stop-word check, so single-character particles in ja.json (はがをにでと ...) and zh-CN.json (的了在是我有 ...) could never actually fire. Removed the phantom entries so the tokenizer contract and the stop-word list agree. Runtime behaviour unchanged.

igorls · 2026-04-18T05:05:54Z

I think there’s a cache-key bug in _resolve_stop_words(None).

Right now the function is @lru_cache'd by the raw lang argument, but when lang is None the result actually depends on mutable config/env state via MempalaceConfig().lang_explicit. That means the first call with None pins the result for the lifetime of the process.

Concrete failure mode:

First search runs with no explicit language configured.
_resolve_stop_words(None) caches frozenset().
User later sets MEMPALACE_LANG=fr or updates config.json["lang"].
Future search_memories(..., lang=None) calls still reuse the old cached empty set.
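The failure mode above can be reproduced in isolation. The sketch below is not the PR's code: `resolve_cached` and the in-memory stop-word table are hypothetical, and an env var read stands in for `MempalaceConfig().lang_explicit`.

```python
import os
from functools import lru_cache
from typing import Optional

# Hypothetical in-memory table standing in for the locale JSON files.
_STOP_WORDS = {"fr": frozenset({"le", "la", "les"})}


@lru_cache(maxsize=16)
def resolve_cached(lang: Optional[str]) -> frozenset:
    # Bug reproduced on purpose: with lang=None the result depends on the
    # environment, but the cache key is just None.
    if lang is None:
        lang = os.environ.get("MEMPALACE_LANG")
    return _STOP_WORDS.get(lang, frozenset()) if lang else frozenset()


os.environ.pop("MEMPALACE_LANG", None)
first = resolve_cached(None)    # nothing configured yet, caches frozenset()
os.environ["MEMPALACE_LANG"] = "fr"
second = resolve_cached(None)   # served from the cache, env change ignored
os.environ.pop("MEMPALACE_LANG", None)

print(first == second == frozenset())  # True
```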

The MCP path calls search_memories(...) without a lang override, so this is reachable in normal usage.

I’d suggest either:

caching only explicit lang values and skipping cache when lang is None, or
resolving lang_explicit first and caching by that resolved string instead of the raw input arg.

I’d block on this, because it makes the new behavior depend on whichever search happened first in the process.

@igorls

_resolve_stop_words used @lru_cache keyed by the Optional[str] input arg. When called with lang=None, the result depended on mutable state in MempalaceConfig().lang_explicit, but the cache pinned the first outcome for the lifetime of the process. Users who set MEMPALACE_LANG or config.json["lang"] after the first search kept getting the stale empty set. Split cached parsing into _stopwords_for_lang(str) keyed by the canonical locale code. _resolve_stop_words(Optional[str]) now runs config resolution on every call; the per-locale parse stays cached. Adds a regression test that toggles FakeCfg between two lang values and asserts the second call reflects the change. Reported by @igorls on MemPalace#977.

mvalentsev · 2026-04-18T11:36:46Z

Sorry about the cache-key bug. Used Claude for the PR wiring and got the usual AI-shaped mess; you caught what I missed.

c1ad86a splits out a cached _stopwords_for_lang(lang: str) helper and keeps _resolve_stop_words(lang: Optional[str]) uncached, so config resolution runs on every call with lang=None and the per-locale parse stays cached. Behavior is now order-invariant: the first search no longer pins the result, later env/config changes take effect immediately.

New regression test flips FakeCfg.lang_explicit between None and "ja" mid-call and checks the second result reflects the change. Used to fail, passes now.
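The shape of the fix, sketched under the same simplifications as the real helpers would need: an env var read stands in for `lang_explicit`, and the stop-word table is hypothetical. Only the split between an uncached outer resolver and a cached per-locale lookup is taken from the PR.

```python
import os
from functools import lru_cache
from typing import Optional

# Hypothetical in-memory table standing in for the locale JSON files.
_STOP_WORDS = {"fr": frozenset({"le", "la", "les"})}


@lru_cache(maxsize=16)
def stopwords_for_lang(lang: str) -> frozenset:
    # Cached: the per-locale parse depends only on the locale code.
    return _STOP_WORDS.get(lang, frozenset())


def resolve_stop_words(lang: Optional[str] = None) -> frozenset:
    # Uncached: mutable state is re-read on every call.
    if lang is None:
        lang = os.environ.get("MEMPALACE_LANG")  # stand-in for lang_explicit
    if not lang:
        return frozenset()
    return stopwords_for_lang(lang)


os.environ.pop("MEMPALACE_LANG", None)
before = resolve_stop_words()
os.environ["MEMPALACE_LANG"] = "fr"
after = resolve_stop_words()
os.environ.pop("MEMPALACE_LANG", None)

print(sorted(before), sorted(after))  # [] ['la', 'le', 'les']
```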

Qodo-Free-For-OSS · 2026-04-19T10:42:12Z

Hi, the stop-word filter in _tokenize() cannot remove single-character stop words because the tokenizer regex only emits tokens of length ≥2, so many shipped locale stop words (e.g., Japanese particles) are dead entries and filtering is ineffective for those locales.

Severity: remediation recommended | Category: correctness

How to fix: Align tokenizer and stopwords

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

The new stop-word feature is ineffective for locales whose regex.stop_words contains single-character entries because _tokenize() only emits \w{2,} tokens.

Issue Context

Tokenization currently uses _TOKEN_RE = re.compile(r"\w{2,}").

Shipped locale stop-word lists (e.g. ja, zh-CN) include many 1-character particles.

Fix Focus Areas

Choose one consistent approach:

Update tokenization (opt-in) so single-character tokens can be filtered for locales that declare them.

Or, update locale JSON regex.stop_words to only include tokens the tokenizer can actually emit.

Target code:

mempalace/searcher.py[34-62]

mempalace/i18n/ja.json[38-42]

mempalace/i18n/zh-CN.json[38-42]

Qodo code review - free for open-source.

Qodo-Free-For-OSS · 2026-04-19T10:46:10Z

Hi, search_memories() documents that omitting lang uses MempalaceConfig().lang (including entity_languages fallback), but the implementation uses lang_explicit via _resolve_stop_words(None), so stop-word filtering will not activate in cases the docstring claims it will.

Severity: remediation recommended | Category: maintainability

How to fix: Update docstring to lang_explicit

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

search_memories(..., lang=None) is documented as using MempalaceConfig().lang, but the stop-word logic uses MempalaceConfig().lang_explicit (opt-in).

Issue Context

This mismatch can cause callers to believe stop-word filtering will activate via entity_languages fallback when it will not.

Fix Focus Areas

Update the lang: parameter docstring to match the actual resolution behavior (explicit-only unless lang param is provided).

Target code:

mempalace/searcher.py[355-383]

mempalace/searcher.py[77-99]

Found by Qodo code review

…tring _tokenize() emits only \w{2,} tokens, so single-character entries in ja.json and zh-CN.json regex.stop_words could never match. Removed single-char kana from ja and single-char hanzi from zh-CN; remaining ≥2-char entries are what _tokenize actually produces. Also updated the search_memories(lang=) docstring to reflect the opt-in lang_explicit resolution implemented in fb1a133; prior text described the pre-opt-in MempalaceConfig().lang fallback chain. Reported by qodo-ai-reviewer on MemPalace#977.

mvalentsev · 2026-04-19T12:02:45Z

Addressed both review comments in 39dbfef: dropped single-char CJK entries from ja/zh-CN stop_words so the list matches what \w{2,} tokenizer can emit, and synced the lang= docstring with the opt-in resolution path. CJK morphological segmentation stays a follow-up.
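One way to check a stop-word list against the tokenizer contract is to keep only the entries the regex can emit as a whole token. The list below is a hypothetical sample, not the shipped `ja.json` contents.

```python
import re

_TOKEN_RE = re.compile(r"\w{2,}")

# Hypothetical sample mixing single-character particles with longer entries.
stop_words = ["は", "が", "これ", "それ", "を"]

# fullmatch succeeds only when the whole entry is a valid token.
live = [w for w in stop_words if _TOKEN_RE.fullmatch(w)]
dead = [w for w in stop_words if not _TOKEN_RE.fullmatch(w)]

print(live)  # ['これ', 'それ']
print(dead)  # ['は', 'が', 'を']
```

A check of this shape could also serve as a schema test that fails when a locale file gains an entry the tokenizer can never produce.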

igorls · 2026-04-21T03:57:22Z

After #945 and #1001 merged, this branch now shows CONFLICTING / DIRTY. Likely overlap points:

mempalace/i18n/__init__.py: #1001 (feat(i18n): add entity detection to German, Spanish, and French locales) added a schema-invariant test path; your get_stopwords addition likely textually abuts those changes.
mempalace/i18n/zh-CN.json: #945 (feat(i18n): add Traditional + Simplified Chinese entity detection) added an entity section here; your PR tweaked stop_words.
mempalace/i18n/ja.json: minor stop_words edit may overlap with recent changes.

Could you rebase onto develop and resolve? No behavior changes needed — just reconcile the file-level overlaps. After rebase, the opt-in filter also newly applies to de / es / fr / zh-CN / zh-TW stopwords (since those locales now have populated entity data), which is the desired outcome.

I'll merge once CI is green on the rebased branch.

mvalentsev · 2026-04-21T06:36:26Z

Rebased onto develop. Conflicts were in tests/test_i18n.py (non-overlapping additions: kept both blocks); zh-CN.json and ja.json applied cleanly through 3-way merge (entity section from #945 preserved, stop_words trim reapplied). All six CI jobs green. Ready when you are.

mvalentsev · 2026-05-03T13:04:37Z

Heads-up: rebased on develop tip just now and noticed that #1306 (hybrid candidate union, merged 2026-05-02) added several new BM25 scoring sites that this PR did not cover -- the original wiring landed before #1306 existed. Without the plumbing, three paths silently bypass locale stop_words even when MEMPALACE_LANG is set:

vector_disabled=True (#1222 fallback) -- BM25-only scoring via _bm25_only_via_sqlite runs unfiltered.
candidate_strategy="union" -- BM25 candidates merged into the rerank pool come from an unfiltered tokenizer, so the merged-in entries fight the lang-aware _hybrid_rank rerank.
CLI search() -- #1306 added a _hybrid_rank(hits, query) call into the mempalace search ... print variant, but the call passed no stop_words. The CLI was the last _hybrid_rank caller still without it, so users got filtered scoring on the MCP path but unfiltered on the CLI. That also makes the body's "Backwards compatibility" line stale -- "search() is not affected since it does not run BM25 re-ranking" is no longer true post-#1306.

554a216 hoists stop_words = _resolve_stop_words(lang) once before the vector_disabled branch and threads it through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. 3686154 does the same for the CLI search() function. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left alone -- the trigram FTS5 index already mismatches the \w{2,} regex, so changing that selection layer is out of this PR's scope.

Four new tests in tests/test_searcher.py pin the propagation:

test_bm25_only_via_sqlite_forwards_stop_words_to_bm25_scores
test_apply_candidate_strategy_forwards_stop_words_to_merger
test_search_memories_vector_disabled_uses_resolved_stop_words
test_search_cli_threads_resolved_stop_words_to_hybrid_rank

Edit 2026-05-03 evening: two follow-ups landed since the above note.

c61ff57 split the lru_cache onto _stopwords_for_canonical(canonical_lang) and made the outer _stopwords_for_lang(lang) canonicalize via _canonical_lang(lang) or lang.lower() before the cache lookup. That collapses "en" / "EN" / "En" / "en-US" into one cache slot instead of four; without canonicalization a tenant rotating through capitalizations could exhaust the maxsize=16 cache on the same locale. New tests/test_searcher.py::test_resolve_stop_words_canonicalizes_cache_key pins it.
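A rough sketch of the canonicalize-then-cache split. The alias table and the body of `canonical_lang` are hypothetical, since the real `_canonical_lang` is not shown in this thread; only the `canonical or lang.lower()` fallback and the `maxsize=16` cache follow the description above.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical alias table and stop-word data, for illustration only.
_ALIASES = {"en": "en", "en-us": "en", "pt-br": "pt-br"}
_STOP_WORDS = {"en": frozenset({"the", "and"})}


def canonical_lang(lang: str) -> Optional[str]:
    return _ALIASES.get(lang.strip().lower().replace("_", "-"))


@lru_cache(maxsize=16)
def stopwords_for_canonical(canonical: str) -> frozenset:
    return _STOP_WORDS.get(canonical, frozenset())


def stopwords_for_lang(lang: str) -> frozenset:
    # Canonicalize before the cache lookup so spelling variants share one slot.
    return stopwords_for_canonical(canonical_lang(lang) or lang.lower())


for variant in ("en", "EN", "En", "en-US"):
    stopwords_for_lang(variant)

print(stopwords_for_canonical.cache_info().currsize)  # 1
```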

d9b9970 moved the env-var fast path (MEMPALACE_LANG / MEMPAL_LANG) into _resolve_stop_words(None) itself: when either is set, the resolver returns directly without constructing MempalaceConfig (which would json.load config.json from disk on the hot search path). It also adds an autouse fixture in the test module that calls _stopwords_for_canonical.cache_clear() and strips both env vars between tests, so the suite no longer needs five manual cache_clear() calls scattered across tests. The new test_resolve_stop_words_uses_env_var_before_config test pins that MempalaceConfig() is not constructed when the env var is set (TripwireCfg sentinel).
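The env-var fast path and the tripwire idea can be sketched together. `TripwireConfig`, `resolve_lang`, and the injected `env` mapping are hypothetical names for illustration; the real test uses a TripwireCfg sentinel whose body is not shown here.

```python
import os
from typing import Optional


class TripwireConfig:
    """Hypothetical sentinel: constructing it means the fast path was skipped."""

    constructed = 0

    def __init__(self):
        type(self).constructed += 1
        self.lang_explicit = None


def resolve_lang(config_factory=TripwireConfig, env=None) -> Optional[str]:
    env = os.environ if env is None else env
    for var in ("MEMPALACE_LANG", "MEMPAL_LANG"):
        value = env.get(var)
        if value:
            return value  # fast path: no config object, no disk read
    # Slow path: fall back to the config object.
    return config_factory().lang_explicit


lang = resolve_lang(env={"MEMPALACE_LANG": "fr"})
print(lang, TripwireConfig.constructed)  # fr 0
```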

Tokenizer now filters locale-specific stop words when a language is configured, using the regex.stop_words lists already shipped in each mempalace/i18n/<lang>.json. Adds MempalaceConfig.lang (env, config.json, entity_languages[0], fallback to en) and a cached _resolve_stop_words helper so the search hot path stays O(1) after the first lookup. Default stop_words is an empty frozenset, preserving the pre-change behaviour when nothing is configured. CJK languages without whitespace still tokenize to a single mega-token; segmentation is deferred to a follow-up.

Split the language resolver so the opt-in signal is separable from the display-side fallback:

- MempalaceConfig.lang_explicit returns the locale only when the user set MEMPALACE_LANG / MEMPAL_LANG or config.json["lang"]. It does not inherit from entity_languages.
- MempalaceConfig.lang keeps its existing behaviour (env, file, entity_languages[0], "en") for localized output.
- _resolve_stop_words(None) now reads lang_explicit and returns an empty frozenset when no language is explicitly configured.

Palaces that never set a language get byte-for-byte pre-PR tokenization. Explicit config still activates the filter. Tests cover lang_explicit resolution branches and _resolve_stop_words behaviour on missing config, exception, and explicit lang.

MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25 scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set:

- vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering.
- candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank.

Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope.

Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

The MCP `search_memories` path resolves `_resolve_stop_words(lang)` and threads it into `_hybrid_rank` so MEMPALACE_LANG filters BM25 scoring. The `mempalace search ...` CLI handler calls the same `_hybrid_rank` but with no `stop_words` argument, so locale stop-word filtering only applied to the MCP surface. CLI users with MEMPALACE_LANG set silently got unfiltered scoring. Resolve once at the top of the CLI handler with `_resolve_stop_words(None)` (the helper picks up MempalaceConfig().lang_explicit from env/config) and pass through to `_hybrid_rank`. The empty-frozenset default of `_resolve_stop_words` keeps unconfigured palaces byte-for-byte identical.

- _resolve_stop_words(None) now reads MEMPALACE_LANG / MEMPAL_LANG env vars before constructing MempalaceConfig. The previous path called MempalaceConfig() per search, which json.load's config.json from disk on every hot-path query. Env-var fast path skips the file read in the common case of explicit-locale palaces.
- Canonicalize lang before lru_cache lookup. Variants like 'en' / 'EN' / 'En' previously consumed three separate cache slots pointing at the same set; with maxsize=16 a tenant rotating through capitalizations could thrash. Splits the cache onto a new _stopwords_for_canonical(canonical_lang) keyed by the canonicalized form; _stopwords_for_lang stays as the public wrapper.
- Add an autouse fixture to test_searcher.py that clears the lru_cache and strips MEMPALACE_LANG / MEMPAL_LANG before each test. The five prior tests each did a manual cache_clear() at the top, which made it easy for a future test to forget the call and silently inherit state from whatever ran before it.

Two new tests pin the new behavior:

- test_resolve_stop_words_uses_env_var_before_config (MempalaceConfig must not be constructed when env is set)
- test_resolve_stop_words_canonicalizes_cache_key ('en' / 'EN' / 'En' hit one cache slot)

mvalentsev mentioned this pull request Apr 17, 2026

earcher.py does not use i18n regex patterns — Japanese (and other non-English) search is degraded #973

Closed

mvalentsev marked this pull request as ready for review April 17, 2026 19:15

mvalentsev requested review from bensig, igorls and milla-jovovich as code owners April 17, 2026 19:15

mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from c1ad86a to ead5125 Compare April 18, 2026 21:13

igorls mentioned this pull request Apr 21, 2026

feat(i18n): add Vietnamese language support #1059

Closed

3 tasks

mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from 39dbfef to 95f5eb2 Compare April 21, 2026 06:27

igorls added the enhancement, area/search, and area/i18n labels Apr 24, 2026

mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from 95f5eb2 to a542c94 Compare April 25, 2026 13:33

mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from a542c94 to d6e3608 Compare April 26, 2026 21:53

mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from d6e3608 to fe60e7c Compare April 27, 2026 06:23

mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from fe60e7c to 766f6b5 Compare May 3, 2026 12:42

mvalentsev added 7 commits May 6, 2026 10:34

mvalentsev force-pushed the feat/searcher-i18n-stopwords branch from c61ff57 to 4ec4faf Compare May 6, 2026 05:37

mvalentsev wants to merge 7 commits into MemPalace:develop from mvalentsev:feat/searcher-i18n-stopwords
