feat(i18n): add Hebrew language support #1031

Closed
shaibachar wants to merge 7 commits into MemPalace:develop from shaibachar:add-heb-lang

Conversation

@shaibachar

What does this PR do?

Adds Hebrew language support to the i18n/entity-detection layer.

  • Adds a new Hebrew locale file at /abs/path/mempalace/i18n/he.json
  • Extends i18n coverage tests to include Hebrew sample text in /abs/path/mempalace/tests/test_i18n.py
  • Adds Hebrew-specific entity detection tests in /abs/path/mempalace/tests/test_entity_detector.py

How to test

Run:

python -m pytest tests/test_i18n.py tests/test_entity_detector.py -v
python -m pytest tests/ -v

Expected:

  • Hebrew locale loads successfully
  • Hebrew sample compression test passes
  • Hebrew entity candidate extraction and person-verb scoring tests pass

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

igorls and others added 7 commits April 14, 2026 12:35
release: sync develop → main (v3.3.0 manifest, SECURITY.md, version guard, Pages CNAME)
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (MemPalace#911) + script-aware word
  boundaries for combining-mark scripts (MemPalace#932) + BCP 47 case-insensitive
  locale resolution (MemPalace#928) + i18n patterns wired into miner/palace/
  entity_registry (MemPalace#931)
- Five new fully-supported locales: pt-br (MemPalace#156), ru (MemPalace#760), it (MemPalace#907),
  hi (MemPalace#773), id (MemPalace#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (MemPalace#946)
- KnowledgeGraph lock correctness (MemPalace#884, MemPalace#887)
- Various smaller fixes and improvements
Advisor caught: initial boundary (962776c..develop) skipped PRs that
landed on develop after v3.3.0 tag but before the sync-back merge.
Adds entries for MemPalace#871 MEMPAL_VERBOSE, MemPalace#811 research() local-only
default, MemPalace#866 init .gitignore, MemPalace#864 MCP stdout redirect, MemPalace#863
precompact hook, MemPalace#865 searcher empty results, MemPalace#831 cold-start palace,
MemPalace#862 init help, MemPalace#815 Slack provenance, MemPalace#840 save hook auto-mine.
Also drops the awkward caveat on MemPalace#846 created_at — it's post-v3.3.0.
The version-guard workflow checks that five sources agree:
mempalace/version.py, pyproject.toml, .claude-plugin/marketplace.json,
.claude-plugin/plugin.json, .codex-plugin/plugin.json.

Initial release commit missed the three plugin manifests.
…gin-manifests

release: bump plugin manifests to 3.3.1
@shaibachar shaibachar changed the title Add heb lang feat(i18n): add Hebrew language support Apr 19, 2026
@mvalentsev
Contributor

A couple of things I noticed while reading the diff:

Scope: this PR is bundling a release bump with the locale addition. The Hebrew scope is he.json plus the two test files, but the diff also touches pyproject.toml, mempalace/version.py, uv.lock, the README.md badge, all three plugin manifests (.claude-plugin/marketplace.json, .claude-plugin/plugin.json, .codex-plugin/plugin.json), and adds a full [3.3.1] CHANGELOG block that attributes 20+ other merged PRs (#156, #760, #907, #773, #778, #911, #932, #928, #931, #946, #758, #876, ...). Those version bumps and release notes are usually maintainer work, and keeping them here will conflict with whatever 3.3.1 looks like when it is actually cut. Reverting the non-Hebrew files would leave a cleaner PR to review.

regex.stop_words catches domain nouns: the current list filters ארמון (palace), אגף (wing), ארון (closet), מגירה (drawer), plus generic terms like קובץ, קוד, בדיקה, פרויקט, עבודה. BM25 will strip those from Hebrew queries and documents, so searches that use the palace vocabulary or common project nouns will miss their natural matches. Tightening the list to function words (prepositions, pronouns, copulas) matches what the other locales ship.
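To make the effect concrete, here is a minimal sketch of stop-word filtering ahead of a lexical ranker such as BM25. The helper `strip_stop_words` and both word sets are hypothetical and only illustrate the point; they are not MemPalace's actual tokenizer or the contents of he.json.

```python
def strip_stop_words(text: str, stop_words: set[str]) -> list[str]:
    """Split on whitespace and drop every token found in stop_words."""
    return [tok for tok in text.split() if tok not in stop_words]


# Hypothetical lists: "broad" mixes function words with domain nouns,
# "narrow" keeps only function words.
broad = {"של", "את", "ארמון", "קובץ"}
narrow = {"של", "את"}

query = "קובץ של ארמון"  # roughly "file of palace"

tokens_broad = strip_stop_words(query, broad)    # nothing left to rank on
tokens_narrow = strip_stop_words(query, narrow)  # content nouns survive
```

With the broad list the query is reduced to no tokens at all, so a lexical ranker has nothing to match; with the narrow list both content nouns remain.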

Duplicate את entry: "את" appears twice in the entity.stopwords array (as the 2nd item and again as the 22nd). The same token also repeats inside the space-separated regex.stop_words string. Stopword lookup dedupes on match, so the repeats are harmless but likely unintended.
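A quick way to catch repeats like this before review is a small duplicate check, sketched below. The nested dictionary layout is an assumption inferred from the key names `entity.stopwords` and `regex.stop_words`; the sample values are made up and the real he.json may be structured differently.

```python
from collections import Counter


def find_duplicates(tokens: list[str]) -> list[str]:
    """Return each token that occurs more than once, in first-seen order."""
    counts = Counter(tokens)
    return [tok for tok, n in counts.items() if n > 1]


# Hypothetical locale fragment (assumed layout, made-up values).
locale = {
    "entity": {"stopwords": ["של", "את", "על", "את"]},
    "regex": {"stop_words": "של את על את"},
}

dupes_entity = find_duplicates(locale["entity"]["stopwords"])
dupes_regex = find_duplicates(locale["regex"]["stop_words"].split())
```

Running the same check over both fields covers the array and the space-separated string.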

boundary_chars: Hebrew has the same \b problem as Devanagari and Arabic that #932 introduced boundary_chars for. Without it, \b{name}\b person-verb patterns will not fire reliably when Hebrew names adjoin Hebrew text (ן, ם, י endings in particular). Adding a boundary_chars field with the Hebrew block would let the loader expand \b correctly.
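As a rough illustration of the idea, the sketch below builds a name pattern whose boundaries are explicit lookarounds over a character class instead of `\b`. The helper `name_pattern` and the constant `HEBREW_BLOCK` are hypothetical; this is not the loader from #932, and the actual `boundary_chars` format in the locale files may differ.

```python
import re

# Assumption: use the whole Hebrew Unicode block, U+0590 to U+05FF,
# which covers both letters and niqqud (combining vowel marks).
HEBREW_BLOCK = "\u0590-\u05FF"


def name_pattern(name: str, boundary_chars: str) -> re.Pattern:
    """Match name only when it is not adjacent to any boundary character."""
    cls = f"[{boundary_chars}]"
    return re.compile(f"(?<!{cls}){re.escape(name)}(?!{cls})")


ron = name_pattern("רון", HEBREW_BLOCK)

standalone = ron.search("רון אמר שלום")  # name as its own word
embedded = ron.search("ארון גדול")       # name inside the word for "closet"
with_mark = ron.search("רון\u05b8")      # name followed by a combining mark
```

The name matches as a standalone word but not inside ארון or when a Hebrew combining mark follows it, which is the behavior an expanded boundary is meant to provide.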

@shaibachar shaibachar closed this Apr 19, 2026
@shaibachar
Author

I will clean up the PR and apply the comments.

3 participants