fix: use i18n candidate patterns for entity extraction in miner and palace #931
…alace entity_detector.py was refactored in MemPalace#911 to load candidate patterns from i18n locale JSON files, supporting non-Latin scripts (Cyrillic, accented Latin, etc.). But three other code paths still hardcoded the ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity names in metadata tagging, closet indexing, and registry lookups. Replace the hardcoded regex with a shared _candidate_entity_words() helper that reuses the same i18n candidate_patterns as entity_detector.
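The gap described in this commit message can be reproduced in a few lines. This is only an illustration: `ASCII_ONLY` is the hardcoded regex quoted above, while `CYRILLIC` is a hypothetical stand-in for a locale candidate pattern (the real patterns live in the i18n JSON files and may differ).

```python
import re

# The hardcoded ASCII-only regex quoted in the commit message.
ASCII_ONLY = re.compile(r"[A-Z][a-z]{2,}")

# Hypothetical Cyrillic candidate pattern, for illustration only;
# the actual locale JSON patterns are not shown in this PR page.
CYRILLIC = re.compile(r"[А-ЯЁ][а-яё]{2,}")

text = "Михаил написал код, and Alice reviewed it"

ascii_hits = ASCII_ONLY.findall(text)
cyrillic_hits = CYRILLIC.findall(text)

print(ascii_hits)     # ['Alice']
print(cyrillic_hits)  # ['Михаил']
```

The ASCII class never matches `М` (U+041C), so the Cyrillic name is dropped without any error being raised.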
58db004 to 973bd62
Rebased on develop after #932 landed. candidate_patterns from get_entity_patterns() are now pre-wrapped with boundary + capture group, so _candidate_entity_words() compiles them directly without re-wrapping. Tests pass on all platforms.
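A rough sketch of what "compiles them directly" could look like. The exact wrapper that #932 produces is not shown here, so the lookaround boundaries in `PREWRAPPED` are an assumption, and `find_candidates` is a hypothetical name rather than the helper in this PR.

```python
import re

# Assumed shape of pre-wrapped patterns: a boundary on each side plus a
# single capture group around the name. The real wrapper may differ.
PREWRAPPED = [
    r"(?<!\w)([A-Z][a-z]{1,19})(?!\w)",
    r"(?<!\w)([А-ЯЁ][а-яё]{1,19})(?!\w)",
]


def find_candidates(text, patterns):
    """Compile each pattern as-is and collect the text of capture group 1."""
    found = []
    for raw in patterns:
        rx = re.compile(raw)  # no extra \b(...)\b wrapping added here
        found.extend(m.group(1) for m in rx.finditer(text))
    return found


result = find_candidates("Михаил met Bob", PREWRAPPED)
print(result)  # ['Bob', 'Михаил']
```

Because the capture group is already part of each pattern, the caller only needs `group(1)` and does not have to know which script the pattern targets.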
Hi,
Severity: action required | Category: reliability
How to fix: Log and avoid caching failures
Found by Qodo code review
Fair point on the silent re.error. The try/except is intentionally defensive -- skip a broken pattern rather than crash the whole extraction pipeline. In practice the patterns are simple character classes from static JSON files ([A-Z][a-z]{1,19} and similar), so re.error is not really reachable here. A warning log would be a reasonable addition but out of scope for this PR, which just swaps ASCII-only regex for the i18n-aware version.
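For reference, the warning-log variant discussed in this exchange might look roughly like the following. It is not code from the PR; `compile_patterns` and the logger setup are hypothetical names used only to show the skip-and-log idea.

```python
import logging
import re

logger = logging.getLogger(__name__)


def compile_patterns(raw_patterns):
    """Compile each pattern; log and skip any that fail instead of raising."""
    compiled = []
    for raw in raw_patterns:
        try:
            compiled.append(re.compile(raw))
        except re.error as exc:
            # One bad locale pattern should not abort the whole extraction.
            logger.warning("Skipping invalid candidate pattern %r: %s", raw, exc)
    return compiled


patterns = compile_patterns([r"[A-Z][a-z]{1,19}", r"[unclosed"])
print(len(patterns))  # 1
```

The pipeline still degrades gracefully, but a malformed pattern now leaves a trace in the logs instead of disappearing silently.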
Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
Summary
#911 refactored entity_detector.py to load candidate patterns from i18n locale JSON, supporting non-Latin scripts. But three other code paths still hardcode ASCII-only [A-Z][a-z]{2,} for entity name extraction, silently missing Cyrillic, accented Latin, and other non-ASCII names:

- miner.py _extract_entities_for_metadata() -- drawer metadata tags
- palace.py build_closet_lines() -- closet index entity tags
- entity_registry.py extract_unknown_candidates() -- Wikipedia lookup

For example, mining a file with "Михаил написал код" produces zero entity metadata because [A-Z] never matches Cyrillic uppercase.

Changes

- _candidate_entity_words() helper in palace.py that loads candidate patterns from get_entity_patterns() (same i18n source as entity_detector), with lazy-cached compiled regexes
- diary_ingest.py imports _extract_entities_for_metadata from miner, so it gets the fix automatically
- entity_languages includes "ru"

Test plan

- test_entity_metadata_finds_cyrillic_names
- ruff check + ruff format --check: clean