Skip to content

fix: add explicit UTF-8 encoding to read_text() calls (#776)#946

Merged
igorls merged 1 commit intoMemPalace:developfrom
mvalentsev:fix/utf8-read-text
Apr 16, 2026
Merged

fix: add explicit UTF-8 encoding to read_text() calls (#776)#946
igorls merged 1 commit intoMemPalace:developfrom
mvalentsev:fix/utf8-read-text

Conversation

@mvalentsev
Copy link
Copy Markdown
Contributor

@mvalentsev mvalentsev commented Apr 16, 2026

What

Path.read_text() without encoding defaults to platform encoding on Windows. On locales like GBK this breaks any file written as UTF-8.

8 call sites fixed across 5 files:

  • tests/test_onboarding.py (4x) -- the reported bug: onboarding writes UTF-8 (lines 315, 362), tests read without encoding
  • tests/test_instructions_cli.py (1x) -- reads UTF-8 markdown
  • mempalace/entity_registry.py (1x) -- reads JSON with non-ASCII entity names
  • mempalace/split_mega_files.py (1x) -- reads JSON with non-ASCII known names
  • mempalace/instructions_cli.py (1x) -- prints UTF-8 markdown to terminal

Why explicit encoding, not PYTHONUTF8

Considered alternatives:

  • PYTHONUTF8=1 in CI -- must be set before Python starts, so it can't go in conftest.py. Works for CI but doesn't fix local test runs on Windows. Also a global override of all I/O which can mask real encoding bugs.
  • PEP 686 (Python 3.15) -- UTF-8 will become the default, but mempalace supports >=3.9 so that's years away.
  • Explicit encoding="utf-8" per call -- PEP 597 recommendation. Works on all Python versions, all platforms, no env setup needed.

Test plan

  • CI passes on all platforms (especially test-windows)
  • pytest tests/test_onboarding.py tests/test_instructions_cli.py -v -- 52 passed locally

On Windows with non-UTF-8 locale (e.g. GBK), Path.read_text() defaults
to platform encoding, breaking onboarding tests and any source code that
reads JSON/markdown with non-ASCII content.

5 files, 8 call sites fixed.
@mvalentsev mvalentsev marked this pull request as ready for review April 16, 2026 11:10
@igorls igorls merged commit c5e249b into MemPalace:develop Apr 16, 2026
6 checks passed
igorls added a commit that referenced this pull request Apr 16, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word
  boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive
  locale resolution (#928) + i18n patterns wired into miner/palace/
  entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907),
  hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
@igorls igorls mentioned this pull request Apr 16, 2026
8 tasks
shafdev pushed a commit to shafdev/mempalace that referenced this pull request Apr 17, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (MemPalace#911) + script-aware word
  boundaries for combining-mark scripts (MemPalace#932) + BCP 47 case-insensitive
  locale resolution (MemPalace#928) + i18n patterns wired into miner/palace/
  entity_registry (MemPalace#931)
- Five new fully-supported locales: pt-br (MemPalace#156), ru (MemPalace#760), it (MemPalace#907),
  hi (MemPalace#773), id (MemPalace#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (MemPalace#946)
- KnowledgeGraph lock correctness (MemPalace#884, MemPalace#887)
- Various smaller fixes and improvements
jphein added a commit to jphein/mempalace that referenced this pull request Apr 18, 2026
Upstream v3.3.1 headline: multi-language entity detection (PT-BR, Russian,
Italian, Hindi, Indonesian, Chinese), BCP-47 case-insensitive locale
resolution, UTF-8 encoding fix for non-UTF-8 Windows locales, combining-mark
word-boundary fix for Devanagari/Arabic/Hebrew/Thai scripts. Also absorbs
PR MemPalace#966 (honor silent_save), MemPalace#863 (precompact blocking fix), MemPalace#895/MemPalace#897
(benchmarks + README rewrite), MemPalace#946 (UTF-8 Path.read_text), and more.

Conflict resolutions:

- mempalace/hooks_cli.py: kept fork-only _desktop_toast + _mempalace_python
  (hook wrappers depend on these), adopted upstream's _get_mine_dir helper +
  _mine_sync function + _maybe_auto_ingest(transcript_path) signature, kept
  fork's silent-save MempalaceConfig branch, kept _ingest_transcript call
  in hook_precompact before the new _mine_sync so the session JSONL still
  gets captured on precompact.
- tests/test_hooks_cli.py: merged both import sets. Updated
  test_stop_hook_rejects_injected_stop_hook_active to accept either the
  silent-save systemMessage output or the legacy decision=block — the
  security property (don't treat injected stop_hook_active as truthy) holds
  for both modes.
- tests/test_miner.py: merged both import sets (CHUNK_* constants +
  load_config).
- tests/test_readme_claims.py: accepted upstream's shift from README
  parsing to website/reference/mcp-tools.md and modules.md — website docs
  are now the source of truth for the tool table.
- mempalace/entity_detector.py: accepted upstream wholesale — replaces
  hardcoded English STOPWORDS with i18n-JSON-backed patterns and backward-
  compat module constants.
- README.md: kept fork README, bumped version badge 3.3.0 → 3.3.1.

Verification: 995 tests pass, stop hook fires end-to-end ("✦ 13 memories
woven into the palace"), 10/10 fresh-process palace opens + queries OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants