feat(init): wire confirmed entities into the miner's known-entities registry #1157
The init step's output was a dead file. miner.py has always read
`~/.mempalace/known_entities.json` to tag drawer metadata with
recognized names, but nothing ever wrote it — so init's careful
manifest + git + LLM detection work stopped at `<project>/entities.json`
and never reached the path that actually uses it.
Measured delta on a representative prose snippet (eight sentences
mentioning six real people and four real projects):
- Empty registry: 0 entities recognized (multi-word names fail the
frequency threshold; lowercase/hyphenated project names don't match
the CamelCase regex).
- Registry populated by init: 12 entities recognized (all correct, zero
false positives).
Every recognized name becomes a semicolon-separated metadata tag on the
drawer, which ChromaDB uses for entity-filtered search.
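
As a rough illustration of why the registry matters, the sketch below contrasts a CamelCase heuristic with a registry lookup. Both the regex and the `extract_tags` helper are hypothetical stand-ins (simple case-insensitive substring matching), not the actual logic in `miner.py`.

```python
import re

# Hypothetical stand-in for a CamelCase heuristic; the real pattern may differ.
CAMEL_CASE = re.compile(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b")


def extract_tags(text, registry):
    """Return a semicolon-separated string of registry names found in text."""
    found = []
    lowered = text.lower()
    for names in registry.values():
        for name in names:
            # Naive substring match, used here only to keep the sketch short.
            if name.lower() in lowered and name not in found:
                found.append(name)
    return ";".join(found)


text = "Alice Example reviewed the my-lib patch before foo-bar shipped."
registry = {"people": ["Alice Example"], "projects": ["my-lib", "foo-bar"]}

camel_hits = CAMEL_CASE.findall(text)      # nothing here is CamelCase
empty_tags = extract_tags(text, {})        # empty registry: no tags
full_tags = extract_tags(text, registry)   # populated registry: all three names
print(camel_hits, repr(empty_tags), repr(full_tags))
```

With an empty registry neither the multi-word name nor the hyphenated project names are found; with the registry populated, all three end up in the tag string.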
Implementation:
- `miner.add_to_known_entities({category: [names]})` reads the existing
registry, unions each category (case-insensitively, preserving first-
seen casing), and writes back. The function is tolerant of the two
on-disk shapes miner already supports: list of names, or dict mapping
name → code (dialect-style). In the dict case new names are added as
keys with `None` values so existing codes aren't overwritten.
- Invalidates the in-process mtime cache so same-process callers
(`cmd_init` → `cmd_mine` in one run) see the write immediately.
- Writes with `ensure_ascii=False` so non-ASCII names (Gergő Móricz,
Arturo Domínguez, etc.) stay readable on disk.
- Chmods 0o600 — the registry mirrors confirm-step PII from the user's
git authors and local paths.
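
The merge behaviour described above could be sketched roughly as follows. This is an illustrative approximation, not the code in `miner.py`: the function name `add_to_known_entities_sketch`, the explicit `registry_path` parameter, and the `Lisbon` sample value are assumptions made so the example is self-contained.

```python
import json
import os
import tempfile
from pathlib import Path


def add_to_known_entities_sketch(new_entities, registry_path):
    """Merge {category: [names]} into the JSON registry at registry_path."""
    path = Path(registry_path)
    try:
        registry = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(registry, dict):
            registry = {}
    except (OSError, ValueError):
        registry = {}  # missing or malformed file: start fresh

    for category, names in new_entities.items():
        current = registry.get(category)
        if isinstance(current, dict):
            # Dialect-style shape (name -> code): add new keys with None,
            # never touching codes that are already present.
            seen = {key.lower() for key in current}
            for name in names:
                if name and name.lower() not in seen:
                    current[name] = None
                    seen.add(name.lower())
        else:
            # List shape: case-insensitive union, keeping first-seen casing.
            merged = list(current) if isinstance(current, list) else []
            seen = {n.lower() for n in merged if isinstance(n, str)}
            for name in names:
                if name and name.lower() not in seen:
                    merged.append(name)
                    seen.add(name.lower())
            registry[category] = merged

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(registry, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    os.chmod(path, 0o600)  # owner read/write only
    return str(path)


def demo():
    """Merge into a temporary registry and return (parsed, raw_text)."""
    with tempfile.TemporaryDirectory() as tmp:
        reg = Path(tmp) / "known_entities.json"
        seed = {"people": ["Alice Example"], "places": ["Lisbon"]}
        reg.write_text(json.dumps(seed), encoding="utf-8")
        add_to_known_entities_sketch(
            {"people": ["alice example", "Gergő Móricz"]}, reg
        )
        raw = reg.read_text(encoding="utf-8")
        return json.loads(raw), raw
```

Calling `demo()` shows the three properties in one pass: the lowercase duplicate is dropped in favour of the first-seen casing, the untouched `places` category survives, and the non-ASCII name is stored unescaped on disk.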
cmd_init now calls this at the end of the confirm-entities step, after
the per-project `entities.json` is written (which is kept as an audit
trail the user can inspect or hand-edit). The per-project file is still
excluded from mining via `SKIP_FILENAMES` from the earlier fix.
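
For the same-process `cmd_init` → `cmd_mine` case, an mtime-keyed cache with explicit invalidation might look like the sketch below. The `_cache` dict and the helper names are hypothetical; the point is only that a writer resets the cached mtime rather than relying on the filesystem timestamp changing between two quick writes.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical module-level cache keyed on the registry file's mtime.
_cache = {"mtime": None, "data": {}}


def load_registry(path):
    """Return the parsed registry, re-reading only when the mtime differs."""
    try:
        mtime = os.stat(path).st_mtime_ns
    except OSError:
        return {}
    if _cache["mtime"] != mtime:
        _cache["data"] = json.loads(Path(path).read_text(encoding="utf-8"))
        _cache["mtime"] = mtime
    return _cache["data"]


def invalidate_cache():
    """Force the next load_registry() call to re-read the file."""
    _cache["mtime"] = None


def write_registry(path, data):
    """Write the registry and invalidate the cache in the same step."""
    Path(path).write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    invalidate_cache()


def demo():
    """Write, read, write again, read again, all within one process."""
    with tempfile.TemporaryDirectory() as tmp:
        reg = Path(tmp) / "known_entities.json"
        write_registry(reg, {"people": []})
        first = load_registry(reg)
        write_registry(reg, {"people": ["Alice Example"]})
        second = load_registry(reg)
        return first, second
```

Because `write_registry` resets the cached mtime, the second `load_registry` call re-reads the file even if both writes land within the filesystem's timestamp resolution.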
17 new tests cover: fresh-file creation, list-category union, case-
insensitive dedup, preservation of untouched categories, dict-format
registries, malformed/non-dict file recovery, cache invalidation,
unicode round-trip, and an end-to-end verification that the miner's
`_extract_entities_for_metadata` picks up every registered name.
Pull request overview
Wires mempalace init’s confirmed entities into the global known-entity registry (~/.mempalace/known_entities.json) so the miner can tag drawer metadata with those names during mining, while still keeping <project>/entities.json as a per-project audit trail.
Changes:
- Add `miner.add_to_known_entities()` to merge confirmed entities into the global registry and invalidate the in-process registry cache.
- Update `cmd_init` to write `entities.json` with `ensure_ascii=False` and to call `add_to_known_entities()` after confirmation.
- Add a dedicated offline test suite covering registry merge behavior, error tolerance, cache invalidation, Unicode, and miner metadata extraction.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `mempalace/miner.py` | Adds registry merge helper (`add_to_known_entities`) and cache invalidation for same-process init→mine workflows. |
| `mempalace/cli.py` | Updates init to persist confirmed entities as UTF-friendly JSON and merge them into the global registry. |
| `tests/test_known_entities_registry.py` | Adds comprehensive tests for registry creation/merge semantics and end-to-end miner metadata tagging recall. |
`mempalace/miner.py`

    elif isinstance(current, dict):
        for n in names:
            if n and n not in current:
                current[n] = None

`mempalace/cli.py`

    # global registry the miner reads at mine time.
    if confirmed["people"] or confirmed["projects"]:
        entities_path = Path(args.dir).expanduser().resolve() / "entities.json"
        with open(entities_path, "w") as f:

@copilot apply changes based on the comments in this thread
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/76794fde-2383-4674-ab36-f89ad803eeb2 Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
Applied the requested review fixes in |
…to develop

MemPalace#1148, MemPalace#1150, and MemPalace#1157 were reviewed and merged on GitHub, but the two stacked children landed on their parent feature branches (now stale) rather than on develop. Only MemPalace#1148's commits reached develop via the direct merge.

Release PR MemPalace#1159 (develop → main for v3.3.3) is therefore missing the LLM refinement, Claude-conversation scanner, and miner-registry wire-up that were ostensibly part of the release.

This merge brings the stale `feat/llm-entity-refine` branch (which contains the rolled-up merge commit for MemPalace#1157 → MemPalace#1150 → everything below) into develop so the release tag includes it. No code changes here, only history recovery.
Adds entries to the 3.3.3 section for the work that landed via MemPalace#1148, MemPalace#1150, MemPalace#1157, and MemPalace#1175 (rescued from stacked feature branches into develop via MemPalace#1175). Without these entries the 3.3.3 release notes on main would advertise only the hook/diary/search fixes that made it to develop through the first direct merge.

Covers:
- Manifest + git-author entity detection (MemPalace#1148)
- Regex detector accuracy improvements (MemPalace#1148)
- Optional `--llm` classification with Ollama / openai-compat / Anthropic provider abstraction and interactive UX (MemPalace#1150)
- Claude Code conversation scanner (MemPalace#1150)
- Init → miner registry wire-up so confirmed entities actually reach drawer metadata tagging (MemPalace#1157)
- Case-insensitive project dedup across all sources (MemPalace#1175)
- `mempalace mine` skips the generated `entities.json` artifact
Summary
The init step's entity output was a dead file. `miner.py` has always read `~/.mempalace/known_entities.json` to tag drawer metadata with recognized names, but nothing ever wrote it, so every improvement to init detection (manifest/git/regex/LLM) stopped at `<project>/entities.json` and never reached the path that actually uses it.

This wires init → registry. The per-project file is kept as an audit trail.
Measured value
On a representative prose snippet (eight sentences mentioning six real people and four real projects):
- Empty registry: 0 entities recognized.
- Registry populated by init: 12 entities recognized (all correct, zero false positives).

Multi-word names (`Alice Example`, `Bob Sample`) fail the frequency-threshold fallback because each word only appears once. Lowercase / hyphenated project names (`my-lib`, `foo-bar`) don't match the CamelCase regex. Both categories were completely invisible to the miner until now. Every recognized name becomes a semicolon-separated tag on the drawer, which ChromaDB uses for entity-filtered search.

Why stacked on #1150
Logically independent from the LLM refinement, but each earlier PR in the stack improves the input to the registry: #1148 added manifest/git authors, #1150 adds LLM-classified topics/people. This PR ensures all of that work reaches the one code path that uses it. Merging order is linear: #1148 → #1150 → this.
Implementation
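`miner.add_to_known_entities({category: [names]}) -> str` (new):
- Reads the existing registry, unions each category case-insensitively (preserving first-seen casing), and writes it back.
- Tolerates the two on-disk shapes the miner already supports: a list of names, or a dict mapping `name → code` (dialect-style). In the dict case, new names are added as keys with `None` values so existing codes aren't overwritten.
- Preserves untouched categories: merging `{people: [...]}` does not clobber an existing `places` or `projects` category.
- Invalidates the in-process mtime cache so `cmd_init` → `cmd_mine` in one run sees the write immediately.
- Writes with `ensure_ascii=False` so non-ASCII names stay readable.
- Applies `chmod 0o600`, because the registry mirrors the user's confirm-step PII.

`cmd_init` now calls it at the end of the confirm-entities step, after the per-project `entities.json` is written.

Tests
17 new tests (`tests/test_known_entities_registry.py`), all offline:
- Fresh-file creation
- List-category union
- Case-insensitive dedup
- Preservation of untouched categories
- Dict-format registries
- Malformed/non-dict file recovery
- Cache invalidation
- Unicode round-trip
- End-to-end: `_extract_entities_for_metadata` picks up every registered name

Full suite: 1221 passed, ruff clean.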
Test plan
- `uv run pytest tests/ --ignore=tests/benchmarks`: full suite passes
- `ruff check mempalace/ tests/`: clean
- `ruff format --check mempalace/ tests/`: clean
- `mempalace init <repo>` then `mempalace mine <repo>`, confirm drawer metadata contains the registered names