
feat(init): optional LLM-assisted entity refinement + Claude Code convo scanner (phase 2)#1150

Merged

igorls merged 6 commits into feat/project-scanner-entity-detection from feat/llm-entity-refine on Apr 24, 2026

Conversation

igorls (Member) commented Apr 24, 2026

Summary

Implements #1149 — phase-2 of the init entity detection work, stacked on top of #1148.

Adds an opt-in LLM refinement step (mempalace init --llm) that takes the candidate set produced by phase-1 detection and reclassifies each entity as PERSON / PROJECT / TOPIC / COMMON_WORD / AMBIGUOUS. Default behaviour is unchanged — no LLM, no network, no API keys required.

Also adds a deterministic convo_scanner that parses ~/.claude/projects/ session directories into project entities by reading each session's cwd metadata, avoiding the lossy slug-decoding problem.

Why stacked on #1148

The two PRs build incrementally: reviewing them together is easier than reviewing #1149's implementation in isolation. Merge order: #1148 first, then this one rebases onto develop.

What's in this PR

mempalace/convo_scanner.py — Parse Claude Code conversation directories into ProjectInfo.

  • Reads cwd from session JSONL records for accurate project names (slug decoding is lossy — it can't distinguish foo-bar, one segment, from foo/bar, two segments); see the sketch after this list.
  • Falls back to slug-decoding only if JSONL is malformed.
  • Dedups by name, preferring the entry with more sessions.
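
A minimal sketch of the cwd-recovery idea (the function name and everything except the `cwd` field are illustrative, not the module's actual API):

```python
import json
from pathlib import Path


def read_session_cwd(session_file: Path) -> str | None:
    """Return the cwd recorded in the first parseable JSONL record, else None."""
    try:
        with session_file.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate malformed records
                if record.get("cwd"):
                    return str(record["cwd"])
    except OSError:
        pass  # unreadable session file
    return None  # caller falls back to lossy slug decoding
```

The project name is then just `Path(cwd).name`, with the hyphen ambiguity gone.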

mempalace/llm_client.py — Pluggable provider abstraction, no external SDKs (stdlib urllib only); see the sketch after this list.

  • ollama (default local, zero-API).
  • openai-compat for OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Together, Fireworks.
  • anthropic for the Messages API.
  • JSON-mode plumbing normalized across providers.
  • Fast check_available() probe before first use.
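
The no-SDK constraint comes down to one small stdlib HTTP helper plus the single wrapped error type; a sketch under assumed names (`_post_json` and its exact signature are illustrative):

```python
import json
import urllib.request


class LLMError(Exception):
    """One error type: callers need not distinguish transport, auth, and parse failures."""


def _post_json(url: str, payload: dict, headers: dict[str, str] | None = None) -> dict:
    """POST a JSON body and decode a JSON response using only the stdlib."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError) as exc:  # URLError is an OSError; bad JSON is a ValueError
        raise LLMError(f"{url}: {exc}") from exc
```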

mempalace/llm_refine.py — The refinement step.

  • Batches 25 candidates per call and collects up to 3 context lines per candidate, each capped at 240 chars — bounding total input to ~50-100K tokens regardless of corpus size.
  • Interactive progress on stderr (overwrite-line bar with current candidate name).
  • Ctrl-C cancels cleanly and returns whatever was classified so far; partial result is safe to pass straight to confirm_entities.
  • Defensive response parser: handles label/type keys, case variants, top-level list vs wrapped object, and markdown code fences (see the sketch after this list). Unknown labels become AMBIGUOUS so the user reviews them.
  • Manifest-backed projects (conf >= 0.95) and git-authored people are not sent to the LLM — they're already authoritative.
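
To make the parser bullet concrete, a sketch of the normalization, assuming hypothetical wrapper keys (`items`/`classifications`) and helper name:

```python
import json

VALID_LABELS = {"PERSON", "PROJECT", "TOPIC", "COMMON_WORD", "AMBIGUOUS"}


def parse_labels(raw: str) -> dict[str, str]:
    """Map candidate name -> canonical label, tolerating common model quirks."""
    text = raw.strip()
    if text.startswith("```"):  # strip a markdown code fence if present
        text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
    data = json.loads(text)
    if isinstance(data, dict):  # wrapped object vs top-level list
        data = data.get("items") or data.get("classifications") or []
    out: dict[str, str] = {}
    for item in data:
        label = str(item.get("label") or item.get("type") or "").upper()
        out[str(item.get("name", ""))] = label if label in VALID_LABELS else "AMBIGUOUS"
    return out
```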

CLI flags (opt-in, default zero-API):

mempalace init <dir> --llm
mempalace init <dir> --llm --llm-provider ollama --llm-model gemma4:e4b
mempalace init <dir> --llm --llm-provider openai-compat \
                     --llm-endpoint http://localhost:1234/v1 \
                     --llm-model <model>
mempalace init <dir> --llm --llm-provider anthropic \
                     --llm-model claude-haiku-4-5

API keys fall back to $OPENAI_API_KEY / $ANTHROPIC_API_KEY when not passed explicitly.
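
When keys fall back to the environment, the lookup is per-provider; a sketch (the helper name is hypothetical):

```python
import os

_ENV_VARS = {"openai-compat": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def resolve_api_key(provider: str, explicit: str | None) -> str | None:
    """Prefer an explicit --llm-api-key; otherwise consult the provider's env var."""
    if explicit:
        return explicit
    var = _ENV_VARS.get(provider)  # ollama is local and needs no key
    return os.environ.get(var) if var else None
```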

Tests

76 new unit tests across the three new modules, all running offline with mocked HTTP and a FakeProvider. Covers:

  • Manifest/JSONL parsing with malformed-input tolerance
  • Every provider's JSON-mode plumbing and error paths
  • Prompt construction and context collection
  • Response parser variants (label vs type, code fences, canonical casing, top-level list)
  • End-to-end refine + Ctrl-C partial-result + error-tolerant batching

Full suite: all existing tests still pass; ruff clean.

Known limits / future work

  • Response time on local models: small models (4B) take ~5-10s per batch, so a corpus with 100+ candidates takes several minutes. The progress bar mitigates the UX impact, and the user can Ctrl-C if the wait is unwanted.
  • No streaming: Providers are called in non-streaming mode to keep the JSON parser simple. Fine for structured output; streaming would only help UX, not correctness.
  • Single-turn only: No multi-turn refinement where the LLM asks clarifying questions. Not needed at init time — the regex pass has already narrowed the space.
  • Context is line-based: For transcripts where one "line" is a whole message, we may truncate aggressively. Acceptable tradeoff for token bounds; an improvement would be token-window sampling.

Test plan

  • uv run pytest tests/ --ignore=tests/benchmarks — full suite passes
  • ruff check mempalace/ tests/ — clean
  • ruff format --check mempalace/ tests/ — clean
  • CLI flags parse: mempalace init --help shows all --llm-* options
  • Local-provider smoke test: classify a small candidate set against Ollama and verify JSON round-trip
  • Reviewer verification with a different model and/or provider (e.g. openai-compat to LM Studio or OpenRouter)
  • Soak test on a large prose corpus (transcripts) to validate scale assumptions and tune batch size if needed

igorls added 4 commits April 24, 2026 00:46
Claude Code stores sessions under `~/.claude/projects/<slug>/<id>.jsonl`
where `<slug>` is the original CWD with `/` replaced by `-`. That
encoding is lossy — can't distinguish `foo-bar` (one segment) from
`foo/bar` (two) — so slug-decoding alone produces wrong names for any
hyphenated project.

Fortunately, every message record carries a `cwd` field with the true
path. This scanner reads one record per session to recover the
accurate project name deterministically, falling back to slug-decoding
only if the JSONL is malformed or empty.

Output shape matches project_scanner.ProjectInfo so the discover
orchestrator can union results across sources. Session count doubles
as a density signal for ranking.

22 unit tests cover: root detection, cwd extraction with malformed
input tolerance, fallback slug decoding, name resolution using the
newest session (so renames win), and dedup when two encoded dirs
resolve to the same project.
Three providers cover the useful space while keeping the zero-API
default:

- `ollama` (default): local models via http://localhost:11434. Works
  fully offline. Tag-matching check accepts both `model` and
  `model:latest` forms.
- `openai-compat`: any /v1/chat/completions endpoint. Covers
  OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Together,
  Fireworks, and most self-hosted frameworks. API key falls back to
  $OPENAI_API_KEY. Endpoint normalization is forgiving about trailing
  `/v1`.
- `anthropic`: Messages API v2023-06-01. API key falls back to
  $ANTHROPIC_API_KEY. Concatenates multi-block text responses.

JSON mode is normalized across providers — Ollama uses
`format: "json"`, OpenAI-compat uses `response_format`, Anthropic uses
prompt-level instruction. Callers request JSON once; this module
handles the provider-specific plumbing.
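
Roughly, the per-provider request bodies differ like this (a sketch of the public wire formats, not this module's exact code; the Anthropic `max_tokens` value is an arbitrary example, though the field itself is required by that API):

```python
def json_mode_payload(provider: str, model: str, prompt: str) -> dict:
    """Build a JSON-mode request body for each supported provider."""
    if provider == "ollama":  # POST /api/generate
        return {"model": model, "prompt": prompt, "format": "json", "stream": False}
    if provider == "openai-compat":  # POST /v1/chat/completions
        return {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "response_format": {"type": "json_object"},
        }
    # anthropic POST /v1/messages: no JSON-mode switch, so instruct via the prompt
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt + "\n\nRespond with JSON only."}],
    }
```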

No external SDK dependency; stdlib `urllib` throughout. HTTP errors
are wrapped into a single `LLMError` class so callers don't need to
distinguish transport, auth, and parse failures at the call site.

26 tests, all with mocked HTTP — suite runs offline with no real
provider required.
Takes the candidate set produced by phase-1 detection (manifests, git
authors, regex on prose) and asks an LLM to reclassify each candidate
as PERSON / PROJECT / TOPIC / COMMON_WORD / AMBIGUOUS.

Scale approach: never feed the raw corpus to the LLM. For each
candidate, collect up to 3 context lines from sampled prose, cap each
at 240 chars, batch 25 candidates per call. Keeps total input around
50-100K tokens even on large corpora and completes in a few minutes
on a 4B local model.
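
The token bound falls out of three constants; a sketch of the arithmetic (constant and function names assumed):

```python
BATCH_SIZE = 25      # candidates per LLM call
CONTEXT_LINES = 3    # sampled prose lines per candidate
LINE_CAP = 240       # max chars per context line


def batches(candidates: list, size: int = BATCH_SIZE):
    """Yield fixed-size slices of the candidate list."""
    for i in range(0, len(candidates), size):
        yield candidates[i : i + size]


# Worst case per call: 25 * 3 * 240 = 18,000 chars, roughly 4-5K tokens,
# independent of how large the underlying corpus is.
```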

Interactive UX:
- Stderr progress bar with the current candidate name, updates
  per-batch.
- Ctrl-C interrupts cleanly: returns a RefineResult with
  `cancelled=True` and whatever was classified before the interrupt.
  The partial result is safe to pass straight to confirm_entities.
- Per-batch errors (transport, parse) are recorded in `errors` and
  don't abort the whole run.
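
A sketch of that control flow (`RefineResult` fields beyond `cancelled` and `errors` are assumptions, and `classify_batch` stands in for the real per-batch call):

```python
from dataclasses import dataclass, field


@dataclass
class RefineResult:
    classifications: dict[str, str] = field(default_factory=dict)
    cancelled: bool = False
    errors: list[str] = field(default_factory=list)


def refine(candidate_batches, classify_batch) -> RefineResult:
    """Classify batch by batch; Ctrl-C returns the partial result instead of raising."""
    result = RefineResult()
    try:
        for batch in candidate_batches:
            try:
                result.classifications.update(classify_batch(batch))
            except Exception as exc:  # per-batch transport/parse failure: record, keep going
                result.errors.append(str(exc))
    except KeyboardInterrupt:  # Ctrl-C between or within batches
        result.cancelled = True
    return result
```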

Refinement scope: only `uncertain` and low-confidence `projects`
entries are sent. Manifest-backed projects (conf >= 0.95) and git-
authored people are already authoritative and skip the LLM.

Response parser is defensive — accepts `label` or `type` keys,
lowercase/uppercase variants, top-level list or wrapped object, and
strips markdown code fences. Unknown labels become AMBIGUOUS so the
user reviews them rather than silently accepting a bad classification.

`collect_corpus_text` provides a simple stratified prose sampler
(recent first, capped per-file) so callers don't need to build their
own corpus window.

28 tests with a FakeProvider (no network). Covers context collection,
prompt building, response parsing variants, classification apply,
end-to-end refine, and Ctrl-C partial-result behavior.
Extends the init orchestrator to consume two new signal sources:

1. Claude Code conversation dirs: when the target is a
   `~/.claude/projects/` root, convo_scanner contributes ProjectInfo
   entries alongside the git/manifest projects. Dedup is by name,
   preferring the entry with more user-authored activity.
2. Optional LLM refinement: when --llm is passed, discover_entities
   constructs the provider, validates availability, and runs
   llm_refine.refine_entities on the merged candidates. Status
   summary (reclassified / dropped / cancelled / batch errors)
   prints to stderr.

New init flags (opt-in, default remains zero-API):
- --llm: enable refinement
- --llm-provider: ollama (default) | openai-compat | anthropic
- --llm-model: default gemma4:e4b for Ollama
- --llm-endpoint: URL (required for openai-compat)
- --llm-api-key: falls back to env ($ANTHROPIC_API_KEY or
  $OPENAI_API_KEY depending on provider)

Provider check_available runs before the scan, so the user sees an
immediate error ("Run: ollama pull <model>" or "ANTHROPIC_API_KEY not
set") rather than a mid-scan failure.
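
For the Ollama case, such a probe can be a single tags lookup (a sketch against the public /api/tags endpoint; the message wording is illustrative):

```python
import json
import urllib.request


def ollama_check(endpoint: str, model: str) -> str | None:
    """Return an actionable error message, or None if the provider is ready."""
    try:
        with urllib.request.urlopen(f"{endpoint}/api/tags", timeout=5) as resp:
            tags = json.loads(resp.read().decode("utf-8"))
    except OSError:
        return "Ollama not reachable. Is the server running on this endpoint?"
    names = {m.get("name", "") for m in tags.get("models", [])}
    if model not in names and f"{model}:latest" not in names:  # accept both tag forms
        return f"Model not found. Run: ollama pull {model}"
    return None
```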
Copilot AI review requested due to automatic review settings April 24, 2026 03:47

Copilot AI (Contributor) left a comment

Pull request overview

Adds an opt-in, provider-pluggable LLM refinement step to improve mempalace init entity classification for prose-heavy corpora, and introduces a deterministic Claude Code conversation scanner to extract project names from ~/.claude/projects/ sessions (using per-session cwd metadata rather than lossy slug decoding).

Changes:

  • Add mempalace init --llm with provider/model/endpoint/api-key flags and provider availability checks.
  • Introduce mempalace/llm_client.py (Ollama, OpenAI-compatible, Anthropic) and mempalace/llm_refine.py (batching, context sampling, robust parsing, Ctrl‑C partial results).
  • Add mempalace/convo_scanner.py and wire it into discover_entities() for .claude/projects/ roots; expand unit tests to cover new modules.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| uv.lock | Adds tomli marker dependency for Python < 3.11. |
| mempalace/cli.py | Wires new --llm* flags into init and constructs/checks the provider before scanning. |
| mempalace/project_scanner.py | Extends discover_entities() to optionally scan Claude projects roots and run LLM refinement. |
| mempalace/convo_scanner.py | New scanner for ~/.claude/projects/ sessions, extracting project names from JSONL cwd. |
| mempalace/llm_client.py | New minimal HTTP-only provider abstraction (ollama/openai-compat/anthropic) + availability probes. |
| mempalace/llm_refine.py | New refinement pass: batching, context collection, response parsing, merge logic, and corpus sampling. |
| tests/test_llm_client.py | Unit tests for provider factory, HTTP wrapper, and provider request/response handling. |
| tests/test_llm_refine.py | Unit tests for prompt/context building, response parsing variants, merging, batching, and Ctrl-C partials. |
| tests/test_convo_scanner.py | Unit tests for Claude projects root detection, cwd extraction, slug fallback, and dedup/ranking. |


Comment on lines +99 to +103
sessions = sorted(
    (p for p in project_dir.iterdir() if p.is_file() and p.suffix == ".jsonl"),
    key=lambda p: p.stat().st_mtime,
    reverse=True,  # newest first — most likely to be well-formed
)

Copilot AI commented Apr 24, 2026

_resolve_project_name() sorts session files by p.stat().st_mtime without handling OSError. If a .jsonl file is unreadable, broken, or permission-restricted, this will raise during sorting and can abort scanning the entire Claude projects tree. Consider wrapping stat() in a safe helper (e.g., defaulting mtime to 0 on error) or filtering out paths that fail stat before sorting.

Two further comment threads on mempalace/llm_refine.py (outdated).
igorls added 2 commits April 24, 2026 01:30
…oritative sources

Addresses issues found while reviewing the initial phase-2 implementation
against real data:

**Bug: uncertain bucket starved from the LLM.**
`discover_entities` was dropping the regex-uncertain bucket whenever real
git/manifest signal existed — which is exactly when `--llm` is most useful
for cleaning up prose noise. The uncertain candidates never reached the
refinement step. Fixed: only drop when `llm_provider is None`.

**Context collection: word boundaries, not substring.**
`_collect_contexts` used substring matching on lower-cased lines, so the
name "Go" matched "good", "going", "forgot". Switched to a
`(?<!\w)…(?!\w)` regex so short names only match at token boundaries.
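
Concretely (the helper name is assumed):

```python
import re


def name_pattern(name: str) -> re.Pattern[str]:
    # Token-boundary match: "Go" hits "Go is fast" but not "good", "going", "forgot".
    return re.compile(rf"(?<!\w){re.escape(name)}(?!\w)", re.IGNORECASE)
```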

**Authoritative-source detection replaces confidence threshold.**
Previously the refinement step skipped entries with `confidence >= 0.95`
to avoid second-guessing manifest-backed projects. That threshold was
fragile — the regex detector produces 0.99 confidence for things like
`code file reference (5x)` on framework names (OpenAPI, etc.), so those
skipped the LLM despite being regex-only noise. New helpers
`_is_authoritative_person` / `_is_authoritative_project` look at the
actual signal strings (commits, package.json, etc.) to decide.
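
The shape of those helpers, as a sketch (the marker strings and the signals-list argument are illustrative; the real module keys off its own signal strings):

```python
AUTHORITATIVE_PROJECT_MARKERS = ("package.json", "manifest")


def is_authoritative_project(signals: list[str]) -> bool:
    """Trust provenance, not score: a regex-only 0.99 still goes to the LLM."""
    return any(marker in s for s in signals for marker in AUTHORITATIVE_PROJECT_MARKERS)


def is_authoritative_person(signals: list[str]) -> bool:
    """Git-authored people carry commit-derived signals and skip the LLM."""
    return any("commit" in s for s in signals)
```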

**Now also refines regex-derived people.**
After #1148's high-pronoun-signal fix, the regex detector can promote
non-people to the `people` bucket (e.g. a capitalized common noun that
happened to appear near pronouns). The LLM now gets a chance to clean
those up, while git-authored people are still skipped.

**Robust JSON extraction.**
Small local models routinely wrap JSON output in prose ("Sure, here's
the classification: {…}"). The previous code-fence stripper failed on
that. `_extract_json_candidates` now does balanced-bracket extraction
with string-aware quote handling, so it recovers JSON from:
- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays
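
A simplified sketch of that extraction (the real `_extract_json_candidates` may differ in detail; callers would typically `json.loads` each span and keep the ones that parse):

```python
def extract_json_candidates(text: str) -> list[str]:
    """Return balanced {...} / [...] spans, ignoring brackets inside JSON strings."""
    spans: list[str] = []
    depth, start = 0, -1
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            if depth == 0:
                start = i
            depth += 1
        elif ch in "}]" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start : i + 1])
    return spans
```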

**Prompt guidance for frameworks vs user projects.**
Added an explicit instruction: frameworks, runtimes, APIs, cloud
services, and third-party vendors (Angular, OpenAPI, Terraform, Bun,
Google, etc.) are TOPIC unless the context clearly says it's the user's
own codebase. Directly addresses a false-positive pattern observed
during dev runs.

**Defensive mtime.**
`convo_scanner._safe_mtime` catches OSError during `stat()` — permission
changes, filesystem races, broken symlinks — and sorts the affected file
to the end of the newest-first order rather than crashing the scan.
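
As a sketch, mirroring the review suggestion:

```python
from pathlib import Path


def _safe_mtime(path: Path) -> float:
    """mtime for newest-first sorting; files that fail stat() sort last, as if oldest."""
    try:
        return path.stat().st_mtime
    except OSError:  # permission changes, filesystem races, broken symlinks
        return 0.0
```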

**Cosmetic:** merged two adjacent f-strings on the same line in
`backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON
extraction variants, authoritative-source helpers, refining high-
confidence regex projects, and end-to-end LLM refinement preserving the
uncertain bucket.
@igorls igorls added the enhancement New feature or request label Apr 24, 2026
@igorls igorls merged commit 47c185e into feat/project-scanner-entity-detection Apr 24, 2026
igorls added a commit that referenced this pull request Apr 24, 2026
…-develop

chore: rescue merged stacked PRs #1150 and #1157 into develop
shrhoads pushed a commit to shrhoads/mempalace that referenced this pull request Apr 24, 2026
…to develop

MemPalace#1148, MemPalace#1150, and MemPalace#1157 were reviewed and merged on GitHub, but the two
stacked children landed on their parent feature branches (now stale)
rather than on develop. Only MemPalace#1148's commits reached develop via the
direct merge. Release PR MemPalace#1159 (develop → main for v3.3.3) is therefore
missing the LLM refinement, Claude-conversation scanner, and miner-
registry wire-up that were ostensibly part of the release.

This merge brings the stale `feat/llm-entity-refine` branch (which
contains the rolled-up merge commit for MemPalace#1157 → MemPalace#1150 → everything
below) into develop so the release tag includes it.

No code changes here — only history recovery.
shrhoads pushed a commit to shrhoads/mempalace that referenced this pull request Apr 24, 2026
Adds entries to the 3.3.3 section for the work that landed via MemPalace#1148,
MemPalace#1150, MemPalace#1157, and MemPalace#1175 (rescued from stacked feature branches into
develop via MemPalace#1175). Without these entries the 3.3.3 release notes on
main would advertise only the hook/diary/search fixes that made it to
develop through the first direct merge.

Covers:
- Manifest + git-author entity detection (MemPalace#1148)
- Regex detector accuracy improvements (MemPalace#1148)
- Optional --llm classification with Ollama / openai-compat / Anthropic
  provider abstraction and interactive UX (MemPalace#1150)
- Claude Code conversation scanner (MemPalace#1150)
- Init → miner registry wire-up so confirmed entities actually reach
  drawer metadata tagging (MemPalace#1157)
- Case-insensitive project dedup across all sources (MemPalace#1175)
- `mempalace mine` skips the generated entities.json artifact