
feat(init): optional LLM-assisted entity refinement + Claude Code convo scanner (phase 2)#1150

Merged

igorls merged 6 commits into feat/project-scanner-entity-detection from feat/llm-entity-refine on Apr 24, 2026

Conversation

igorls (Member) commented Apr 24, 2026

Summary

Implements #1149 — phase-2 of the init entity detection work, stacked on top of #1148.

Adds an opt-in LLM refinement step (mempalace init --llm) that takes the candidate set produced by phase-1 detection and reclassifies each entity as PERSON / PROJECT / TOPIC / COMMON_WORD / AMBIGUOUS. Default behaviour is unchanged — no LLM, no network, no API keys required.

Also adds a deterministic convo_scanner that parses ~/.claude/projects/ session directories into project entities by reading each session's cwd metadata, avoiding the lossy slug-decoding problem.

Why stacked on #1148

The two PRs build incrementally: reviewing them together is easier than reviewing #1149's implementation in isolation. Merge order: #1148 first, then this one rebases onto develop.

What's in this PR

mempalace/convo_scanner.py — Parse Claude Code conversation directories into ProjectInfo.

  • Reads cwd from session JSONL records for accurate project names (slug decoding is lossy — it can't distinguish foo-bar, one segment, from foo/bar, two segments); see the sketch after this list.
  • Falls back to slug-decoding only if JSONL is malformed.
  • Dedups by name, preferring the entry with more sessions.
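
A minimal sketch of the cwd-recovery idea (the function name and everything except the `cwd` field are illustrative, not the module's actual API):

```python
import json
from pathlib import Path


def read_session_cwd(session_file: Path) -> str | None:
    """Return the cwd recorded in the first parseable JSONL record, else None."""
    try:
        with session_file.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate malformed records
                if record.get("cwd"):
                    return str(record["cwd"])
    except OSError:
        pass  # unreadable session file
    return None  # caller falls back to lossy slug decoding
```

The project name is then just `Path(cwd).name`, with the hyphen ambiguity gone.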

mempalace/llm_client.py — Pluggable provider abstraction, no external SDKs (stdlib urllib only); see the sketch after this list.

  • ollama (default local, zero-API).
  • openai-compat for OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Together, Fireworks.
  • anthropic for the Messages API.
  • JSON-mode plumbing normalized across providers.
  • Fast check_available() probe before first use.
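
The no-SDK constraint comes down to one small stdlib HTTP helper plus the single wrapped error type; a sketch under assumed names (`_post_json` and its exact signature are illustrative):

```python
import json
import urllib.request


class LLMError(Exception):
    """One error type: callers need not distinguish transport, auth, and parse failures."""


def _post_json(url: str, payload: dict, headers: dict[str, str] | None = None) -> dict:
    """POST a JSON body and decode a JSON response using only the stdlib."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError) as exc:  # URLError is an OSError; bad JSON is a ValueError
        raise LLMError(f"{url}: {exc}") from exc
```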

mempalace/llm_refine.py — The refinement step.

  • Batches 25 candidates per call and collects up to 3 context lines per candidate, each capped at 240 chars — bounding total input to ~50-100K tokens regardless of corpus size.
  • Interactive progress on stderr (overwrite-line bar with current candidate name).
  • Ctrl-C cancels cleanly and returns whatever was classified so far; partial result is safe to pass straight to confirm_entities.
  • Defensive response parser: handles label/type keys, case variants, top-level list vs wrapped object, and markdown code fences (see the sketch after this list). Unknown labels become AMBIGUOUS so the user reviews them.
  • Manifest-backed projects (conf >= 0.95) and git-authored people are not sent to the LLM — they're already authoritative.
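
To make the parser bullet concrete, a sketch of the normalization, assuming hypothetical wrapper keys (`items`/`classifications`) and helper name:

```python
import json

VALID_LABELS = {"PERSON", "PROJECT", "TOPIC", "COMMON_WORD", "AMBIGUOUS"}


def parse_labels(raw: str) -> dict[str, str]:
    """Map candidate name -> canonical label, tolerating common model quirks."""
    text = raw.strip()
    if text.startswith("```"):  # strip a markdown code fence if present
        text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
    data = json.loads(text)
    if isinstance(data, dict):  # wrapped object vs top-level list
        data = data.get("items") or data.get("classifications") or []
    out: dict[str, str] = {}
    for item in data:
        label = str(item.get("label") or item.get("type") or "").upper()
        out[str(item.get("name", ""))] = label if label in VALID_LABELS else "AMBIGUOUS"
    return out
```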

CLI flags (opt-in, default zero-API):

mempalace init <dir> --llm
mempalace init <dir> --llm --llm-provider ollama --llm-model gemma4:e4b
mempalace init <dir> --llm --llm-provider openai-compat \
                     --llm-endpoint http://localhost:1234/v1 \
                     --llm-model <model>
mempalace init <dir> --llm --llm-provider anthropic \
                     --llm-model claude-haiku-4-5

API keys fall back to $OPENAI_API_KEY / $ANTHROPIC_API_KEY when not passed explicitly.
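
When keys fall back to the environment, the lookup is per-provider; a sketch (the helper name is hypothetical):

```python
import os

_ENV_VARS = {"openai-compat": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def resolve_api_key(provider: str, explicit: str | None) -> str | None:
    """Prefer an explicit --llm-api-key; otherwise consult the provider's env var."""
    if explicit:
        return explicit
    var = _ENV_VARS.get(provider)  # ollama is local and needs no key
    return os.environ.get(var) if var else None
```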

Tests

76 new unit tests across the three new modules, all running offline with mocked HTTP and a FakeProvider. Covers:

  • Manifest/JSONL parsing with malformed-input tolerance
  • Every provider's JSON-mode plumbing and error paths
  • Prompt construction and context collection
  • Response parser variants (label vs type, code fences, canonical casing, top-level list)
  • End-to-end refine + Ctrl-C partial-result + error-tolerant batching

Full suite: all existing tests still pass; ruff clean.

Known limits / future work

  • Response time on local models: small models (4B) take ~5-10s per batch, so a corpus with 100+ candidates takes several minutes. The progress bar mitigates the UX impact, and the user can Ctrl-C if the wait is unwanted.
  • No streaming: Providers are called in non-streaming mode to keep the JSON parser simple. Fine for structured output; streaming would only help UX, not correctness.
  • Single-turn only: No multi-turn refinement where the LLM asks clarifying questions. Not needed at init time — the regex pass has already narrowed the space.
  • Context is line-based: For transcripts where one "line" is a whole message, we may truncate aggressively. Acceptable tradeoff for token bounds; an improvement would be token-window sampling.

Test plan

  • uv run pytest tests/ --ignore=tests/benchmarks — full suite passes
  • ruff check mempalace/ tests/ — clean
  • ruff format --check mempalace/ tests/ — clean
  • CLI flags parse: mempalace init --help shows all --llm-* options
  • Local-provider smoke test: classify a small candidate set against Ollama and verify JSON round-trip
  • Reviewer verification with a different model and/or provider (e.g. openai-compat to LM Studio or OpenRouter)
  • Soak test on a large prose corpus (transcripts) to validate scale assumptions and tune batch size if needed

igorls added 4 commits April 24, 2026 00:46
Claude Code stores sessions under `~/.claude/projects/<slug>/<id>.jsonl`
where `<slug>` is the original CWD with `/` replaced by `-`. That
encoding is lossy — can't distinguish `foo-bar` (one segment) from
`foo/bar` (two) — so slug-decoding alone produces wrong names for any
hyphenated project.

Fortunately, every message record carries a `cwd` field with the true
path. This scanner reads one record per session to recover the
accurate project name deterministically, falling back to slug-decoding
only if the JSONL is malformed or empty.

Output shape matches project_scanner.ProjectInfo so the discover
orchestrator can union results across sources. Session count doubles
as a density signal for ranking.

22 unit tests cover: root detection, cwd extraction with malformed
input tolerance, fallback slug decoding, name resolution using the
newest session (so renames win), and dedup when two encoded dirs
resolve to the same project.
Three providers cover the useful space while keeping the zero-API
default:

- `ollama` (default): local models via http://localhost:11434. Works
  fully offline. Tag-matching check accepts both `model` and
  `model:latest` forms.
- `openai-compat`: any /v1/chat/completions endpoint. Covers
  OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Together,
  Fireworks, and most self-hosted frameworks. API key falls back to
  $OPENAI_API_KEY. Endpoint normalization is forgiving about trailing
  `/v1`.
- `anthropic`: Messages API v2023-06-01. API key falls back to
  $ANTHROPIC_API_KEY. Concatenates multi-block text responses.

JSON mode is normalized across providers — Ollama uses
`format: "json"`, OpenAI-compat uses `response_format`, Anthropic uses
prompt-level instruction. Callers request JSON once; this module
handles the provider-specific plumbing.
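
Roughly, the per-provider request bodies differ like this (a sketch of the public wire formats, not this module's exact code; the Anthropic `max_tokens` value is an arbitrary example, though the field itself is required by that API):

```python
def json_mode_payload(provider: str, model: str, prompt: str) -> dict:
    """Build a JSON-mode request body for each supported provider."""
    if provider == "ollama":  # POST /api/generate
        return {"model": model, "prompt": prompt, "format": "json", "stream": False}
    if provider == "openai-compat":  # POST /v1/chat/completions
        return {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "response_format": {"type": "json_object"},
        }
    # anthropic POST /v1/messages: no JSON-mode switch, so instruct via the prompt
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt + "\n\nRespond with JSON only."}],
    }
```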

No external SDK dependency; stdlib `urllib` throughout. HTTP errors
are wrapped into a single `LLMError` class so callers don't need to
distinguish transport, auth, and parse failures at the call site.

26 tests, all with mocked HTTP — suite runs offline with no real
provider required.
Takes the candidate set produced by phase-1 detection (manifests, git
authors, regex on prose) and asks an LLM to reclassify each candidate
as PERSON / PROJECT / TOPIC / COMMON_WORD / AMBIGUOUS.

Scale approach: never feed the raw corpus to the LLM. For each
candidate, collect up to 3 context lines from sampled prose, cap each
at 240 chars, batch 25 candidates per call. Keeps total input around
50-100K tokens even on large corpora and completes in a few minutes
on a 4B local model.
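
The token bound falls out of three constants; a sketch of the arithmetic (constant and function names assumed):

```python
BATCH_SIZE = 25      # candidates per LLM call
CONTEXT_LINES = 3    # sampled prose lines per candidate
LINE_CAP = 240       # max chars per context line


def batches(candidates: list, size: int = BATCH_SIZE):
    """Yield fixed-size slices of the candidate list."""
    for i in range(0, len(candidates), size):
        yield candidates[i : i + size]


# Worst case per call: 25 * 3 * 240 = 18,000 chars, roughly 4-5K tokens,
# independent of how large the underlying corpus is.
```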

Interactive UX:
- Stderr progress bar with the current candidate name, updates
  per-batch.
- Ctrl-C interrupts cleanly: returns a RefineResult with
  `cancelled=True` and whatever was classified before the interrupt.
  The partial result is safe to pass straight to confirm_entities.
- Per-batch errors (transport, parse) are recorded in `errors` and
  don't abort the whole run.
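
A sketch of that control flow (`RefineResult` fields beyond `cancelled` and `errors` are assumptions, and `classify_batch` stands in for the real per-batch call):

```python
from dataclasses import dataclass, field


@dataclass
class RefineResult:
    classifications: dict[str, str] = field(default_factory=dict)
    cancelled: bool = False
    errors: list[str] = field(default_factory=list)


def refine(candidate_batches, classify_batch) -> RefineResult:
    """Classify batch by batch; Ctrl-C returns the partial result instead of raising."""
    result = RefineResult()
    try:
        for batch in candidate_batches:
            try:
                result.classifications.update(classify_batch(batch))
            except Exception as exc:  # per-batch transport/parse failure: record, keep going
                result.errors.append(str(exc))
    except KeyboardInterrupt:  # Ctrl-C between or within batches
        result.cancelled = True
    return result
```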

Refinement scope: only `uncertain` and low-confidence `projects`
entries are sent. Manifest-backed projects (conf >= 0.95) and git-
authored people are already authoritative and skip the LLM.

Response parser is defensive — accepts `label` or `type` keys,
lowercase/uppercase variants, top-level list or wrapped object, and
strips markdown code fences. Unknown labels become AMBIGUOUS so the
user reviews them rather than silently accepting a bad classification.

`collect_corpus_text` provides a simple stratified prose sampler
(recent first, capped per-file) so callers don't need to build their
own corpus window.

28 tests with a FakeProvider (no network). Covers context collection,
prompt building, response parsing variants, classification apply,
end-to-end refine, and Ctrl-C partial-result behavior.
Extends the init orchestrator to consume two new signal sources:

1. Claude Code conversation dirs: when the target is a
   `~/.claude/projects/` root, convo_scanner contributes ProjectInfo
   entries alongside the git/manifest projects. Dedup is by name,
   preferring the entry with more user-authored activity.
2. Optional LLM refinement: when --llm is passed, discover_entities
   constructs the provider, validates availability, and runs
   llm_refine.refine_entities on the merged candidates. Status
   summary (reclassified / dropped / cancelled / batch errors)
   prints to stderr.

New init flags (opt-in, default remains zero-API):
- --llm: enable refinement
- --llm-provider: ollama (default) | openai-compat | anthropic
- --llm-model: default gemma4:e4b for Ollama
- --llm-endpoint: URL (required for openai-compat)
- --llm-api-key: falls back to env ($ANTHROPIC_API_KEY or
  $OPENAI_API_KEY depending on provider)

Provider check_available runs before the scan, so the user sees an
immediate error ("Run: ollama pull <model>" or "ANTHROPIC_API_KEY not
set") rather than a mid-scan failure.
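
For the Ollama case, such a probe can be a single tags lookup (a sketch against the public /api/tags endpoint; the message wording is illustrative):

```python
import json
import urllib.request


def ollama_check(endpoint: str, model: str) -> str | None:
    """Return an actionable error message, or None if the provider is ready."""
    try:
        with urllib.request.urlopen(f"{endpoint}/api/tags", timeout=5) as resp:
            tags = json.loads(resp.read().decode("utf-8"))
    except OSError:
        return "Ollama not reachable. Is the server running on this endpoint?"
    names = {m.get("name", "") for m in tags.get("models", [])}
    if model not in names and f"{model}:latest" not in names:  # accept both tag forms
        return f"Model not found. Run: ollama pull {model}"
    return None
```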
Copilot AI review requested due to automatic review settings April 24, 2026 03:47

Copilot AI (Contributor) left a comment

Pull request overview

Adds an opt-in, provider-pluggable LLM refinement step to improve mempalace init entity classification for prose-heavy corpora, and introduces a deterministic Claude Code conversation scanner to extract project names from ~/.claude/projects/ sessions (using per-session cwd metadata rather than lossy slug decoding).

Changes:

  • Add mempalace init --llm with provider/model/endpoint/api-key flags and provider availability checks.
  • Introduce mempalace/llm_client.py (Ollama, OpenAI-compatible, Anthropic) and mempalace/llm_refine.py (batching, context sampling, robust parsing, Ctrl‑C partial results).
  • Add mempalace/convo_scanner.py and wire it into discover_entities() for .claude/projects/ roots; expand unit tests to cover new modules.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| uv.lock | Adds tomli marker dependency for Python < 3.11. |
| mempalace/cli.py | Wires new --llm* flags into init and constructs/checks the provider before scanning. |
| mempalace/project_scanner.py | Extends discover_entities() to optionally scan Claude projects roots and run LLM refinement. |
| mempalace/convo_scanner.py | New scanner for ~/.claude/projects/ sessions, extracting project names from JSONL cwd. |
| mempalace/llm_client.py | New minimal HTTP-only provider abstraction (ollama/openai-compat/anthropic) + availability probes. |
| mempalace/llm_refine.py | New refinement pass: batching, context collection, response parsing, merge logic, and corpus sampling. |
| tests/test_llm_client.py | Unit tests for provider factory, HTTP wrapper, and provider request/response handling. |
| tests/test_llm_refine.py | Unit tests for prompt/context building, response parsing variants, merging, batching, and Ctrl-C partials. |
| tests/test_convo_scanner.py | Unit tests for Claude projects root detection, cwd extraction, slug fallback, and dedup/ranking. |


Comment on lines +99 to +103
sessions = sorted(
    (p for p in project_dir.iterdir() if p.is_file() and p.suffix == ".jsonl"),
    key=lambda p: p.stat().st_mtime,
    reverse=True,  # newest first — most likely to be well-formed
)

Copilot AI commented Apr 24, 2026

_resolve_project_name() sorts session files by p.stat().st_mtime without handling OSError. If a .jsonl file is unreadable, broken, or permission-restricted, this will raise during sorting and can abort scanning the entire Claude projects tree. Consider wrapping stat() in a safe helper (e.g., defaulting mtime to 0 on error) or filtering out paths that fail stat before sorting.

Two further comment threads on mempalace/llm_refine.py (outdated).
igorls added 2 commits April 24, 2026 01:30
…oritative sources

Addresses issues found while reviewing the initial phase-2 implementation
against real data:

**Bug: uncertain bucket starved from the LLM.**
`discover_entities` was dropping the regex-uncertain bucket whenever real
git/manifest signal existed — which is exactly when `--llm` is most useful
for cleaning up prose noise. The uncertain candidates never reached the
refinement step. Fixed: only drop when `llm_provider is None`.

**Context collection: word boundaries, not substring.**
`_collect_contexts` used substring matching on lower-cased lines, so the
name "Go" matched "good", "going", "forgot". Switched to a
`(?<!\w)…(?!\w)` regex so short names only match at token boundaries.
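
Concretely (the helper name is assumed):

```python
import re


def name_pattern(name: str) -> re.Pattern[str]:
    # Token-boundary match: "Go" hits "Go is fast" but not "good", "going", "forgot".
    return re.compile(rf"(?<!\w){re.escape(name)}(?!\w)", re.IGNORECASE)
```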

**Authoritative-source detection replaces confidence threshold.**
Previously the refinement step skipped entries with `confidence >= 0.95`
to avoid second-guessing manifest-backed projects. That threshold was
fragile — the regex detector produces 0.99 confidence for things like
`code file reference (5x)` on framework names (OpenAPI, etc.), so those
skipped the LLM despite being regex-only noise. New helpers
`_is_authoritative_person` / `_is_authoritative_project` look at the
actual signal strings (commits, package.json, etc.) to decide.
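
The shape of those helpers, as a sketch (the marker strings and the signals-list argument are illustrative; the real module keys off its own signal strings):

```python
AUTHORITATIVE_PROJECT_MARKERS = ("package.json", "manifest")


def is_authoritative_project(signals: list[str]) -> bool:
    """Trust provenance, not score: a regex-only 0.99 still goes to the LLM."""
    return any(marker in s for s in signals for marker in AUTHORITATIVE_PROJECT_MARKERS)


def is_authoritative_person(signals: list[str]) -> bool:
    """Git-authored people carry commit-derived signals and skip the LLM."""
    return any("commit" in s for s in signals)
```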

**Now also refines regex-derived people.**
After #1148's high-pronoun-signal fix, the regex detector can promote
non-people to the `people` bucket (e.g. a capitalized common noun that
happened to appear near pronouns). The LLM now gets a chance to clean
those up, while git-authored people are still skipped.

**Robust JSON extraction.**
Small local models routinely wrap JSON output in prose ("Sure, here's
the classification: {…}"). The previous code-fence stripper failed on
that. `_extract_json_candidates` now does balanced-bracket extraction
with string-aware quote handling, so it recovers JSON from:
- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays
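
A simplified sketch of that extraction (the real `_extract_json_candidates` may differ in detail; callers would typically `json.loads` each span and keep the ones that parse):

```python
def extract_json_candidates(text: str) -> list[str]:
    """Return balanced {...} / [...] spans, ignoring brackets inside JSON strings."""
    spans: list[str] = []
    depth, start = 0, -1
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            if depth == 0:
                start = i
            depth += 1
        elif ch in "}]" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start : i + 1])
    return spans
```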

**Prompt guidance for frameworks vs user projects.**
Added an explicit instruction: frameworks, runtimes, APIs, cloud
services, and third-party vendors (Angular, OpenAPI, Terraform, Bun,
Google, etc.) are TOPIC unless the context clearly says it's the user's
own codebase. Directly addresses a false-positive pattern observed
during dev runs.

**Defensive mtime.**
`convo_scanner._safe_mtime` catches OSError during `stat()` — permission
changes, filesystem races, broken symlinks — and sorts the affected file
to the end of the newest-first order rather than crashing the scan.
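
As a sketch, mirroring the review suggestion:

```python
from pathlib import Path


def _safe_mtime(path: Path) -> float:
    """mtime for newest-first sorting; files that fail stat() sort last, as if oldest."""
    try:
        return path.stat().st_mtime
    except OSError:  # permission changes, filesystem races, broken symlinks
        return 0.0
```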

**Cosmetic:** merged two adjacent f-strings on the same line in
`backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON
extraction variants, authoritative-source helpers, refining high-
confidence regex projects, and end-to-end LLM refinement preserving the
uncertain bucket.
@igorls igorls added the enhancement New feature or request label Apr 24, 2026
@igorls igorls merged commit 47c185e into feat/project-scanner-entity-detection Apr 24, 2026
igorls added a commit that referenced this pull request Apr 24, 2026
…-develop

chore: rescue merged stacked PRs #1150 and #1157 into develop
shrhoads pushed a commit to shrhoads/mempalace that referenced this pull request Apr 24, 2026
…to develop

MemPalace#1148, MemPalace#1150, and MemPalace#1157 were reviewed and merged on GitHub, but the two
stacked children landed on their parent feature branches (now stale)
rather than on develop. Only MemPalace#1148's commits reached develop via the
direct merge. Release PR MemPalace#1159 (develop → main for v3.3.3) is therefore
missing the LLM refinement, Claude-conversation scanner, and miner-
registry wire-up that were ostensibly part of the release.

This merge brings the stale `feat/llm-entity-refine` branch (which
contains the rolled-up merge commit for MemPalace#1157 → MemPalace#1150 → everything
below) into develop so the release tag includes it.

No code changes here — only history recovery.
shrhoads pushed a commit to shrhoads/mempalace that referenced this pull request Apr 24, 2026
Adds entries to the 3.3.3 section for the work that landed via MemPalace#1148,
MemPalace#1150, MemPalace#1157, and MemPalace#1175 (rescued from stacked feature branches into
develop via MemPalace#1175). Without these entries the 3.3.3 release notes on
main would advertise only the hook/diary/search fixes that made it to
develop through the first direct merge.

Covers:
- Manifest + git-author entity detection (MemPalace#1148)
- Regex detector accuracy improvements (MemPalace#1148)
- Optional --llm classification with Ollama / openai-compat / Anthropic
  provider abstraction and interactive UX (MemPalace#1150)
- Claude Code conversation scanner (MemPalace#1150)
- Init → miner registry wire-up so confirmed entities actually reach
  drawer metadata tagging (MemPalace#1157)
- Case-insensitive project dedup across all sources (MemPalace#1175)
- `mempalace mine` skips the generated entities.json artifact