|
2 | 2 |
|
3 | 3 | **JP's production fork of [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace)** |
4 | 4 |
|
5 | | -[](https://github.com/MemPalace/mempalace/releases) |
| 5 | +[](https://github.com/jphein/mempalace/releases) [](https://github.com/MemPalace/mempalace/releases) |
6 | 6 | [](https://www.python.org/) |
7 | 7 | [](LICENSE) |
8 | 8 |
|
@@ -313,27 +313,24 @@ What didn't work: SessionStart pre-loading, auto-memory bridges, PreCompact re-r |
313 | 313 |
|
314 | 314 | P1's cwd-derived wings are relevant here: once wings are derived from unambiguous signals, they become a cheap scoping prior for any automatic surfacing mechanism. "Claude is in `/Projects/mempalace`; query that wing first" is a lot cheaper than training a router. No memory system has solved this well — it's the unsolved problem of the [OSS Insight Agent Memory Race](https://ossinsight.io/blog/agent-memory-race-2026). |
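
A sketch of that scoping prior (illustrative only; `wing_for_cwd` and the collection-per-wing layout are assumptions, not mempalace code):

```python
# Illustrative sketch: derive the wing from the cwd, query it first.
from pathlib import Path

import chromadb


def wing_for_cwd(cwd: str) -> str:
    # "/Users/jp/Projects/mempalace" -> "mempalace": the cwd is an
    # unambiguous signal, so no learned router is needed.
    return Path(cwd).name


def scoped_query(client, text: str, cwd: str):
    # Query the cwd's wing first; a caller could widen to other wings
    # only when this returns nothing useful.
    col = client.get_or_create_collection(wing_for_cwd(cwd))
    return col.query(query_texts=[text], n_results=5)


client = chromadb.PersistentClient(path="./palace")
hits = scoped_query(client, "HNSW corruption notes", cwd="/Users/jp/Projects/mempalace")
```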
315 | 315 |
|
316 | | -### Multi-client coordination — v3.3.4 defense-in-depth landed; palace-daemon integration in progress |
| 316 | +### Multi-client coordination — v3.3.4 fixes landed; palace-daemon deferred pending observation |
317 | 317 |
|
318 | 318 | Several users have hit the "multiple clients hammering one palace" pattern — @worktarget's #904 report, the ChromaDB concurrency family in #357 / #521 / #832, and the multi-machine case (laptop → home server palace). The core problem: Claude Code spawns one `mcp_server.py` per open terminal; stop hooks spawn additional short-lived writers (diary writes, `mempalace mine` subprocesses). All open independent `PersistentClient` instances against the same palace directory. ChromaDB has no inter-process write locking; concurrent `col.add/upsert/update/delete` from N processes corrupts the HNSW segment, causing the next read to SIGSEGV in `chromadb_rust_bindings`. |
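
The hazard reproduces in miniature (a hedged sketch, not repo code; any palace path and collection name work):

```python
# N independent processes, one PersistentClient each, same palace directory:
# ChromaDB takes no inter-process write lock, so concurrent add() calls from
# the workers below are exactly the corruption trigger described above.
import multiprocessing

import chromadb


def writer(palace_path: str, worker_id: int) -> None:
    client = chromadb.PersistentClient(path=palace_path)  # one client per process
    col = client.get_or_create_collection("drawers")
    for i in range(1000):
        col.add(ids=[f"w{worker_id}-{i}"], documents=[f"doc {i}"])


if __name__ == "__main__":
    procs = [
        multiprocessing.Process(target=writer, args=("./palace", w)) for w in range(4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```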
319 | 319 |
|
320 | | -**v3.3.4 defense-in-depth (landed 2026-04-24):** three fixes filed as [#1171](https://github.com/milla-jovovich/mempalace/pull/1171), [#1173](https://github.com/milla-jovovich/mempalace/pull/1173), [#1177](https://github.com/milla-jovovich/mempalace/pull/1177): |
| 320 | +The actual root cause was traced upstream in [#974](https://github.com/MemPalace/mempalace/issues/974) / [#965](https://github.com/MemPalace/mempalace/issues/965): ChromaDB's multi-threaded `ParallelFor` HNSW insert path races in `repairConnectionsForUpdate` / `addPoint`, corrupting the graph even within a single process. Without `hnsw:num_threads: 1` pinned at collection creation, the race produces runaway writes to `link_lists.bin` — observed at 437 GB on this fork's 135K-drawer palace, 1.5 TB on a Nobara install in [#976](https://github.com/milla-jovovich/mempalace/pull/976). |
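
In ChromaDB terms the pin is collection metadata, applied at creation and re-asserted on reopen. A minimal sketch (`_pin_hnsw_threads` is named by the fix below; its body and the exact re-pin mechanism are assumptions):

```python
# Sketch of the root-cause pin; the re-pin mechanism here is an assumption.
import chromadb


def _pin_hnsw_threads(col) -> None:
    # ChromaDB 1.5.x does not persist a modified HNSW config across reopens,
    # so single-threaded inserts are re-asserted on every get_collection().
    col.modify(metadata={**(col.metadata or {}), "hnsw:num_threads": 1})


client = chromadb.PersistentClient(path="./palace")

# Pin at creation so the ParallelFor insert path never runs multi-threaded.
col = client.get_or_create_collection("drawers", metadata={"hnsw:num_threads": 1})
_pin_hnsw_threads(col)  # and again on every subsequent open
```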
321 | 321 |
|
322 | | -1. **Backend-seam flock (#1171):** `_palace_write_lock(palace_path)` wraps `ChromaCollection.add/upsert/update/delete`. RFC 001 made the adapter the single boundary for all ChromaDB writes, so the lock there covers every caller (mcp_server, miner, convo_miner, palace) automatically. First attempt wrapped the four write sites in `mcp_server.py` directly but missed the `mempalace mine` subprocesses the hook spawns; redirected to the adapter layer. `flock` auto-releases on process death so a mid-write crash cannot deadlock future writers. Unix-only — Windows is a no-op. |
323 | | -2. **Quarantine on open (#1173):** `quarantine_stale_hnsw()` now runs inside `ChromaBackend.make_client()` itself (complementary to #1062 which covers server startup). Threshold lowered 3600→300s after a production 0.96h-drift segfault. |
324 | | -3. **Marker guard (#1177):** `.blob_seq_ids_migrated` sentinel file skips `sqlite3.connect()` on already-migrated palaces — opening sqlite against a live ChromaDB 1.5.x WAL database corrupts the next `PersistentClient`. Closes #1090. |
| 322 | +**v3.3.4 fixes (landed 2026-04-24):** four changes — three filed upstream as our PRs, one cherry-picked from @felipetruman's #976: |
325 | 323 |
|
326 | | -**Primary concurrency story: palace-daemon integration (in progress).** The v3.3.4 fixes make direct-access palaces survivable, but the architecturally correct answer is to stop having N clients touch the database in the first place. [palace-daemon](https://github.com/rboarescu/palace-daemon) (@rboarescu) is a FastAPI gateway with three asyncio semaphores (read N concurrent / write N/2 concurrent / mine 1 at a time) where the daemon is the only process that opens the palace; clients connect over HTTP via `mempalace-mcp.py` (a stdlib-only MCP proxy) and `hook.py` (a stdlib-only hook runner). A per-port file lock at `/tmp/palace-daemon-8085.lock` enforces one daemon per host+port; the client is hard-coded to fail if the daemon is unreachable, deliberately eliminating split-brain. |
| 324 | +1. **`hnsw:num_threads: 1` pin (cherry-picked from [#976](https://github.com/milla-jovovich/mempalace/pull/976)):** the actual root-cause fix, sketched above. Disables `ParallelFor` so HNSW inserts serialize within each process. Applied in the collection-creation metadata and re-asserted via `_pin_hnsw_threads()` on every `get_collection`, since ChromaDB 1.5.x doesn't persist the modified config across reopens. Posted [reproduction data](https://github.com/milla-jovovich/mempalace/pull/976#issuecomment-4316741161) on #976; this fork-local cherry-pick becomes a no-op once #976 merges upstream.
| 325 | +2. **Backend-seam flock ([#1171](https://github.com/milla-jovovich/mempalace/pull/1171)):** `_palace_write_lock(palace_path)` wraps `ChromaCollection.add/upsert/update/delete`; see the sketch after this list. RFC 001 made the adapter the single boundary for all ChromaDB writes, so the lock there covers every caller (mcp_server, miner, convo_miner, palace) automatically. It also defends against the race in the window before the num_threads pin takes effect on first open. Unix-only; on Windows the lock is a no-op.
| 326 | +3. **Quarantine on open ([#1173](https://github.com/milla-jovovich/mempalace/pull/1173)):** `quarantine_stale_hnsw()` now runs inside `ChromaBackend.make_client()` itself (complementary to #1062, which covers server startup). Threshold lowered 3600→300s after a 0.96h-drift segfault. Sketched, together with #4, after the summary below.
| 327 | +4. **Marker guard ([#1177](https://github.com/milla-jovovich/mempalace/pull/1177)):** `.blob_seq_ids_migrated` sentinel file skips `sqlite3.connect()` on already-migrated palaces — opening sqlite against a live ChromaDB 1.5.x WAL database corrupts the next `PersistentClient`. Closes #1090. |
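
A sketch of the #1171 seam (`_palace_write_lock` matches the PR; the lock-file name and adapter shape are assumptions):

```python
# Unix-only: fcntl is unavailable on Windows, where this guard is a no-op.
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def _palace_write_lock(palace_path: str):
    # flock auto-releases when the holding process dies, so a mid-write
    # crash cannot deadlock future writers.
    fd = os.open(Path(palace_path) / ".write.lock", os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


class ChromaCollection:
    """Adapter seam: RFC 001's single boundary for all ChromaDB writes."""

    def __init__(self, col, palace_path: str):
        self._col, self._palace_path = col, palace_path

    def add(self, *args, **kwargs):
        # Locking here covers every caller (mcp_server, miner, convo_miner,
        # palace) without touching any call site; upsert/update/delete get
        # the same wrapper.
        with _palace_write_lock(self._palace_path):
            return self._col.add(*args, **kwargs)
```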
327 | 328 |
|
328 | | -palace-daemon pins its correctness floor at MemPalace ≥3.3.2, which aligns with our v3.3.4 reliability stack — the two are compositional. Our flock + quarantine + marker guards continue to matter inside the daemon process (and for anyone running without the daemon), but the daemon's single-process design makes same-machine concurrent-write corruption impossible at the architecture level, and also solves multi-machine access (palace on a home server, clients over LAN) and Windows (where `flock` is a no-op). |
| 329 | +#1 is the *fix*; #2/#3/#4 are defense-in-depth around symptoms (corruption containment, drift recovery, sqlite-state isolation). Together they should eliminate the segfault class for direct-access palaces. |
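
A combined sketch of the #3 and #4 guards (function, sentinel, and segment-file names follow the PRs and the root-cause notes above; the drift heuristic and directory layout are illustrative assumptions):

```python
import sqlite3
from pathlib import Path

STALE_HNSW_THRESHOLD_S = 300  # lowered from 3600 after the 0.96h-drift segfault


def quarantine_stale_hnsw(palace_path: str) -> None:
    # Assumed heuristic: if an HNSW segment has drifted too far from the
    # sqlite catalog, move it aside so ChromaDB rebuilds it instead of
    # segfaulting in chromadb_rust_bindings on the next read.
    catalog = Path(palace_path) / "chroma.sqlite3"
    for seg in Path(palace_path).glob("*/link_lists.bin"):
        drift = seg.stat().st_mtime - catalog.stat().st_mtime
        if drift > STALE_HNSW_THRESHOLD_S:
            seg.rename(seg.with_suffix(".bin.quarantined"))


def migrate_blob_seq_ids(palace_path: str) -> None:
    marker = Path(palace_path) / ".blob_seq_ids_migrated"
    if marker.exists():
        return  # sentinel present: never reopen sqlite against a live WAL db
    conn = sqlite3.connect(Path(palace_path) / "chroma.sqlite3")
    try:
        conn.execute("PRAGMA user_version")  # stand-in for the one-time migration
    finally:
        conn.close()
    marker.touch()
```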
329 | 330 |
|
330 | | -**Current integration shape (JP's fork, local-only):** repo cloned at `~/Projects/palace-daemon`; daemon to run as systemd `--user` service pointing at `~/Projects/mempalace-data/palace/`; Claude Code MCP config rewired from direct `mempalace-mcp` stdio to daemon's `mempalace-mcp.py` HTTP client; `.claude-plugin/hooks/mempal-{stop,precompact}-hook.sh` swapped for `clients/hook.py`. A per-port lock at `/tmp/palace-daemon-8085.lock` enforces one daemon per host+port. Not changing the plugin marketplace default yet — this is JP's personal-install configuration while we validate the swap. If it holds for ~a week, we'll evaluate shipping an opt-in daemon-mode in the marketplace plugin. |
| 331 | +**palace-daemon — deferred pending observation.** [palace-daemon](https://github.com/rboarescu/palace-daemon) (@rboarescu) is a FastAPI gateway with three asyncio semaphores (read N concurrent / write N/2 concurrent / mine 1 at a time) where the daemon is the only process that opens the palace; clients connect over HTTP via `mempalace-mcp.py` (a stdlib-only MCP proxy) and `hook.py` (a stdlib-only hook runner). A per-port file lock at `/tmp/palace-daemon-8085.lock` enforces one daemon per host+port; the client is hard-coded to fail if the daemon is unreachable, deliberately eliminating split-brain. Previously framed here as the "primary concurrency story". With #976's root-cause fix now in our fork, the urgency is materially lower: same-machine concurrent corruption should no longer occur. Repo is cloned at `~/Projects/palace-daemon` and integration is well-scoped if needed (systemd `--user`, swap MCP/hook configs to daemon clients, no plugin marketplace change), but the work is on hold until we observe whether the v3.3.4 stack is genuinely stable in production (~1 week). The daemon remains the right answer for multi-machine access (palace on a home server, clients over LAN) and Windows (where our `flock` is a no-op) — neither is JP's current pain. |
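
The admission-control shape is simple to picture (names and budgets assumed; the real daemon is a FastAPI app, this shows only the three-semaphore idea):

```python
import asyncio

N = 8                                  # illustrative concurrency budget
read_sem = asyncio.Semaphore(N)        # reads: N concurrent
write_sem = asyncio.Semaphore(N // 2)  # writes: N/2 concurrent
mine_sem = asyncio.Semaphore(1)        # mining: strictly one at a time


async def handle_write(payload: dict) -> dict:
    async with write_sem:
        # The daemon is the only process that ever opens the palace, so
        # same-machine multi-writer corruption is ruled out by design.
        return await _palace_write(payload)


async def _palace_write(payload: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for the daemon's real write path
    return {"ok": True, "wrote": payload}


if __name__ == "__main__":
    print(asyncio.run(handle_write({"drawer": "demo"})))
```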
331 | 332 |
|
332 | | -**Postgres + pgvector as a parallel track.** RFC 001's backend seam is merged (#413, #995) and the registry already advertises `mempalace_postgres` as the canonical entry-point example. @skuznetsov's [#665](https://github.com/milla-jovovich/mempalace/pull/665) ships the actual PostgreSQL backend implementation (`pg_sorted_heap` preferred path, `pgvector` fallback); @malakhov-dmitrii's [#1072](https://github.com/milla-jovovich/mempalace/pull/1072) wires `palace._DEFAULT_BACKEND` through the registry so `MEMPALACE_BACKEND=postgres` actually takes effect. When both land, switching is `pip install mempalace-postgres && export MEMPALACE_BACKEND=postgres` — no backend authoring needed on our side. |
333 | | - |
334 | | -Postgres would eliminate the entire ChromaDB 1.5.x failure class (MVCC for concurrent writes, no HNSW drift, no sqlite3.connect corruption, no Rust-binding segfaults). Migrating 135K+ existing drawers off ChromaDB is a real cost but not code we'd write — `export_palace()` + a Postgres importer against the same backend interface covers it. The remaining question is ordering: palace-daemon is deployable today and wraps the current palace; Postgres needs #665 to land (currently `CONFLICTING`, needs rebase after today's develop merges) plus #1072. Starting with palace-daemon gives us a working multi-client story now and doesn't preclude Postgres later — the daemon is storage-agnostic. |
335 | | - |
336 | | -bensig's upcoming TypeScript rewrite (announced in Discord) will pick its own storage layer independent of either path, so "wait on TS" remains an option only if the v3.3.4 defense-in-depth proves fully stable in practice. |
| 333 | +**Postgres + pgvector — long-term option, no immediate move.** RFC 001's backend seam is merged (#413, #995) and the registry already advertises `mempalace_postgres` as the canonical entry-point example. @skuznetsov's [#665](https://github.com/milla-jovovich/mempalace/pull/665) ships the actual PostgreSQL backend implementation (`pg_sorted_heap` preferred path, `pgvector` fallback); @malakhov-dmitrii's [#1072](https://github.com/milla-jovovich/mempalace/pull/1072) wires `palace._DEFAULT_BACKEND` through the registry so `MEMPALACE_BACKEND=postgres` actually takes effect. When both land, switching is `pip install mempalace-postgres && export MEMPALACE_BACKEND=postgres`. Postgres would eliminate the entire ChromaDB 1.5.x failure class natively (MVCC, no HNSW drift, no Rust-binding segfaults), but with the v3.3.4 stack now mitigating that class for direct-access palaces, the migration cost (135K+ drawers off ChromaDB via `export_palace()` + a Postgres importer) isn't justified by current pain. Re-evaluate if the v3.3.4 stack proves unstable, or once bensig's TypeScript rewrite picks its own storage layer. |
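
The switch rides on Python entry points. A sketch of the lookup #1072 wires up (the group name `mempalace.backends` is an assumption for illustration):

```python
import os
from importlib.metadata import entry_points


def resolve_backend():
    name = os.environ.get("MEMPALACE_BACKEND", "chroma")
    # Backend packages such as mempalace-postgres advertise themselves via
    # entry points, so no backend authoring is needed on the consumer side.
    for ep in entry_points(group="mempalace.backends"):
        if ep.name == name:
            return ep.load()  # the backend factory/class
    raise RuntimeError(f"no backend registered under {name!r}")
```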
337 | 334 |
|
338 | 335 | ### Stale auto-loaded docs |
339 | 336 |
|
|