|
2 | 2 |
|
3 | 3 | **JP's production fork of [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace)** |
4 | 4 |
|
5 | | -[](https://github.com/MemPalace/mempalace/releases) |
| 5 | +[](https://github.com/jphein/mempalace/releases) [](https://github.com/MemPalace/mempalace/releases) |
6 | 6 | [](https://www.python.org/) |
7 | 7 | [](LICENSE) |
8 | 8 |
|
@@ -313,27 +313,24 @@ What didn't work: SessionStart pre-loading, auto-memory bridges, PreCompact re-r |
313 | 313 |
|
314 | 314 | P1's cwd-derived wings are relevant here: once wings are derived from unambiguous signals, they become a cheap scoping prior for any automatic surfacing mechanism. "Claude is in `/Projects/mempalace`; query that wing first" is a lot cheaper than training a router. No memory system has solved this well — it's the unsolved problem of the [OSS Insight Agent Memory Race](https://ossinsight.io/blog/agent-memory-race-2026). |
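
A sketch of that scoping prior (illustrative only; `wing_for_cwd` and the collection-per-wing layout are assumptions, not mempalace code):

```python
# Illustrative sketch: derive the wing from the cwd, query it first.
from pathlib import Path

import chromadb


def wing_for_cwd(cwd: str) -> str:
    # "/Users/jp/Projects/mempalace" -> "mempalace": the cwd is an
    # unambiguous signal, so no learned router is needed.
    return Path(cwd).name


def scoped_query(client, text: str, cwd: str):
    # Query the cwd's wing first; a caller could widen to other wings
    # only when this returns nothing useful.
    col = client.get_or_create_collection(wing_for_cwd(cwd))
    return col.query(query_texts=[text], n_results=5)


client = chromadb.PersistentClient(path="./palace")
hits = scoped_query(client, "HNSW corruption notes", cwd="/Users/jp/Projects/mempalace")
```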
315 | 315 |
|
316 | | -### Multi-client coordination — v3.3.4 defense-in-depth landed; palace-daemon integration in progress |
| 316 | +### Multi-client coordination — v3.3.4 fixes landed; palace-daemon deferred pending observation |
317 | 317 |
|
318 | 318 | Several users have hit the "multiple clients hammering one palace" pattern — @worktarget's #904 report, the ChromaDB concurrency family in #357 / #521 / #832, and the multi-machine case (laptop → home server palace). The core problem: Claude Code spawns one `mcp_server.py` per open terminal; stop hooks spawn additional short-lived writers (diary writes, `mempalace mine` subprocesses). All open independent `PersistentClient` instances against the same palace directory. ChromaDB has no inter-process write locking; concurrent `col.add/upsert/update/delete` from N processes corrupts the HNSW segment, causing the next read to SIGSEGV in `chromadb_rust_bindings`. |
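
The hazard reproduces in miniature (a hedged sketch, not repo code; any palace path and collection name work):

```python
# N independent processes, one PersistentClient each, same palace directory:
# ChromaDB takes no inter-process write lock, so concurrent add() calls from
# the workers below are exactly the corruption trigger described above.
import multiprocessing

import chromadb


def writer(palace_path: str, worker_id: int) -> None:
    client = chromadb.PersistentClient(path=palace_path)  # one client per process
    col = client.get_or_create_collection("drawers")
    for i in range(1000):
        col.add(ids=[f"w{worker_id}-{i}"], documents=[f"doc {i}"])


if __name__ == "__main__":
    procs = [
        multiprocessing.Process(target=writer, args=("./palace", w)) for w in range(4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```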
319 | 319 |
|
320 | | -**v3.3.4 defense-in-depth (landed 2026-04-24):** three fixes filed as [#1171](https://github.com/milla-jovovich/mempalace/pull/1171), [#1173](https://github.com/milla-jovovich/mempalace/pull/1173), [#1177](https://github.com/milla-jovovich/mempalace/pull/1177): |
| 320 | +The actual root cause was traced upstream in [#974](https://github.com/MemPalace/mempalace/issues/974) / [#965](https://github.com/MemPalace/mempalace/issues/965): ChromaDB's multi-threaded `ParallelFor` HNSW insert path races in `repairConnectionsForUpdate` / `addPoint`, corrupting the graph even within a single process. Without `hnsw:num_threads: 1` pinned at collection creation, the race produces runaway writes to `link_lists.bin` — observed at 437 GB on this fork's 135K-drawer palace, 1.5 TB on a Nobara install in [#976](https://github.com/milla-jovovich/mempalace/pull/976). |
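
In ChromaDB terms the pin is collection metadata, applied at creation and re-asserted on reopen. A minimal sketch (`_pin_hnsw_threads` is named by the fix below; its body and the exact re-pin mechanism are assumptions):

```python
# Sketch of the root-cause pin; the re-pin mechanism here is an assumption.
import chromadb


def _pin_hnsw_threads(col) -> None:
    # ChromaDB 1.5.x does not persist a modified HNSW config across reopens,
    # so single-threaded inserts are re-asserted on every get_collection().
    col.modify(metadata={**(col.metadata or {}), "hnsw:num_threads": 1})


client = chromadb.PersistentClient(path="./palace")

# Pin at creation so the ParallelFor insert path never runs multi-threaded.
col = client.get_or_create_collection("drawers", metadata={"hnsw:num_threads": 1})
_pin_hnsw_threads(col)  # and again on every subsequent open
```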
321 | 321 |
|
322 | | -1. **Backend-seam flock (#1171):** `_palace_write_lock(palace_path)` wraps `ChromaCollection.add/upsert/update/delete`. RFC 001 made the adapter the single boundary for all ChromaDB writes, so the lock there covers every caller (mcp_server, miner, convo_miner, palace) automatically. First attempt wrapped the four write sites in `mcp_server.py` directly but missed the `mempalace mine` subprocesses the hook spawns; redirected to the adapter layer. `flock` auto-releases on process death so a mid-write crash cannot deadlock future writers. Unix-only — Windows is a no-op. |
323 | | -2. **Quarantine on open (#1173):** `quarantine_stale_hnsw()` now runs inside `ChromaBackend.make_client()` itself (complementary to #1062 which covers server startup). Threshold lowered 3600→300s after a production 0.96h-drift segfault. |
324 | | -3. **Marker guard (#1177):** `.blob_seq_ids_migrated` sentinel file skips `sqlite3.connect()` on already-migrated palaces — opening sqlite against a live ChromaDB 1.5.x WAL database corrupts the next `PersistentClient`. Closes #1090. |
| 322 | +**v3.3.4 fixes (landed 2026-04-24):** four changes — three filed upstream as our PRs, one cherry-picked from @felipetruman's #976: |
325 | 323 |
|
326 | | -**Primary concurrency story: palace-daemon integration (in progress).** The v3.3.4 fixes make direct-access palaces survivable, but the architecturally correct answer is to stop having N clients touch the database in the first place. [palace-daemon](https://github.com/rboarescu/palace-daemon) (@rboarescu) is a FastAPI gateway with three asyncio semaphores (read N concurrent / write N/2 concurrent / mine 1 at a time) where the daemon is the only process that opens the palace; clients connect over HTTP via `mempalace-mcp.py` (a stdlib-only MCP proxy) and `hook.py` (a stdlib-only hook runner). A per-port file lock at `/tmp/palace-daemon-8085.lock` enforces one daemon per host+port; the client is hard-coded to fail if the daemon is unreachable, deliberately eliminating split-brain. |
| 324 | +1. **`hnsw:num_threads: 1` pin (cherry-picked from [#976](https://github.com/milla-jovovich/mempalace/pull/976)):** the actual root-cause fix, sketched above. Disables `ParallelFor` so HNSW inserts serialize within each process. Applied in the collection-creation metadata and re-asserted via `_pin_hnsw_threads()` on every `get_collection`, since ChromaDB 1.5.x doesn't persist the modified config across reopens. Posted [reproduction data](https://github.com/milla-jovovich/mempalace/pull/976#issuecomment-4316741161) on #976; this fork-local cherry-pick becomes a no-op once #976 merges upstream.
| 325 | +2. **Backend-seam flock ([#1171](https://github.com/milla-jovovich/mempalace/pull/1171)):** `_palace_write_lock(palace_path)` wraps `ChromaCollection.add/upsert/update/delete`; see the sketch after this list. RFC 001 made the adapter the single boundary for all ChromaDB writes, so the lock there covers every caller (mcp_server, miner, convo_miner, palace) automatically. It also defends against the race in the window before the num_threads pin takes effect on first open. Unix-only; on Windows the lock is a no-op.
| 326 | +3. **Quarantine on open ([#1173](https://github.com/milla-jovovich/mempalace/pull/1173)):** `quarantine_stale_hnsw()` now runs inside `ChromaBackend.make_client()` itself (complementary to #1062, which covers server startup). Threshold lowered 3600→300s after a 0.96h-drift segfault. Sketched, together with #4, after the summary below.
| 327 | +4. **Marker guard ([#1177](https://github.com/milla-jovovich/mempalace/pull/1177)):** `.blob_seq_ids_migrated` sentinel file skips `sqlite3.connect()` on already-migrated palaces — opening sqlite against a live ChromaDB 1.5.x WAL database corrupts the next `PersistentClient`. Closes #1090. |
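
A sketch of the #1171 seam (`_palace_write_lock` matches the PR; the lock-file name and adapter shape are assumptions):

```python
# Unix-only: fcntl is unavailable on Windows, where this guard is a no-op.
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def _palace_write_lock(palace_path: str):
    # flock auto-releases when the holding process dies, so a mid-write
    # crash cannot deadlock future writers.
    fd = os.open(Path(palace_path) / ".write.lock", os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


class ChromaCollection:
    """Adapter seam: RFC 001's single boundary for all ChromaDB writes."""

    def __init__(self, col, palace_path: str):
        self._col, self._palace_path = col, palace_path

    def add(self, *args, **kwargs):
        # Locking here covers every caller (mcp_server, miner, convo_miner,
        # palace) without touching any call site; upsert/update/delete get
        # the same wrapper.
        with _palace_write_lock(self._palace_path):
            return self._col.add(*args, **kwargs)
```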
327 | 328 |
|
328 | | -palace-daemon pins its correctness floor at MemPalace ≥3.3.2, which aligns with our v3.3.4 reliability stack — the two are compositional. Our flock + quarantine + marker guards continue to matter inside the daemon process (and for anyone running without the daemon), but the daemon's single-process design makes same-machine concurrent-write corruption impossible at the architecture level, and also solves multi-machine access (palace on a home server, clients over LAN) and Windows (where `flock` is a no-op). |
| 329 | +#1 is the *fix*; #2/#3/#4 are defense-in-depth around symptoms (corruption containment, drift recovery, sqlite-state isolation). Together they should eliminate the segfault class for direct-access palaces. |
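
A combined sketch of the #3 and #4 guards (function, sentinel, and segment-file names follow the PRs and the root-cause notes above; the drift heuristic and directory layout are illustrative assumptions):

```python
import sqlite3
from pathlib import Path

STALE_HNSW_THRESHOLD_S = 300  # lowered from 3600 after the 0.96h-drift segfault


def quarantine_stale_hnsw(palace_path: str) -> None:
    # Assumed heuristic: if an HNSW segment has drifted too far from the
    # sqlite catalog, move it aside so ChromaDB rebuilds it instead of
    # segfaulting in chromadb_rust_bindings on the next read.
    catalog = Path(palace_path) / "chroma.sqlite3"
    for seg in Path(palace_path).glob("*/link_lists.bin"):
        drift = seg.stat().st_mtime - catalog.stat().st_mtime
        if drift > STALE_HNSW_THRESHOLD_S:
            seg.rename(seg.with_suffix(".bin.quarantined"))


def migrate_blob_seq_ids(palace_path: str) -> None:
    marker = Path(palace_path) / ".blob_seq_ids_migrated"
    if marker.exists():
        return  # sentinel present: never reopen sqlite against a live WAL db
    conn = sqlite3.connect(Path(palace_path) / "chroma.sqlite3")
    try:
        conn.execute("PRAGMA user_version")  # stand-in for the one-time migration
    finally:
        conn.close()
    marker.touch()
```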
329 | 330 |
|
330 | | -**Current integration shape (JP's fork, local-only):** repo cloned at `~/Projects/palace-daemon`; daemon to run as systemd `--user` service pointing at `~/Projects/mempalace-data/palace/`; Claude Code MCP config rewired from direct `mempalace-mcp` stdio to daemon's `mempalace-mcp.py` HTTP client; `.claude-plugin/hooks/mempal-{stop,precompact}-hook.sh` swapped for `clients/hook.py`. A per-port lock at `/tmp/palace-daemon-8085.lock` enforces one daemon per host+port. Not changing the plugin marketplace default yet — this is JP's personal-install configuration while we validate the swap. If it holds for ~a week, we'll evaluate shipping an opt-in daemon-mode in the marketplace plugin. |
| 331 | +**palace-daemon — deferred pending observation.** [palace-daemon](https://github.com/rboarescu/palace-daemon) (@rboarescu) is a FastAPI gateway with three asyncio semaphores (read N concurrent / write N/2 concurrent / mine 1 at a time) where the daemon is the only process that opens the palace; clients connect over HTTP via `mempalace-mcp.py` (a stdlib-only MCP proxy) and `hook.py` (a stdlib-only hook runner). A per-port file lock at `/tmp/palace-daemon-8085.lock` enforces one daemon per host+port; the client is hard-coded to fail if the daemon is unreachable, deliberately eliminating split-brain. Previously framed here as the "primary concurrency story". With #976's root-cause fix now in our fork, the urgency is materially lower: same-machine concurrent corruption should no longer occur. Repo is cloned at `~/Projects/palace-daemon` and integration is well-scoped if needed (systemd `--user`, swap MCP/hook configs to daemon clients, no plugin marketplace change), but the work is on hold until we observe whether the v3.3.4 stack is genuinely stable in production (~1 week). The daemon remains the right answer for multi-machine access (palace on a home server, clients over LAN) and Windows (where our `flock` is a no-op) — neither is JP's current pain. |
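
The admission-control shape is simple to picture (names and budgets assumed; the real daemon is a FastAPI app, this shows only the three-semaphore idea):

```python
import asyncio

N = 8                                  # illustrative concurrency budget
read_sem = asyncio.Semaphore(N)        # reads: N concurrent
write_sem = asyncio.Semaphore(N // 2)  # writes: N/2 concurrent
mine_sem = asyncio.Semaphore(1)        # mining: strictly one at a time


async def handle_write(payload: dict) -> dict:
    async with write_sem:
        # The daemon is the only process that ever opens the palace, so
        # same-machine multi-writer corruption is ruled out by design.
        return await _palace_write(payload)


async def _palace_write(payload: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for the daemon's real write path
    return {"ok": True, "wrote": payload}


if __name__ == "__main__":
    print(asyncio.run(handle_write({"drawer": "demo"})))
```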
331 | 332 |
|
332 | | -**Postgres + pgvector as a parallel track.** RFC 001's backend seam is merged (#413, #995) and the registry already advertises `mempalace_postgres` as the canonical entry-point example. @skuznetsov's [#665](https://github.com/milla-jovovich/mempalace/pull/665) ships the actual PostgreSQL backend implementation (`pg_sorted_heap` preferred path, `pgvector` fallback); @malakhov-dmitrii's [#1072](https://github.com/milla-jovovich/mempalace/pull/1072) wires `palace._DEFAULT_BACKEND` through the registry so `MEMPALACE_BACKEND=postgres` actually takes effect. When both land, switching is `pip install mempalace-postgres && export MEMPALACE_BACKEND=postgres` — no backend authoring needed on our side. |
333 | | - |
334 | | -Postgres would eliminate the entire ChromaDB 1.5.x failure class (MVCC for concurrent writes, no HNSW drift, no sqlite3.connect corruption, no Rust-binding segfaults). Migrating 135K+ existing drawers off ChromaDB is a real cost but not code we'd write — `export_palace()` + a Postgres importer against the same backend interface covers it. The remaining question is ordering: palace-daemon is deployable today and wraps the current palace; Postgres needs #665 to land (currently `CONFLICTING`, needs rebase after today's develop merges) plus #1072. Starting with palace-daemon gives us a working multi-client story now and doesn't preclude Postgres later — the daemon is storage-agnostic. |
335 | | - |
336 | | -bensig's upcoming TypeScript rewrite (announced in Discord) will pick its own storage layer independent of either path, so "wait on TS" remains an option only if the v3.3.4 defense-in-depth proves fully stable in practice. |
| 333 | +**Postgres + pgvector — long-term option, no immediate move.** RFC 001's backend seam is merged (#413, #995) and the registry already advertises `mempalace_postgres` as the canonical entry-point example. @skuznetsov's [#665](https://github.com/milla-jovovich/mempalace/pull/665) ships the actual PostgreSQL backend implementation (`pg_sorted_heap` preferred path, `pgvector` fallback); @malakhov-dmitrii's [#1072](https://github.com/milla-jovovich/mempalace/pull/1072) wires `palace._DEFAULT_BACKEND` through the registry so `MEMPALACE_BACKEND=postgres` actually takes effect. When both land, switching is `pip install mempalace-postgres && export MEMPALACE_BACKEND=postgres`. Postgres would eliminate the entire ChromaDB 1.5.x failure class natively (MVCC, no HNSW drift, no Rust-binding segfaults), but with the v3.3.4 stack now mitigating that class for direct-access palaces, the migration cost (135K+ drawers off ChromaDB via `export_palace()` + a Postgres importer) isn't justified by current pain. Re-evaluate if the v3.3.4 stack proves unstable, or once bensig's TypeScript rewrite picks its own storage layer. |
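
The switch rides on Python entry points. A sketch of the lookup #1072 wires up (the group name `mempalace.backends` is an assumption for illustration):

```python
import os
from importlib.metadata import entry_points


def resolve_backend():
    name = os.environ.get("MEMPALACE_BACKEND", "chroma")
    # Backend packages such as mempalace-postgres advertise themselves via
    # entry points, so no backend authoring is needed on the consumer side.
    for ep in entry_points(group="mempalace.backends"):
        if ep.name == name:
            return ep.load()  # the backend factory/class
    raise RuntimeError(f"no backend registered under {name!r}")
```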
337 | 334 |
|
338 | 335 | ### Stale auto-loaded docs |
339 | 336 |
|
|