Summary
mempalace repair --mode legacy cannot repair palaces whose corruption manifests at Collection.count() — which is the most common HNSW corruption symptom users hit (status, search, and repair all fail with the same error). The repair tool calls col.count() as its first step, so it bails on the exact case it most needs to handle.
Reproducer
Version: mempalace 3.3.4 (pipx, Python 3.12.7), macOS 14 (Darwin 25.3.0), ChromaDB pinned to the current release.
State: A palace built on 0.6.x and used continuously through several upgrades (52,300 embeddings, two collections: mempalace_drawers + mempalace_closets). All drawer data is present and readable in chroma.sqlite3; only the HNSW index/WAL is corrupt.
$ mempalace status
chromadb.errors.InternalError: Error executing plan: Error sending backfill
request to compactor: Failed to apply logs to the hnsw segment writer
$ mempalace repair
=======================================================
MemPalace Repair
=======================================================
Palace: /Users/.../.mempalace/palace
Error reading palace: Error executing plan: Error sending backfill request
to compactor: Failed to apply logs to the hnsw segment writer
Cannot recover — palace may need to be re-mined from source files.
$ mempalace repair --mode max-seq-id --dry-run
No poisoned max_seq_id rows detected. Nothing to do.
Root cause
mempalace/repair.py::rebuild_index (3.3.4):
backend = ChromaBackend()
try:
    col = backend.get_collection(palace_path, COLLECTION_NAME)
    total = col.count()  # ← fails here on HNSW corruption
except Exception as e:
    print(f" Error reading palace: {e}")
    print(" Palace may need to be re-mined from source files.")
    return
col.count() triggers ChromaDB's compactor, which tries to apply queued WAL log entries to the HNSW segment writer — and that's where the corruption lives. So:
- status (miner.py::status → col.count()) — fails
- search — fails (PR #1081, "fix(search): hint at mempalace repair when filtered query hits HNSW index mismatch (#1035)", adds a hint but doesn't recover)
- repair --mode legacy — fails before extraction
- repair --mode max-seq-id — succeeds but doesn't apply (different corruption class)
The user is left with "may need to be re-mined from source files" despite their data being fully intact in chroma.sqlite3. SQLite query confirms it:
sqlite> SELECT segment_id, COUNT(*) FROM embeddings GROUP BY segment_id;
4ed454e5-...|4101
a95d2236-...|48199
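The document text itself is readable as well, per the intact-data claim above. Assuming ChromaDB's default schema, which stores document text in embedding_metadata under the key chroma:document, a spot check along these lines confirms it:
sqlite> SELECT string_value FROM embedding_metadata WHERE key = 'chroma:document' LIMIT 1;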
Suggested fix
rebuild_index should bypass col.count() and source IDs/documents/metadata directly from SQLite when the chroma client raises an HNSW error. Sketch:
- Try col.count() / col.get() first (current path — fast and works on healthy palaces).
- On chromadb.errors.InternalError referencing the HNSW segment, fall through to a SQLite reader that pulls embeddings.embedding_id, embedding_metadata rows, and the document blob directly per collection.
- Then proceed with the existing delete_collection + recreate + upsert flow (a code sketch follows this list).
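A minimal sketch of that fallback. The _rows_from_sqlite helper is hypothetical; ChromaBackend and COLLECTION_NAME come from the repair.py snippet above (the other backend method signatures are guesses); and the SQL assumes ChromaDB's default sqlite schema, with document text stored in embedding_metadata under the key chroma:document and, for brevity, string-valued metadata only:

import sqlite3
from pathlib import Path

from chromadb.errors import InternalError


def _rows_from_sqlite(palace_path, collection_name):
    """Hypothetical helper: read (id, document, metadata) triples for one
    collection straight out of chroma.sqlite3, bypassing the chroma client."""
    con = sqlite3.connect(Path(palace_path) / "chroma.sqlite3")
    try:
        docs, metas = {}, {}
        query = """
            SELECT e.embedding_id, m.key, m.string_value
            FROM embeddings e
            JOIN embedding_metadata m ON m.id = e.id
            JOIN segments s ON s.id = e.segment_id
            JOIN collections c ON c.id = s.collection
            WHERE c.name = ?
        """
        for eid, key, sval in con.execute(query, (collection_name,)):
            if key == "chroma:document":
                docs[eid] = sval
            elif sval is not None:
                metas.setdefault(eid, {})[key] = sval
        return [(eid, doc, metas.get(eid)) for eid, doc in docs.items()]
    finally:
        con.close()


def rebuild_index(palace_path):
    backend = ChromaBackend()
    try:
        col = backend.get_collection(palace_path, COLLECTION_NAME)
        rows = col.get()  # current fast path; raises on corrupt HNSW
        triples = list(zip(rows["ids"], rows["documents"], rows["metadatas"]))
    except InternalError as e:
        if "hnsw" not in str(e).lower():
            raise  # some other internal error: don't mask it
        # HNSW/WAL is corrupt but the SQLite layer is intact: read it directly.
        triples = _rows_from_sqlite(palace_path, COLLECTION_NAME)
    backend.delete_collection(palace_path, COLLECTION_NAME)
    col = backend.create_collection(palace_path, COLLECTION_NAME)
    # No embedding vectors are passed, so Chroma re-embeds the documents
    # with the collection's embedding function during upsert.
    col.upsert(
        ids=[t[0] for t in triples],
        documents=[t[1] for t in triples],
        metadatas=[t[2] for t in triples],
    )

A real implementation would also read the int_value/float_value/bool_value metadata columns, and may need to drop None metadatas before upsert depending on the Chroma version's validation.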
This would also retire the "may need to be re-mined from source files" advice for any palace where the SQLite layer is intact, which is the common case.
Cross-refs
- #1035 — "mempalace repair after #1010 — filtered query(where=...) fails with 'Error finding id' until HNSW index is rebuilt" (the original 0.6.x filtered-query failure that prompted PR #1081)
- #1081 — "fix(search): hint at mempalace repair when filtered query hits HNSW index mismatch (#1035)" (search-path repair hint — open)
- #1108 — quarantine_stale_hnsw in the backend open path (follow-up to #1000); closed via #1173, "fix: call quarantine_stale_hnsw() in make_client(); lower threshold to 5min", which quarantines on client open
3.3.4's quarantine logic improves the client open path but doesn't intercept errors raised mid-operation by an already-loaded collection, which is what count()/query() hit. So this gap survives 3.3.4.
Workaround
For users in this state today: extract drawers directly from chroma.sqlite3, then re-upsert into a fresh collection. This is essentially what a fixed rebuild_index would do automatically.
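A hedged sketch of that manual path, reusing the hypothetical _rows_from_sqlite helper from the fix sketch above; the rebuilt-directory path is illustrative:

import os

import chromadb

palace = os.path.expanduser("~/.mempalace/palace")
triples = _rows_from_sqlite(palace, "mempalace_drawers")

# Upsert into a brand-new directory so the corrupt HNSW/WAL files are left
# behind entirely. If mempalace doesn't use Chroma's default embedding
# function, pass its embedding function to get_or_create_collection.
client = chromadb.PersistentClient(
    path=os.path.expanduser("~/.mempalace/palace-rebuilt")
)
col = client.get_or_create_collection("mempalace_drawers")
col.upsert(
    ids=[t[0] for t in triples],
    documents=[t[1] for t in triples],
    metadatas=[t[2] for t in triples],
)
# Repeat for mempalace_closets, then swap the rebuilt directory into place.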