
fix(repair): rebuild_index bails on col.count() — exactly the call HNSW corruption breaks #1308

@potterdigital

Description

Summary

mempalace repair --mode legacy cannot repair palaces whose corruption manifests at Collection.count() — which is the most common HNSW corruption symptom users hit (status, search, and repair all fail with the same error). The repair tool calls col.count() as its first step, so it bails on the exact case it most needs to handle.

Reproducer

Version: mempalace 3.3.4 (pipx, Python 3.12.7), macOS 14 (Darwin 25.3.0), ChromaDB pinned to current.

State: A palace built on 0.6.x and used continuously through several upgrades (52,300 embeddings, two collections: mempalace_drawers + mempalace_closets). All drawer data is present and readable in chroma.sqlite3; only the HNSW index/WAL is corrupt.

$ mempalace status
chromadb.errors.InternalError: Error executing plan: Error sending backfill
request to compactor: Failed to apply logs to the hnsw segment writer

$ mempalace repair
=======================================================
  MemPalace Repair
=======================================================
  Palace: /Users/.../.mempalace/palace
  Error reading palace: Error executing plan: Error sending backfill request
  to compactor: Failed to apply logs to the hnsw segment writer
  Cannot recover — palace may need to be re-mined from source files.

$ mempalace repair --mode max-seq-id --dry-run
  No poisoned max_seq_id rows detected. Nothing to do.

Root cause

mempalace/repair.py::rebuild_index (3.3.4):

backend = ChromaBackend()
try:
    col = backend.get_collection(palace_path, COLLECTION_NAME)
    total = col.count()                          # ← fails here on HNSW corruption
except Exception as e:
    print(f"  Error reading palace: {e}")
    print("  Palace may need to be re-mined from source files.")
    return

col.count() triggers ChromaDB's compactor, which tries to apply queued WAL log entries to the HNSW segment writer, and that is exactly where the corruption lives. The user is therefore told the palace "may need to be re-mined from source files" even though their data is fully intact in chroma.sqlite3. A SQLite query confirms it:

sqlite> SELECT segment_id, COUNT(*) FROM embeddings GROUP BY segment_id;
4ed454e5-...|4101
a95d2236-...|48199

Suggested fix

rebuild_index should bypass col.count() and source IDs/documents/metadata directly from SQLite when the chroma client raises an HNSW error. Sketch:

  1. Try col.count() / col.get() first (current path — fast and works on healthy palaces).
  2. On chromadb.errors.InternalError referencing the HNSW segment, fall through to a SQLite reader that pulls embeddings.embedding_id, embedding_metadata rows, and the document blob directly per collection.
  3. Then proceed with the existing delete_collection + recreate + upsert flow.
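Steps 1–2 could be wired as a small guard at the top of rebuild_index. A minimal sketch under stated assumptions: `sqlite_fallback` is a hypothetical callable standing in for the direct-SQLite reader, and matching on the error message is a stand-in for a proper isinstance check against chromadb.errors.InternalError (avoided here so the sketch has no chromadb import):

```python
def count_or_fallback(col, sqlite_fallback):
    """Try the fast client path first; on an error that looks like the
    HNSW segment failure, fall through to the SQLite reader instead of
    bailing. Any other error still propagates."""
    try:
        return col.count()
    except Exception as e:  # chromadb.errors.InternalError in practice
        if "hnsw" in str(e).lower():
            return sqlite_fallback()
        raise
```

The key point is that only HNSW-shaped failures take the fallback path, so healthy palaces keep the current fast behavior and unrelated errors still surface.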

This would also retire the "may need to be re-mined from source files" advice for any palace where the SQLite layer is intact, which is the common case.

Cross-refs

3.3.4's quarantine logic improves the client open path but doesn't intercept errors raised mid-operation by an already-loaded collection, which is what count()/query() hit. So this gap survives 3.3.4.

Workaround

For users in this state today: extract drawers directly from chroma.sqlite3, then re-upsert into a fresh collection. This is essentially what a fixed rebuild_index would do automatically.
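For reference, a minimal sketch of that extraction step. The table and key names (embeddings, embedding_metadata, chroma:document) reflect ChromaDB's SQLite layout as observed above; verify them against the pinned chromadb version before relying on this:

```python
import sqlite3

def read_drawers_from_sqlite(db_path, segment_id):
    """Pull (embedding_id, document, metadata) tuples straight from
    chroma.sqlite3, bypassing the chroma client and therefore the
    compactor/HNSW path entirely."""
    con = sqlite3.connect(db_path)
    records = []
    for pk, embedding_id in con.execute(
        "SELECT id, embedding_id FROM embeddings WHERE segment_id = ?",
        (segment_id,),
    ).fetchall():
        document, meta = None, {}
        for key, s, i, f in con.execute(
            "SELECT key, string_value, int_value, float_value "
            "FROM embedding_metadata WHERE id = ?",
            (pk,),
        ):
            value = s if s is not None else (i if i is not None else f)
            if key == "chroma:document":  # document text lives in metadata
                document = value
            else:
                meta[key] = value
        records.append((embedding_id, document, meta))
    con.close()
    return records
```

The returned tuples map directly onto the ids/documents/metadatas arguments of an upsert into a fresh collection, which is the same flow a fixed rebuild_index would run automatically.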
