feat(repair): --mode reconcile to re-embed SQL-only orphan rows #1276

Open
sha2fiddy wants to merge 5 commits into MemPalace:develop from sha2fiddy:feat/repair-mode-reconcile

@sha2fiddy
Contributor

What

mempalace repair --mode reconcile --segment <uuid> finds drawers that have rows in embedding_metadata but no HNSW label, re-embeds them, rebuilds the segment with old + new vectors, and atomic-swaps it in.

Hit on my own 533k-drawer palace on 2026-04-29: a single mining run on 04-28 left 10,404 SQL-only orphans. chromadb crashed mid-flush; the SQL rows committed transactionally, but the HNSW additions were lost. PR #1271's pickle-counts gate doesn't catch this because the pickle was internally consistent for what actually landed.

Tradeoffs

Tests

6 new in test_repair.py: auto-detect, dry-run, happy-path append, no-orphans short-circuit, off-by-one pickle abort, missing-METADATA abort. 54/54 in test_repair.py, 1349/1349 full suite, ruff clean, no new deps.

Verified against my live 533,342-drawer palace. Strict zero-diff after the run, and mempalace search returns reconciled drawers as top hits.

@sha2fiddy sha2fiddy marked this pull request as ready for review April 30, 2026 13:19
Targeted HNSW segment rebuild for palaces corrupted by the chromadb
migrate/repair-rebuild resize-drift bug (MemPalace#1046). Rebuilds a single
segment in place from data_level0.bin without re-embedding, without
touching the rest of the palace, and without invoking the buggy
chromadb rebuild path that produced the original corruption.

New CLI:
  mempalace repair --mode hnsw --segment <uuid>
                   [--max-elements N] [--backup/--no-backup]
                   [--purge-queue] [--quarantine-orphans] [--dry-run]

Legacy repair path (--mode legacy, default) unchanged except for a
one-line offset fix matching the shape already fixed at repair.py:252.

Implementation:
- Reads vectors directly from data_level0.bin via struct parsing
- Reconciles with index_metadata.pickle to drop stale IDs
- Rebuilds via chroma-hnswlib in persistent mode
- Self-query verifies top-1 match on all vectors before swap
- Atomic swap via os.replace with optional timestamped backup
- Optional: purge stuck embeddings_queue rows, quarantine orphan
  metadata rows to a JSON sidecar
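
The swap step above could look roughly like the following. This is a minimal sketch, not the code in repair.py: the function name `swap_segment`, the `.bak-<timestamp>` naming, and the per-file replacement strategy are all assumptions made for illustration. `os.replace` is atomic per file when source and destination are on the same filesystem; the sketch does not claim atomicity across the whole directory.

```python
import os
import shutil
import time
from pathlib import Path
from typing import Optional


def swap_segment(rebuilt_dir: Path, live_dir: Path, backup: bool = True) -> Optional[Path]:
    """Move rebuilt files over the live segment (hypothetical helper).

    Returns the backup directory path, or None when backup is disabled.
    """
    backup_dir = None
    if backup:
        # Timestamped copy of the live segment, taken before anything is replaced.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup_dir = live_dir.with_name(f"{live_dir.name}.bak-{stamp}")
        shutil.copytree(live_dir, backup_dir)

    # Replace each file individually; os.replace overwrites the destination.
    for new_file in sorted(rebuilt_dir.iterdir()):
        os.replace(new_file, live_dir / new_file.name)
    return backup_dir
```

Copying the backup before the first `os.replace` means a failure partway through the loop can still be rolled back from the backup directory.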

Dependencies: numpy and chroma-hnswlib imported lazily per
CONTRIBUTING.md. Missing deps print install hint and abort
gracefully — no changes to pyproject.toml.
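
For illustration, a lazy import with an install hint might be sketched as below. The helper name `require` and the wording of the hint are hypothetical; only the behaviour (import at call time, print a hint, abort without raising) comes from the description above.

```python
import importlib
from types import ModuleType
from typing import Optional


def require(module_name: str, pip_name: str) -> Optional[ModuleType]:
    """Import a module only when it is needed (hypothetical helper).

    Returns the module, or None after printing an install hint.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        print(f"repair needs '{pip_name}'. Install it with: pip install {pip_name}")
        return None
```

A caller would check the return value and exit early, e.g. `np = require("numpy", "numpy")` followed by `if np is None: return`.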

Tests: 20 new in test_repair.py (synthetic segment fixture covers
header parse, vector extract, dedup, space detect, rebuild, verify,
swap, rollback, queue purge, orphan quarantine, dry-run). 5 new in
test_cli.py for argparse wiring and dispatch. Full non-benchmark
suite: 1092/1092. Ruff clean. Coverage on repair.py: 90%.

End-to-end validation on synthetic palace: 10 MB bloated
link_lists.bin rebuilt to 16 bytes in <0.1s, self-query top-1 match
on all vectors, queue purged, orphans quarantined, backup written.
…index_metadata.pickle

ChromaDB 0.6.x wrote index_metadata.pickle as an attribute-style object
(meta.label_to_id); chromadb 1.5.x writes it as a dict (meta["label_to_id"]).
A palace whose pickle was originally created on the older version raises
AttributeError in _reconcile_with_pickle and at the end of rebuild_hnsw_segment.

Adds two small helpers _meta_get / _meta_set that read and write either shape
transparently, and routes the four field accesses (label_to_id, id_to_label,
id_to_seq_id, total_elements_added) through them.

repair --mode hnsw now works on palaces that have lived through both chromadb
versions, not just freshly-created 1.5.x ones.
Two false-positive failure modes on healthy rebuilt indexes:

1. Mined corpora regularly contain byte-identical duplicate vectors
   (e.g. the same code snippet pasted across multiple transcripts). On a
   correctly rebuilt index, a duplicate can legitimately rank #1 instead
   of the original, so verify fails on a rebuild that's actually fine.

2. hnswlib's default ef (~10) is too tight a search beam for ~500k-element
   indexes with M=16; even a byte-identical self-match can be missed because
   its neighborhood in the HNSW graph is sparse.

Loosens the assertion to "self appears in top-k=10" and bumps
index.set_ef(max(200, k*4)) before querying. Both narrow false positives;
neither weakens detection of real corruption (a truly broken index still
misses its own labels everywhere, not just outside top-1). ChromaDB sets
its own ef at query time, so this only affects the verify step.
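
The loosened check can be sketched independently of hnswlib. In the sketch below, `query_top_k` is a hypothetical callable standing in for a k-NN query against the rebuilt index, and both function names are illustrative; only the `max(200, k*4)` beam width and the "self appears in top-k" rule come from the commit message.

```python
from typing import Callable, Iterable, List, Sequence


def verify_ef(k: int) -> int:
    """Search beam width to set on the index before the verify queries."""
    return max(200, k * 4)


def labels_missing_from_top_k(
    labels: Iterable[int],
    query_top_k: Callable[[int, int], Sequence[int]],
    k: int = 10,
) -> List[int]:
    """Return labels whose own vector is absent from their top-k neighbours.

    query_top_k(label, k) is assumed to query the index with the vector
    stored under `label` and return up to k neighbour labels, nearest first.
    """
    missed = []
    for label in labels:
        if label not in query_top_k(label, k)[:k]:
            missed.append(label)
    return missed
```

An exact duplicate that outranks the original still leaves the original inside the top-k list, so it no longer counts as a failure, while a label that never comes back is still reported.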

Found running --mode hnsw against a real ~500k-drawer palace.
Some chromadb crash modes (chroma-core/chroma#6979) commit
embeddings/embedding_metadata rows transactionally to SQL but lose the
corresponding HNSW additions, leaving drawers visible to metadata queries
but unreachable to vector search. Hit on a real 533k-drawer palace
2026-04-29: a single 04-28 mining run produced 10,404 such SQL-only
orphans. PR MemPalace#1271's pickle-counts gate would not have caught this — the
pickle was internally consistent for what landed; the failure mode is a
partial-flush ingest, not pickle corruption.

`mempalace repair --mode reconcile --segment <uuid>` finds those orphans,
embeds their `chroma:document` payloads with the palace's configured EF,
extracts existing vectors from `data_level0.bin` (no re-embed of healthy
rows), builds a fresh persistent HNSW index containing both, and atomic-
swaps it in. Same safety profile as `--mode hnsw`: single-threaded writes,
pickle persisted exactly once at end, rollback on any failure.
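
Conceptually, orphan detection is a set difference between the embedding IDs present in SQL and the IDs the HNSW pickle maps to labels. The sketch below assumes both sides have already been loaded into plain Python collections; the function names are hypothetical and the real implementation reads them from the palace's SQLite file and `index_metadata.pickle`.

```python
from typing import Dict, Iterable, List


def find_sql_only_orphans(sql_ids: Iterable[str], id_to_label: Dict[str, int]) -> List[str]:
    """IDs that have SQL rows but no HNSW label (illustrative helper)."""
    indexed = set(id_to_label)
    return sorted(i for i in set(sql_ids) if i not in indexed)


def is_strict_zero_diff(sql_ids: Iterable[str], id_to_label: Dict[str, int]) -> bool:
    """True when SQL and the index cover exactly the same set of IDs."""
    return set(sql_ids) == set(id_to_label)
```

After a successful reconcile, the first function should return an empty list and the second should return True, which matches the "strict zero-diff" check reported for the live palace.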

Verified end-to-end against a live 533,342-drawer palace: 533,342 ==
533,342 strict zero-diff post-run, self-query passes, search returns
reconciled drawers as top hits.

Tests: 6 new in tests/test_repair.py — METADATA-sibling auto-detection,
dry-run no-mutation, happy-path append + pickle update, no-orphans
short-circuit, off-by-one pickle abort, missing-METADATA abort. 54/54
in test_repair.py and 1349/1349 in the full suite pass. ruff clean. No
new dependencies.
@sha2fiddy sha2fiddy force-pushed the feat/repair-mode-reconcile branch from f1e16ea to ef6d21e Compare April 30, 2026 14:09
@sha2fiddy sha2fiddy marked this pull request as draft April 30, 2026 14:09
@sha2fiddy
Contributor Author

Rebased on upstream/develop @fdfaf01, conflict resolved in mempalace/cli.py (argparse merge of --mode reconcile + --metadata-segment alongside upstream --mode max-seq-id + --from-sidecar). One ruff-format reflow folded into top commit. 120/120 targeted tests pass, ruff clean. Marked draft until CI runs green.

@sha2fiddy sha2fiddy marked this pull request as ready for review April 30, 2026 14:35
@igorls igorls added this to the v3.3.5 milestone May 1, 2026
@igorls igorls added the area/cli (CLI commands) and enhancement (New feature or request) labels May 2, 2026
@igorls igorls modified the milestones: v3.3.5, v3.4 May 3, 2026
@igorls
Member

igorls commented May 3, 2026

Moved to v3.4 milestone — --mode reconcile is an additive feature (new repair mode that re-embeds SQL-only orphan rows). v3.3.5 is scoped to maintenance / bug fixes; new modes belong in the next feature release. Review continues — we'll merge in the v3.4 cycle.
