feat(repair): --mode reconcile to re-embed SQL-only orphan rows #1276
Open
sha2fiddy wants to merge 5 commits into MemPalace:develop from
Conversation
Targeted HNSW segment rebuild for palaces corrupted by the chromadb migrate/repair-rebuild resize-drift bug (MemPalace#1046). Rebuilds a single segment in place from data_level0.bin without re-embedding, without touching the rest of the palace, and without invoking the buggy chromadb rebuild path that produced the original corruption.

New CLI:
mempalace repair --mode hnsw --segment <uuid> [--max-elements N] [--backup/--no-backup] [--purge-queue] [--quarantine-orphans] [--dry-run]

Legacy repair path (--mode legacy, default) unchanged except for a one-line offset fix matching the shape already fixed at repair.py:252.

Implementation:
- Reads vectors directly from data_level0.bin via struct parsing
- Reconciles with index_metadata.pickle to drop stale IDs
- Rebuilds via chroma-hnswlib in persistent mode
- Self-query verifies top-1 match on all vectors before swap
- Atomic swap via os.replace with optional timestamped backup
- Optional: purge stuck embeddings_queue rows, quarantine orphan metadata rows to a JSON sidecar

Dependencies: numpy and chroma-hnswlib imported lazily per CONTRIBUTING.md. Missing deps print an install hint and abort gracefully; no changes to pyproject.toml.

Tests: 20 new in test_repair.py (synthetic segment fixture covers header parse, vector extract, dedup, space detect, rebuild, verify, swap, rollback, queue purge, orphan quarantine, dry-run). 5 new in test_cli.py for argparse wiring and dispatch. Full non-benchmark suite: 1092/1092. Ruff clean. Coverage on repair.py: 90%.

End-to-end validation on synthetic palace: 10 MB bloated link_lists.bin rebuilt to 16 bytes in <0.1s, self-query top-1 match on all vectors, queue purged, orphans quarantined, backup written.
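The "atomic swap via os.replace with optional timestamped backup" step could look roughly like the sketch below. It is a file-level illustration only: `atomic_swap_file` and the `.bak-<timestamp>` naming are hypothetical and not taken from the PR, which swaps a whole segment.

```python
import os
import shutil
import time
from pathlib import Path
from typing import Optional


def atomic_swap_file(new_path: Path, live_path: Path, backup: bool = True) -> Optional[Path]:
    """Replace live_path with new_path; optionally keep a timestamped copy first.

    Hypothetical helper for illustration, not the PR's implementation.
    """
    backup_path = None
    if backup and live_path.exists():
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup_path = live_path.with_name(f"{live_path.name}.bak-{stamp}")
        # Copy (not move) so the live file stays in place until the swap.
        shutil.copy2(live_path, backup_path)
    # os.replace overwrites the destination in one step; it is atomic on POSIX
    # when both paths live on the same filesystem.
    os.replace(new_path, live_path)
    return backup_path
```

Building the replacement next to the live file (same directory, different name) keeps both paths on one filesystem, which is what makes the final rename a single step.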
…index_metadata.pickle

ChromaDB 0.6.x wrote index_metadata.pickle as an attribute-style object (meta.label_to_id); chromadb 1.5.x writes it as a dict (meta["label_to_id"]). A palace whose pickle was originally created on the older version raises AttributeError in _reconcile_with_pickle and at the end of rebuild_hnsw_segment.

Adds two small helpers _meta_get / _meta_set that read and write either shape transparently, and routes the four field accesses (label_to_id, id_to_label, id_to_seq_id, total_elements_added) through them. repair --mode hnsw now works on palaces that have lived through both chromadb versions, not just freshly-created 1.5.x ones.
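A minimal sketch of how shape-agnostic accessors of this kind might work. The names `meta_get` and `meta_set` are stand-ins for illustration and the bodies are an assumption, not the PR's actual `_meta_get` / `_meta_set`.

```python
from types import SimpleNamespace
from typing import Any


def meta_get(meta: Any, field: str, default: Any = None) -> Any:
    """Read a field from either a dict-style or attribute-style metadata object."""
    if isinstance(meta, dict):
        return meta.get(field, default)
    return getattr(meta, field, default)


def meta_set(meta: Any, field: str, value: Any) -> None:
    """Write a field to either a dict-style or attribute-style metadata object."""
    if isinstance(meta, dict):
        meta[field] = value
    else:
        setattr(meta, field, value)


# Both shapes go through the same call sites.
old_style = SimpleNamespace(label_to_id={0: "a"}, total_elements_added=1)
new_style = {"label_to_id": {0: "a"}, "total_elements_added": 1}
for meta in (old_style, new_style):
    meta_set(meta, "total_elements_added", meta_get(meta, "total_elements_added") + 1)
```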
Two false-positive failure modes on healthy rebuilt indexes:

1. Mined corpora regularly contain byte-identical near-duplicate vectors (e.g. the same code snippet pasted across multiple transcripts). On a correctly rebuilt index, a duplicate can legitimately rank #1 instead of the original, so verify fails on a rebuild that's actually fine.
2. hnswlib's default ef (~10) is too tight a search beam for ~500k-element indexes with M=16; even a byte-identical self-match can be missed because its neighborhood in the HNSW graph is sparse.

Loosens the assertion to "self appears in top-k=10" and bumps index.set_ef(max(200, k*4)) before querying. Both narrow false positives; neither weakens detection of real corruption (a truly broken index still misses its own labels everywhere, not just outside top-1). ChromaDB sets its own ef at query time, so this only affects the verify step.

Found running --mode hnsw against a real ~500k-drawer palace.
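The loosened check can be sketched as follows. `self_query_misses` is a hypothetical name; the only index methods it relies on are `set_ef` and `knn_query`, which hnswlib's Python `Index` provides, and the rest is an assumption about how the verify step might be structured.

```python
from typing import List, Sequence


def self_query_misses(
    index,
    vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 10,
) -> List[int]:
    """Return the labels whose own vector does not come back within the top-k.

    `index` is anything exposing hnswlib-style set_ef() and knn_query().
    """
    k = min(k, len(labels))  # cannot ask for more neighbours than elements
    # Widen the search beam so sparse neighbourhoods do not cause false misses.
    index.set_ef(max(200, k * 4))
    found, _distances = index.knn_query(vectors, k=k)
    return [
        label
        for label, row in zip(labels, found)
        if label not in {int(x) for x in row}
    ]
```

With byte-identical duplicates, a top-1 check reports a miss for whichever copy loses the tie, while the top-k form accepts any result set that contains the queried label.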
Some chromadb crash modes (chroma-core/chroma#6979) commit embeddings/embedding_metadata rows transactionally to SQL but lose the corresponding HNSW additions, leaving drawers visible to metadata queries but unreachable to vector search. Hit on a real 533k-drawer palace 2026-04-29: a single 04-28 mining run produced 10,404 such SQL-only orphans. PR MemPalace#1271's pickle-counts gate would not have caught this: the pickle was internally consistent for what landed; the failure mode is a partial-flush ingest, not pickle corruption.

`mempalace repair --mode reconcile --segment <uuid>` finds those orphans, embeds their `chroma:document` payloads with the palace's configured EF, extracts existing vectors from `data_level0.bin` (no re-embed of healthy rows), builds a fresh persistent HNSW index containing both, and atomic-swaps it in. Same safety profile as `--mode hnsw`: single-threaded writes, pickle persisted exactly once at end, rollback on any failure.

Verified end-to-end against a live 533,342-drawer palace: 533,342 == 533,342 strict zero-diff post-run, self-query passes, search returns reconciled drawers as top hits.

Tests: 6 new in tests/test_repair.py: METADATA-sibling auto-detection, dry-run no-mutation, happy-path append + pickle update, no-orphans short-circuit, off-by-one pickle abort, missing-METADATA abort. 54/54 in test_repair.py and 1349/1349 in the full suite pass. ruff clean. No new dependencies.
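Conceptually, finding SQL-only orphans is a set difference between the IDs recorded in SQL and the IDs the HNSW index knows about. The sketch below assumes a simplified version of the two tables (only the columns used here) and a hypothetical function name; it is not the PR's query.

```python
import sqlite3
from typing import Dict, Iterable


def find_sql_only_orphans(
    conn: sqlite3.Connection,
    metadata_segment_id: str,
    indexed_ids: Iterable[str],
) -> Dict[str, str]:
    """Map embedding_id -> document text for rows present in SQL but absent from the index.

    Assumed, simplified schema:
      embeddings(id, segment_id, embedding_id)
      embedding_metadata(id, key, string_value)
    `indexed_ids` would come from the index side, e.g. the keys of id_to_label.
    """
    indexed = set(indexed_ids)
    rows = conn.execute(
        """
        SELECT e.embedding_id, m.string_value
        FROM embeddings AS e
        JOIN embedding_metadata AS m ON m.id = e.id
        WHERE e.segment_id = ? AND m.key = 'chroma:document'
        """,
        (metadata_segment_id,),
    )
    # Anything SQL knows about that the index does not is an orphan to re-embed.
    return {eid: doc for eid, doc in rows if eid not in indexed}
```

The returned documents are what would then be passed to the embedding function, so healthy rows never need to be re-embedded.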
f1e16ea to ef6d21e
Contributor
Author
Rebased on upstream/develop @fdfaf01, conflict resolved in
Member
Moved to v3.4 milestone
What
`mempalace repair --mode reconcile --segment <uuid>` finds drawers that have rows in `embedding_metadata` but no HNSW label, re-embeds them, rebuilds the segment with old + new vectors, and atomic-swaps it in.

Hit my own 533k-drawer palace 2026-04-29: one 04-28 mining run left 10,404 SQL-only orphans. chromadb crashed mid-flush, SQL committed transactionally, HNSW additions lost. PR #1271's pickle-counts gate doesn't catch this because the pickle was internally consistent for what actually landed.
Tradeoffs
- `_extract_vectors`, `_build_persistent_index`, `_atomic_swap_segment`. Inlining would have been ~500 LOC of duplication. The cost of stacking is that this can't merge until #1126 (feat: add mempalace repair --mode hnsw --segment <uuid> (#1046)) does.
- bdc58f8f. Modifying the live pickle in place is exactly the failure mode I'm trying not to repeat.
- `--metadata-segment` flag handles the rare ambiguous case.

Tests
6 new in `test_repair.py`: auto-detect, dry-run, happy-path append, no-orphans short-circuit, off-by-one pickle abort, missing-METADATA abort. 54/54 in `test_repair.py`, 1349/1349 full suite, ruff clean, no new deps.

Verified against my live 533,342-drawer palace. Strict zero-diff after the run, and `mempalace search` returns reconciled drawers as top hits.