
migrate and repair rebuild trigger unbounded link_lists.bin growth on large palaces #1046

@sha2fiddy


Summary

Both mempalace migrate and python -m mempalace.repair rebuild re-insert every drawer into a fresh chromadb collection. On a 199,539-drawer / 1.7 GB palace, each path independently produced a multi-hundred-GB sparse link_lists.bin and filled the disk. These are two reproducible entry points for the hnswlib resize-drift bug documented in #344 that are not currently covered by any report, PR, or safeguard.

The root cause is upstream (chroma-core/chroma#2594: hnswlib's link_lists.bin grows unbounded, acknowledged as "by design"), but mempalace is the caller, performs no disk-space or post-write integrity checks on these paths, and recommends repair rebuild as the recovery from HNSW corruption, so the recovery path reproduces the very bug it is meant to fix.

Environment

  • mempalace develop @ 32ec74d (v3.3.0), installed via uv tool install --python 3.12
  • chromadb 1.5.8
  • Python 3.12, macOS 15 (Darwin 25.4.0)
  • Palace: 199,539 drawers, 1.7 GB on disk pre-migration, stable on chromadb 0.6.3 for weeks prior

Variant A — mempalace migrate --yes (0.6.3 → 1.5.8 schema)

Migration ran ~1h 45min, wrote the pre-migration backup, and exited with zero errors. The final HNSW write step silently corrupted the new collection:

| File                  | On disk | Logical | State              |
|-----------------------|---------|---------|--------------------|
| data_level0.bin       | 349 MB  | 349 MB  | OK                 |
| link_lists.bin        | 647 GB  | 5.04 TB | sparse, runaway    |
| index_metadata.pickle | small   | small   | truncated mid-write|

The palace is unopenable: loading the pickle raises _pickle.UnpicklingError: pickle data was truncated. Migrate does not verify HNSW state or pickle integrity before reporting success.
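Both symptoms are detectable after the fact without opening the collection. A standalone post-mortem sketch (not mempalace code; the size-ratio threshold is an arbitrary heuristic) compares each file's apparent length against its actual allocation (st_blocks is in 512-byte units on POSIX, including macOS) and tries to unpickle the metadata:

```python
import os
import pickle

def check_hnsw_segment(seg_dir: str) -> list[str]:
    """Flag sparse runaway files and truncated pickles in an HNSW segment dir."""
    problems = []
    for name in ("data_level0.bin", "link_lists.bin"):
        path = os.path.join(seg_dir, name)
        if not os.path.exists(path):
            continue
        st = os.stat(path)
        on_disk = st.st_blocks * 512       # actual allocation (POSIX)
        logical = st.st_size               # apparent file length
        if logical > 2 * max(on_disk, 1):  # heuristic sparseness threshold
            problems.append(
                f"{name}: sparse ({on_disk} B on disk, {logical} B logical)")
    meta = os.path.join(seg_dir, "index_metadata.pickle")
    if os.path.exists(meta):
        try:
            with open(meta, "rb") as f:
                pickle.load(f)
        except (pickle.UnpicklingError, EOFError) as exc:
            problems.append(f"index_metadata.pickle: {exc}")
    return problems
```

On the segment above this would report link_lists.bin as sparse and the pickle as truncated before any attempt to open the palace.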

Reproduce: on a palace of ~100K+ drawers on chromadb 0.6.3, run mempalace migrate --yes to 1.5.x. Watch du -sh ~/.mempalace/palace/*/link_lists.bin during the final write phase.

Variant B — python -m mempalace.repair rebuild

This path runs without any schema migration. After quarantining the corrupt segment from Variant A (rename → .drift-<date>) and invoking repair rebuild, a new collection UUID was created and the upsert phase began:

  • link_lists.bin reached 122 GB on disk / 223 GB logical in under 20 min before a disk-usage watchdog tripped.
  • pkill -9 -f "mempalace.repair" returned without confirming child-process death. Workers kept writing to the deleted file (open FDs), so disk was not reclaimed until a second pkill pass closed all FDs (~450 GB reclaim).

repair rebuild is the documented recovery path from HNSW corruption, yet it runs the same hnswlib writer that caused the corruption. There is no escape hatch.

Reproduce: on any palace of ~50K+ drawers on chromadb 1.5.x, python -m mempalace.repair rebuild. Watch du -sh ~/.mempalace/palace/*/link_lists.bin during upsert.
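The failed pkill in Variant B suggests the kill step needs confirmation before any file deletion. A defensive sketch of that step (hypothetical helper, not in repair.py) re-checks with pgrep and only reports success once no workers survive, since deleted-but-open FDs keep the space allocated:

```python
import subprocess
import time

def kill_and_confirm(pattern: str, attempts: int = 5, wait: float = 1.0) -> bool:
    """SIGKILL all processes matching `pattern` and confirm they are gone.

    Returns True only once pgrep finds no survivors; only then is it safe
    to delete segment files (deleted-but-open FDs block disk reclaim).
    """
    for _ in range(attempts):
        subprocess.run(["pkill", "-9", "-f", pattern], check=False)
        time.sleep(wait)
        survivors = subprocess.run(
            ["pgrep", "-f", pattern], capture_output=True, text=True
        )
        if survivors.returncode != 0:  # pgrep exits 1 when nothing matches
            return True
    return False
```

Only after this returns True should the watchdog rm any segment file; otherwise the ~450 GB of space observed here stays pinned by open FDs.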


Suggested mitigations (mempalace-side, pending upstream fix)

  1. Pre-flight disk check in migrate and repair rebuild: abort if free space < 10× palace size.
  2. Live size guard during HNSW writes: if link_lists.bin exceeds N× data_level0.bin, stop and restore backup.
  3. Post-op integrity check: open the new collection, confirm count() matches source and index_metadata.pickle unpickles, before reporting success.
  4. Apply the hnsw:batch_size / hnsw:sync_threshold overrides from PR #346 ("prevent HNSW index bloat from resize+persist cycles") at the collection-creation sites in migrate.py and repair.py, not only miner.py.
  5. Fix watchdog termination in repair.py: after pkill -9, re-pgrep and retry before rm-ing any segment file (open-but-deleted FDs block reclaim).
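Mitigations 1 and 2 could be combined into a single guard around the rebuild. A context-manager sketch under assumed names (HnswWriteGuard and its thresholds are illustrative, not mempalace API) does the pre-flight free-space check and then polls the on-disk ratio in the background:

```python
import os
import shutil
import threading

class HnswWriteGuard:
    """Pre-flight free-space check plus a background poll that trips if
    link_lists.bin outgrows data_level0.bin by more than `ratio`.
    Sketch of mitigations 1 and 2; all names and thresholds illustrative."""

    def __init__(self, seg_dir: str, palace_bytes: int,
                 headroom: int = 10, ratio: int = 50, poll: float = 5.0):
        free = shutil.disk_usage(seg_dir).free
        if free < headroom * palace_bytes:
            raise RuntimeError(
                f"pre-flight abort: {free} B free < {headroom}x palace size")
        self.seg_dir, self.ratio, self.poll = seg_dir, ratio, poll
        self.tripped = threading.Event()
        self._stop = threading.Event()
        self._t = threading.Thread(target=self._watch, daemon=True)

    def _size(self, name: str) -> int:
        path = os.path.join(self.seg_dir, name)
        return os.stat(path).st_blocks * 512 if os.path.exists(path) else 0

    def _watch(self):
        while not self._stop.wait(self.poll):
            links = self._size("link_lists.bin")
            level0 = max(self._size("data_level0.bin"), 1)
            if links > self.ratio * level0:
                self.tripped.set()  # caller should stop and restore backup
                return

    def __enter__(self):
        self._t.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._t.join()
```

A rebuild loop would check guard.tripped between upsert batches and restore the pre-migration backup when it is set, instead of writing until the disk fills.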

Current workaround

Restored pre-migration backup, pinned chromadb==0.6.3 in pyproject.toml, reinstalled via uv tool install --force, re-applied the macOS CPU-embedding-provider patch (abandoned branch fix/cpu-embedding-provider-macos). Palace opens, count() == 199540. Will not run migrate or repair rebuild until this is resolved.

Labels: bug, performance, storage