Both `mempalace migrate` and `python -m mempalace.repair rebuild` re-insert every drawer into a fresh chromadb collection. On a 199,539-drawer / 1.7 GB palace, each path independently produced a multi-hundred-GB sparse `link_lists.bin` and filled the disk. These are two reproducible entry points for the hnswlib resize-drift bug documented in #344 that are not currently covered by any report, PR, or safeguard.

The root cause is upstream (chroma-core/chroma#2594 — hnswlib `link_lists.bin` grows unbounded, acknowledged as "by design"), but mempalace is the caller: it has no disk-space or post-write integrity checks on these paths, and `repair rebuild` is literally the recommended recovery from HNSW corruption, so it reproduces the very bug it is meant to fix.
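For orientation, both entry points reduce to the same pattern: page every drawer out of the old store and upsert it into a brand-new collection, growing a fresh HNSW index from zero. A minimal sketch of that shared shape, assuming chromadb's standard client API (`rebuild_collection` and the batch size are illustrative, not mempalace's actual code):

```python
import chromadb

def rebuild_collection(src_path: str, dst_path: str, name: str) -> None:
    """Illustrative shape shared by migrate and repair rebuild: re-insert
    every record into a fresh collection. Each upsert batch grows the new
    HNSW index, which is where link_lists.bin balloons on large palaces."""
    src = chromadb.PersistentClient(path=src_path)
    dst = chromadb.PersistentClient(path=dst_path)
    old = src.get_collection(name)
    new = dst.create_collection(name)  # brand-new HNSW index, default capacity

    batch, offset = 5_000, 0
    while True:
        page = old.get(limit=batch, offset=offset,
                       include=["embeddings", "documents", "metadatas"])
        if not page["ids"]:
            break
        new.upsert(ids=page["ids"], embeddings=page["embeddings"],
                   documents=page["documents"], metadatas=page["metadatas"])
        offset += batch
```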
### Environment

- `develop@32ec74d` (v3.3.0), installed via `uv tool install --python 3.12`
- Palace: 199,539 drawers, 1.7 GB on disk pre-migration, stable on chromadb 0.6.3 for weeks prior
### Variant A — `mempalace migrate --yes` (0.6.3 → 1.5.8 schema)
Migration ran ~1h 45min, wrote the pre-migration backup, and exited with zero errors. The final HNSW write step silently corrupted the new collection:
| File | On disk | Logical | State |
| --- | --- | --- | --- |
| `data_level0.bin` | 349 MB | 349 MB | OK |
| `link_lists.bin` | 647 GB | 5.04 TB | sparse, runaway |
| `index_metadata.pickle` | small | small | truncated mid-write |
The palace is unopenable: loading the pickle raises `_pickle.UnpicklingError: pickle data was truncated`. `migrate` does not verify HNSW state or pickle integrity before returning success.
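A post-write check of the kind suggested under mitigations below would catch both observed failure modes before `migrate` reports success. A sketch (the function name and arguments are hypothetical, not existing mempalace code):

```python
import pickle
from pathlib import Path

def verify_hnsw_segment(seg_dir: Path, expected: int, actual: int) -> None:
    """Fail loudly instead of returning success on a corrupt segment:
    check that index_metadata.pickle unpickles and that the drawer
    count matches the source collection."""
    with (seg_dir / "index_metadata.pickle").open("rb") as f:
        # Same load path flagged in chroma-core/chroma#6926; on the
        # corrupt segment this raises "pickle data was truncated".
        pickle.load(f)
    if actual != expected:
        raise RuntimeError(f"drawer count mismatch: {actual} != {expected}")
```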
Reproduce: on a palace of ~100K+ drawers on chromadb 0.6.3, run `mempalace migrate --yes` to 1.5.x. Watch `du -sh ~/.mempalace/palace/*/link_lists.bin` during the final write phase.
### Variant B — `python -m mempalace.repair rebuild`
This variant runs without any schema migration. After quarantining the corrupt segment from Variant A (rename → `.drift-<date>`) and invoking `repair rebuild`, a new collection UUID was created and the upsert phase began:
- `link_lists.bin` reached 122 GB on disk / 223 GB logical in under 20 minutes before a disk-usage watchdog tripped.
- `pkill -9 -f "mempalace.repair"` returned without confirming child-process death. Workers kept writing to the deleted file through open FDs, so disk space was not reclaimed until a second `pkill` pass closed all FDs (~450 GB reclaimed). A kill-and-verify loop, sketched after this list, avoids the race.
- `repair rebuild` is the documented recovery path from HNSW corruption, yet it runs the same hnswlib writer that caused the corruption. There is no escape hatch.
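A watchdog that must delete segment files should not trust a single `pkill -9`. A sketch of the kill-and-verify loop (assumes only standard `pkill`/`pgrep`; the timeout is arbitrary):

```python
import subprocess
import time

def kill_and_wait(pattern: str, timeout: float = 30.0) -> None:
    """SIGKILL every process matching `pattern`, then poll until none
    survive. Only then is it safe to rm segment files: a deleted file
    held open by a live writer keeps its disk blocks allocated."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        subprocess.run(["pkill", "-9", "-f", pattern], check=False)
        alive = subprocess.run(["pgrep", "-f", pattern],
                               capture_output=True, check=False)
        if alive.returncode != 0:  # pgrep exits 1 when nothing matches
            return
        time.sleep(1.0)
    raise TimeoutError(f"processes matching {pattern!r} are still alive")
```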
Reproduce: on any palace of ~50K+ drawers on chromadb 1.5.x, run `python -m mempalace.repair rebuild`. Watch `du -sh ~/.mempalace/palace/*/link_lists.bin` during upsert.
### Related

- #344: the same runaway via `mempalace mine` at 56K drawers; documents the resize-drift analysis.
- `add()` vs `upsert()` on re-mine; a different trigger.
- Runaway `link_lists.bin` during `mine` on chromadb 1.x (threaded race).
- The change that set `hnsw:batch_size=50000` and `hnsw:sync_threshold=50000` for `miner.py`; it does not touch `migrate.py` or `repair.py`.
- The change that added `hnsw:num_threads=1` + `mine_global_lock`; it addresses threading, not resize drift.
- Upstream: chroma-core/chroma#2594 (`hnsw:initial_capacity`, unmerged) and chroma-core/chroma#6926, "[Security] Unsafe pickle.load() in PersistentLocalHnswSegment enables arbitrary code execution (CWE-502)" (pickle RCE; confirms pickle is still on the load path).

### Suggested mitigations (mempalace-side, pending upstream fix)

- Pre-flight disk check in `migrate` and `repair rebuild`: abort if free space < 10× palace size (sketched below, together with the growth check).
- Growth watchdog during the HNSW write: if `link_lists.bin` exceeds N× `data_level0.bin`, stop and restore the backup.
- Post-write verification: confirm `count()` matches the source and `index_metadata.pickle` unpickles before reporting success.
- Apply the `hnsw:batch_size`/`hnsw:sync_threshold` overrides at the collection-creation sites in `migrate.py` and `repair.py`, not only `miner.py`.
- Fix watchdog termination in `repair.py`: after `pkill -9`, re-`pgrep` and retry before `rm`-ing any segment file (open-but-deleted FDs block reclaim).
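The first two mitigations are a few lines each. A sketch, using the 10× factor from above and an assumed N = 4 for the growth ratio (both thresholds are illustrative):

```python
import shutil
from pathlib import Path

def preflight_disk_check(palace: Path, factor: int = 10) -> None:
    """Abort a rebuild before it starts if free space is below
    `factor` x the current on-disk palace size."""
    palace_bytes = sum(p.stat().st_size for p in palace.rglob("*") if p.is_file())
    free = shutil.disk_usage(palace).free
    if free < factor * palace_bytes:
        raise SystemExit(f"aborting: {free / 2**30:.1f} GiB free is less than "
                         f"{factor}x the palace size")

def link_lists_runaway(seg_dir: Path, ratio: float = 4.0) -> bool:
    """Growth watchdog: True once link_lists.bin exceeds `ratio` x
    data_level0.bin, the signature of resize drift; the caller should
    then stop the writer and restore the backup. st_size is the logical
    size, so on the sparse runaway file this trips even sooner than du."""
    link = (seg_dir / "link_lists.bin").stat().st_size
    level0 = (seg_dir / "data_level0.bin").stat().st_size
    return level0 > 0 and link > ratio * level0
```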
### Current workaround
Restored the pre-migration backup, pinned `chromadb==0.6.3` in `pyproject.toml`, reinstalled via `uv tool install --force`, and re-applied the macOS CPU-embedding-provider patch (abandoned branch `fix/cpu-embedding-provider-macos`). The palace opens and `count() == 199540` (see the check below). Will not run `migrate` or `repair rebuild` until this is resolved.
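A post-restore sanity check along these lines confirms the palace is healthy (sketch; the collection name "drawers" is a placeholder, only the palace path appears earlier in this report):

```python
from pathlib import Path
import chromadb

# Post-restore sanity check: the palace opens and the drawer count
# matches. "drawers" is a hypothetical collection name.
client = chromadb.PersistentClient(path=str(Path("~/.mempalace/palace").expanduser()))
collection = client.get_collection("drawers")
assert collection.count() == 199540
```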