fix(compress): drop and recreate `mempalace_compressed` to avoid HNSW `link_lists.bin` inflation #1273
Conversation
Repeated upserts to the `mempalace_compressed` collection across runs cause the HNSW `link_lists.bin` sparse file to grow without GC, eventually filling the disk (observed: 1.7 TB physical, 17 TB logical, on macOS ARM with chromadb 1.5.8). Drop and recreate the collection at the start of each compress run, so the HNSW index is rebuilt from scratch each time.

Re-vectorizing ~10K embeddings costs a few minutes on the local ONNX backend, far cheaper than risking TBs of disk. The miner code already does the equivalent (delete by `source_file` before re-insert, see `miner.py:718`) to work around the same hnswlib behavior. This brings the compress path in line.

Related: MemPalace#1092
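The drop-and-recreate step described above can be sketched as follows. This is illustrative only: `fresh_compressed_collection` and `FakeClient` are hypothetical names, and the stub stands in for a chromadb-style client (whose real API does expose `delete_collection(name=...)` and `get_or_create_collection(name=...)`):

```python
class FakeClient:
    """Minimal in-memory stand-in for a chromadb-style client (illustrative only)."""

    def __init__(self):
        self.collections = {}

    def delete_collection(self, name):
        if name not in self.collections:
            raise ValueError(f"no such collection: {name}")
        del self.collections[name]

    def get_or_create_collection(self, name):
        return self.collections.setdefault(name, object())


def fresh_compressed_collection(client, name="mempalace_compressed"):
    """Drop the collection if it exists, then recreate it.

    Each compress run therefore starts with an empty HNSW index, so
    link_lists.bin cannot accumulate garbage across runs.
    """
    try:
        client.delete_collection(name=name)
    except Exception:
        pass  # first run: nothing to drop yet
    return client.get_or_create_collection(name=name)


client = FakeClient()
first = fresh_compressed_collection(client)
second = fresh_compressed_collection(client)
assert first is not second  # every run gets a brand-new collection/index
```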
Attempted to rebase onto develop, but the conflict surfaces a semantic problem we should resolve before merging.

Context: #1244 (merged) renamed the compress-target collection from …

Problem: this PR's fix — …

Suggested re-scoping: instead of drop+recreate, delete only the IDs this compress run is about to upsert, then upsert. That preserves the GC behavior locally without affecting other writers' data:

```python
ids_to_write = [doc_id for doc_id, *_ in compressed_entries]
try:
    comp_col.delete(ids=ids_to_write)
except Exception:
    pass
comp_col.upsert(...)
```

Happy to push that as a follow-up commit if you want, but flagging here in case there's a reason to take a different approach (e.g., a separate dedicated collection for compress-run scratch data, with a periodic compaction job).
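As a quick sanity check of the suggested re-scoping, here is a minimal sketch against an in-memory stand-in (`FakeCollection` and `compress_run` are hypothetical, not part of chromadb): repeated runs stay bounded because each run deletes exactly the IDs it is about to write.

```python
class FakeCollection:
    """In-memory stand-in for a chromadb collection (illustrative only)."""

    def __init__(self):
        self.docs = {}

    def delete(self, ids):
        for doc_id in ids:
            self.docs.pop(doc_id, None)

    def upsert(self, ids, documents):
        for doc_id, doc in zip(ids, documents):
            self.docs[doc_id] = doc


def compress_run(col, entries):
    """Scoped delete-then-upsert: remove only this run's IDs, then write them."""
    ids = [doc_id for doc_id, _ in entries]
    try:
        col.delete(ids=ids)
    except Exception:
        pass  # the collection may not contain these IDs yet
    col.upsert(ids=ids, documents=[text for _, text in entries])


col = FakeCollection()
entries = [("a", "summary-a"), ("b", "summary-b")]
for _ in range(5):  # five repeated compress runs
    compress_run(col, entries)
assert len(col.docs) == 2  # stays proportional to the entry count
```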
Context
Adds a localized fix for one disk-fill vector reported in #1092 (and reproduced in my #1272, now closed as duplicate): the `mempalace compress` code path.

In my case the `mempalace_compressed` collection's `link_lists.bin` grew to 1.7 TB physical / 17 TB logical (sparse) for ~13K entries on a 1.8 TB disk, after running `mempalace compress` twice in one day with the MCP server and Stop/PreCompact hooks active in parallel. This PR doesn't address the underlying chromadb 1.5.8 concurrent-writer issue (that's broader; see #1092 for the full picture), but it removes one practical way the disk fills.

Fix
Drop and recreate the `mempalace_compressed` collection at the start of each compress run, before the upsert loop. Each run starts with a fresh HNSW index, so accumulation in `link_lists.bin` cannot happen.

The diff is small (~5 lines added) in `cmd_compress` (`mempalace/cli.py`).

Trade-off
The cost is re-vectorizing all compressed entries on every compress run. With the local ONNX embedder and ~10K entries this is on the order of minutes, well within the budget for a weekly cron job. The alternative (silent disk fill) is much worse.
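The minutes-scale claim is easy to sanity-check with back-of-envelope arithmetic. The 20 ms per embedding figure below is an assumed throughput for a local ONNX CPU backend, not a measured number:

```python
entries = 10_000
ms_per_embedding = 20  # assumption: ~50 embeddings/s on the local ONNX backend

# total wall time if embeddings are computed sequentially
total_minutes = entries * ms_per_embedding / 1000 / 60

print(f"re-vectorizing {entries} entries: ~{total_minutes:.1f} minutes")
# ~3.3 minutes at the assumed rate; comfortably within a weekly cron budget
```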
This is consistent with what the miner already does: `miner.py:718-726` deletes by `source_file` before re-inserting, to bypass hnswlib's `updatePoint` path that triggers the same kind of behavior. The compress path was the missing parallel.

Testing
Patched a local pipx install with this change and re-ran `mempalace compress` against the same palace several times in a row. `du -sh ~/.mempalace/palace/<compressed_uuid>` stayed proportional to the actual entry count across runs; no growth.

Notes
- …chroma-core/chroma would be the right path for that.
- `verbatim always` is preserved: the user-facing data lives in `mempalace_drawers`, untouched. `mempalace_compressed` is a derived index of AAAK summaries, regenerated from those originals on each compress.

Related: #1092
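The manual `du -sh` check in the Testing section above can be scripted. A sketch, where the 4 KiB-per-entry budget is an arbitrary illustrative threshold and `check_proportional` is a hypothetical helper, not part of mempalace:

```python
import os


def link_lists_sizes(index_dir):
    """Logical and physical (allocated) sizes of link_lists.bin, in bytes.

    st_blocks counts 512-byte units per POSIX, so for a sparse file the
    physical size can be far smaller than the logical size; in the bug
    reported here, both ballooned far past what ~13K entries justify.
    """
    st = os.stat(os.path.join(index_dir, "link_lists.bin"))
    return st.st_size, st.st_blocks * 512


def check_proportional(index_dir, n_entries, bytes_per_entry_budget=4096):
    """True if the allocated size stays within an illustrative per-entry budget."""
    _logical, physical = link_lists_sizes(index_dir)
    return physical <= n_entries * bytes_per_entry_budget
```

Running this after each compress run (instead of eyeballing `du` output) would catch regrowth of the index long before it threatens the disk.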