fix(compress): drop and recreate mempalace_compressed to avoid HNSW link_lists.bin inflation #1273

Open
guilhermefriol wants to merge 1 commit into MemPalace:develop from guilhermefriol:fix/compress-hnsw-link-lists-bloat

Conversation

@guilhermefriol

Context

Adds a localized fix for one disk-fill vector reported in #1092 (and reproduced in my #1272, now closed as duplicate): the mempalace compress code path.

In my case the mempalace_compressed collection's link_lists.bin grew to 1.7 TB physical / 17 TB logical (sparse) for ~13K entries on a 1.8 TB disk, after running mempalace compress twice in one day with the MCP server and Stop/PreCompact hooks active in parallel. This PR doesn't address the underlying chromadb 1.5.8 concurrent-writer issue (that's broader — see #1092 for the full picture), but it removes one practical way the disk fills.

Fix

Drop and recreate the mempalace_compressed collection at the start of each compress run, before the upsert loop. Each run starts with a fresh HNSW index, so accumulation in link_lists.bin cannot happen.
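A minimal sketch of the shape of the fix (the helper name and the stub below are illustrative, not the actual cmd_compress code; with a real chromadb PersistentClient the calls involved are client.delete_collection(name=...) and client.create_collection(name=...)):

```python
def reset_compressed_collection(client, name="mempalace_compressed"):
    """Drop the collection if present, then recreate it empty.

    An empty collection means a brand-new HNSW index, so
    link_lists.bin cannot accumulate growth across compress runs.
    """
    try:
        client.delete_collection(name=name)
    except Exception:
        pass  # first run: nothing to drop yet
    return client.create_collection(name=name)
```

The try/except covers the very first compress run, where the collection does not exist yet and delete_collection raises.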

The diff is small (~5 lines added) in cmd_compress (mempalace/cli.py).

Trade-off

The cost is re-vectorizing all compressed entries on every compress run. With the local ONNX embedder and ~10K entries this takes on the order of minutes, well within the budget for a weekly cron job. The alternative (a silent disk fill) is much worse.

This is consistent with what the miner already does: miner.py:718-726 deletes by source_file before re-inserting to bypass hnswlib's updatePoint path that triggers the same kind of behavior. The compress path was the missing parallel.
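The miner's pattern looks roughly like this (function and parameter names are illustrative, not the actual miner.py code; the where-filter delete and upsert are standard chromadb collection calls):

```python
def replace_entries_for_source(col, source_file, ids, documents, embeddings):
    """Delete everything previously inserted for source_file, then insert fresh.

    Deleting first means every vector enters the HNSW graph as a new
    addition, bypassing hnswlib's updatePoint path that drives the
    link_lists.bin growth.
    """
    col.delete(where={"source_file": source_file})
    col.upsert(
        ids=ids,
        documents=documents,
        embeddings=embeddings,
        metadatas=[{"source_file": source_file} for _ in ids],
    )
```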

Testing

Patched a local pipx install with this change and re-ran mempalace compress against the same palace several times in a row. du -sh ~/.mempalace/palace/<compressed_uuid> stayed proportional to the actual entry count across runs — no growth.

Notes

  • Doesn't fix the underlying chromadb 1.x sparse-file behavior. A separate upstream report in chroma-core/chroma would be the right path for that.
  • User-facing data is always preserved verbatim: it lives in mempalace_drawers, which is untouched. mempalace_compressed is a derived index of AAAK summaries, regenerated from those originals on each compress.

Related: #1092

fix(compress): drop and recreate mempalace_compressed to avoid HNSW link_lists.bin inflation

Repeated upserts to the mempalace_compressed collection across runs
cause the HNSW link_lists.bin sparse file to grow without GC,
eventually filling the disk (observed: 1.7 TB physical, 17 TB logical,
on macOS ARM with chromadb 1.5.8).

Drop and recreate the collection at the start of each compress run so
the HNSW index is rebuilt from scratch each time. Re-vectorizing ~10K
embeddings costs a few minutes on the local ONNX backend; far cheaper
than risking TBs of disk.

The miner code already does the equivalent (delete-by-source_file
before re-insert, see miner.py:718) for the same hnswlib behavior.
This brings the compress path in line.

Related: MemPalace#1092

igorls commented May 6, 2026

Attempted to rebase onto develop but the conflict surfaces a semantic problem we should resolve before merging:

Context: #1244 (merged) renamed the compress-target collection from mempalace_compressed to the shared mempalace_closets. After that change, the same collection is now also written to by mempalace mine and regenerate_closets (#1107).

Problem: This PR's fix — backend.delete_collection(palace_path, "mempalace_compressed") followed by recreate — was safe when the collection was a dedicated per-run scratch space. Translating it to mempalace_closets (the new target) would silently destroy entries from mining and regenerate_closets, which is much worse than the HNSW link_lists.bin inflation we're trying to fix.

Suggested re-scoping: instead of drop+recreate, delete only the IDs this compress run is about to upsert, then upsert. That preserves the GC behavior locally without affecting other writers' data:

# Delete only the IDs this compress run is about to write, then upsert.
ids_to_write = [doc_id for doc_id, *_ in compressed_entries]
try:
    comp_col.delete(ids=ids_to_write)
except Exception:
    pass  # IDs may not exist yet (first run); nothing to delete
comp_col.upsert(...)

Happy to push that as a follow-up commit if you want, but flagging here in case there's a reason to take a different approach (e.g., a separate dedicated collection for compress run scratch data, with a periodic compaction job).
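To make the difference concrete, a toy in-memory sketch (StubCollection stands in for the chromadb collection API; all names are illustrative) showing that the scoped delete leaves other writers' entries alone, where drop+recreate would destroy them:

```python
class StubCollection:
    """Toy stand-in for a chromadb collection (id -> document)."""
    def __init__(self):
        self.docs = {}

    def delete(self, ids=None):
        for doc_id in ids or []:
            self.docs.pop(doc_id, None)

    def upsert(self, ids, documents):
        for doc_id, doc in zip(ids, documents):
            self.docs[doc_id] = doc

col = StubCollection()
# Entry written by another code path (mining / regenerate_closets):
col.upsert(ids=["mine-1"], documents=["mined entry"])

# Scoped GC: delete only the IDs this compress run is about to write.
ids_to_write = ["comp-1", "comp-2"]
col.delete(ids=ids_to_write)
col.upsert(ids=ids_to_write, documents=["summary 1", "summary 2"])

assert "mine-1" in col.docs  # other writers' data survives
```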


Labels

area/cli, bug, storage
