fix(compress): drop and recreate mempalace_compressed to avoid HNSW link_lists.bin inflation #1273

Open
guilhermefriol wants to merge 1 commit into MemPalace:develop from guilhermefriol:fix/compress-hnsw-link-lists-bloat

Conversation

@guilhermefriol

Context

Adds a localized fix for one disk-fill vector reported in #1092 (and reproduced in my #1272, now closed as duplicate): the mempalace compress code path.

In my case the mempalace_compressed collection's link_lists.bin grew to 1.7 TB physical / 17 TB logical (sparse) for ~13K entries on a 1.8 TB disk, after running mempalace compress twice in one day with the MCP server and Stop/PreCompact hooks active in parallel. This PR doesn't address the underlying chromadb 1.5.8 concurrent-writer issue (that's broader — see #1092 for the full picture), but it removes one practical way the disk fills.

Fix

Drop and recreate the mempalace_compressed collection at the start of each compress run, before the upsert loop. Each run starts with a fresh HNSW index, so accumulation in link_lists.bin cannot happen.
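A minimal sketch of the shape of the fix (the helper name and the stub below are illustrative, not the actual cmd_compress code; with a real chromadb PersistentClient the calls involved are client.delete_collection(name=...) and client.create_collection(name=...)):

```python
def reset_compressed_collection(client, name="mempalace_compressed"):
    """Drop the collection if present, then recreate it empty.

    An empty collection means a brand-new HNSW index, so
    link_lists.bin cannot accumulate growth across compress runs.
    """
    try:
        client.delete_collection(name=name)
    except Exception:
        pass  # first run: nothing to drop yet
    return client.create_collection(name=name)
```

The try/except covers the very first compress run, where the collection does not exist yet and delete_collection raises.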

The diff is small (~5 lines added) in cmd_compress (mempalace/cli.py).

Trade-off

The cost is re-vectorizing all compressed entries on every compress run. With the local ONNX embedder and ~10K entries this takes on the order of minutes, well within the budget for a weekly cron job. The alternative (a silent disk fill) is much worse.

This is consistent with what the miner already does: miner.py:718-726 deletes by source_file before re-inserting to bypass hnswlib's updatePoint path that triggers the same kind of behavior. The compress path was the missing parallel.
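The miner's pattern looks roughly like this (function and parameter names are illustrative, not the actual miner.py code; the where-filter delete and upsert are standard chromadb collection calls):

```python
def replace_entries_for_source(col, source_file, ids, documents, embeddings):
    """Delete everything previously inserted for source_file, then insert fresh.

    Deleting first means every vector enters the HNSW graph as a new
    addition, bypassing hnswlib's updatePoint path that drives the
    link_lists.bin growth.
    """
    col.delete(where={"source_file": source_file})
    col.upsert(
        ids=ids,
        documents=documents,
        embeddings=embeddings,
        metadatas=[{"source_file": source_file} for _ in ids],
    )
```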

Testing

Patched a local pipx install with this change and re-ran mempalace compress against the same palace several times in a row. du -sh ~/.mempalace/palace/<compressed_uuid> stayed proportional to the actual entry count across runs — no growth.

Notes

  • Doesn't fix the underlying chromadb 1.x sparse-file behavior. A separate upstream report in chroma-core/chroma would be the right path for that.
  • User-facing data is always preserved verbatim: it lives in mempalace_drawers, which is untouched. mempalace_compressed is a derived index of AAAK summaries, regenerated from those originals on each compress.

Related: #1092

fix(compress): drop and recreate mempalace_compressed to avoid HNSW link_lists.bin inflation

Repeated upserts to the mempalace_compressed collection across runs
cause the HNSW link_lists.bin sparse file to grow without GC,
eventually filling the disk (observed: 1.7 TB physical, 17 TB logical,
on macOS ARM with chromadb 1.5.8).

Drop and recreate the collection at the start of each compress run so
the HNSW index is rebuilt from scratch each time. Re-vectorizing ~10K
embeddings costs a few minutes on the local ONNX backend; far cheaper
than risking TBs of disk.

The miner code already does the equivalent (delete-by-source_file
before re-insert, see miner.py:718) for the same hnswlib behavior.
This brings the compress path in line.

Related: MemPalace#1092

igorls commented May 6, 2026

Attempted to rebase onto develop but the conflict surfaces a semantic problem we should resolve before merging:

Context: #1244 (merged) renamed the compress-target collection from mempalace_compressed to the shared mempalace_closets. After that change, the same collection is now also written to by mempalace mine and regenerate_closets (#1107).

Problem: This PR's fix — backend.delete_collection(palace_path, "mempalace_compressed") followed by recreate — was safe when the collection was a dedicated per-run scratch space. Translating it to mempalace_closets (the new target) would silently destroy entries from mining and regenerate_closets, which is much worse than the HNSW link_lists.bin inflation we're trying to fix.

Suggested re-scoping: instead of drop+recreate, delete only the IDs this compress run is about to upsert, then upsert. That preserves the GC behavior locally without affecting other writers' data:

# Delete only the IDs this compress run is about to write, then upsert.
ids_to_write = [doc_id for doc_id, *_ in compressed_entries]
try:
    comp_col.delete(ids=ids_to_write)
except Exception:
    pass  # IDs may not exist yet (first run); nothing to delete
comp_col.upsert(...)

Happy to push that as a follow-up commit if you want, but flagging here in case there's a reason to take a different approach (e.g., a separate dedicated collection for compress run scratch data, with a periodic compaction job).
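To make the difference concrete, a toy in-memory sketch (StubCollection stands in for the chromadb collection API; all names are illustrative) showing that the scoped delete leaves other writers' entries alone, where drop+recreate would destroy them:

```python
class StubCollection:
    """Toy stand-in for a chromadb collection (id -> document)."""
    def __init__(self):
        self.docs = {}

    def delete(self, ids=None):
        for doc_id in ids or []:
            self.docs.pop(doc_id, None)

    def upsert(self, ids, documents):
        for doc_id, doc in zip(ids, documents):
            self.docs[doc_id] = doc

col = StubCollection()
# Entry written by another code path (mining / regenerate_closets):
col.upsert(ids=["mine-1"], documents=["mined entry"])

# Scoped GC: delete only the IDs this compress run is about to write.
ids_to_write = ["comp-1", "comp-2"]
col.delete(ids=ids_to_write)
col.upsert(ids=ids_to_write, documents=["summary 1", "summary 2"])

assert "mine-1" in col.docs  # other writers' data survives
```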


Labels

area/cli, bug, storage
