
migrate and repair rebuild trigger unbounded link_lists.bin growth on large palaces #1046

@sha2fiddy


Summary

Both mempalace migrate and python -m mempalace.repair rebuild re-insert every drawer into a fresh chromadb collection. On a 199,539-drawer / 1.7 GB palace, each path independently produced a multi-hundred-GB sparse link_lists.bin and filled the disk. These are two reproducible entry points for the hnswlib resize-drift bug documented in #344 that are not currently covered by any report, PR, or safeguard.

The root cause is upstream (chroma-core/chroma#2594: hnswlib's link_lists.bin grows unbounded, acknowledged as "by design"), but mempalace is the caller, performs no disk-space or post-write integrity checks on these paths, and recommends repair rebuild as the recovery from HNSW corruption, so the recovery path reproduces the very bug it is meant to fix.

Environment

  • mempalace develop @ 32ec74d (v3.3.0), installed via uv tool install --python 3.12
  • chromadb 1.5.8
  • Python 3.12, macOS 15 (Darwin 25.4.0)
  • Palace: 199,539 drawers, 1.7 GB on disk pre-migration, stable on chromadb 0.6.3 for weeks prior

Variant A — mempalace migrate --yes (0.6.3 → 1.5.8 schema)

Migration ran ~1h 45min, wrote the pre-migration backup, and exited with zero errors. The final HNSW write step silently corrupted the new collection:

| File                  | On disk | Logical | State              |
|-----------------------|---------|---------|--------------------|
| data_level0.bin       | 349 MB  | 349 MB  | OK                 |
| link_lists.bin        | 647 GB  | 5.04 TB | sparse, runaway    |
| index_metadata.pickle | small   | small   | truncated mid-write|

The palace is unopenable: loading the pickle raises _pickle.UnpicklingError: pickle data was truncated. Migrate does not verify HNSW state or pickle integrity before reporting success.
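Both symptoms are detectable after the fact without opening the collection. A standalone post-mortem sketch (not mempalace code; the size-ratio threshold is an arbitrary heuristic) compares each file's apparent length against its actual allocation (st_blocks is in 512-byte units on POSIX, including macOS) and tries to unpickle the metadata:

```python
import os
import pickle

def check_hnsw_segment(seg_dir: str) -> list[str]:
    """Flag sparse runaway files and truncated pickles in an HNSW segment dir."""
    problems = []
    for name in ("data_level0.bin", "link_lists.bin"):
        path = os.path.join(seg_dir, name)
        if not os.path.exists(path):
            continue
        st = os.stat(path)
        on_disk = st.st_blocks * 512       # actual allocation (POSIX)
        logical = st.st_size               # apparent file length
        if logical > 2 * max(on_disk, 1):  # heuristic sparseness threshold
            problems.append(
                f"{name}: sparse ({on_disk} B on disk, {logical} B logical)")
    meta = os.path.join(seg_dir, "index_metadata.pickle")
    if os.path.exists(meta):
        try:
            with open(meta, "rb") as f:
                pickle.load(f)
        except (pickle.UnpicklingError, EOFError) as exc:
            problems.append(f"index_metadata.pickle: {exc}")
    return problems
```

On the segment above this would report link_lists.bin as sparse and the pickle as truncated before any attempt to open the palace.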

Reproduce: on a palace of ~100K+ drawers on chromadb 0.6.3, run mempalace migrate --yes to 1.5.x. Watch du -sh ~/.mempalace/palace/*/link_lists.bin during the final write phase.

Variant B — python -m mempalace.repair rebuild

This path runs without any schema migration. After quarantining the corrupt segment from Variant A (rename → .drift-<date>) and invoking repair rebuild, a new collection UUID was created and the upsert phase began:

  • link_lists.bin reached 122 GB on disk / 223 GB logical in under 20 min before a disk-usage watchdog tripped.
  • pkill -9 -f "mempalace.repair" returned without confirming child-process death. Workers kept writing to the deleted file (open FDs), so disk was not reclaimed until a second pkill pass closed all FDs (~450 GB reclaim).

repair rebuild is the documented recovery path from HNSW corruption, yet it runs the same hnswlib writer that caused the corruption. There is no escape hatch.

Reproduce: on any palace of ~50K+ drawers on chromadb 1.5.x, python -m mempalace.repair rebuild. Watch du -sh ~/.mempalace/palace/*/link_lists.bin during upsert.
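The failed pkill in Variant B suggests the kill step needs confirmation before any file deletion. A defensive sketch of that step (hypothetical helper, not in repair.py) re-checks with pgrep and only reports success once no workers survive, since deleted-but-open FDs keep the space allocated:

```python
import subprocess
import time

def kill_and_confirm(pattern: str, attempts: int = 5, wait: float = 1.0) -> bool:
    """SIGKILL all processes matching `pattern` and confirm they are gone.

    Returns True only once pgrep finds no survivors; only then is it safe
    to delete segment files (deleted-but-open FDs block disk reclaim).
    """
    for _ in range(attempts):
        subprocess.run(["pkill", "-9", "-f", pattern], check=False)
        time.sleep(wait)
        survivors = subprocess.run(
            ["pgrep", "-f", pattern], capture_output=True, text=True
        )
        if survivors.returncode != 0:  # pgrep exits 1 when nothing matches
            return True
    return False
```

Only after this returns True should the watchdog rm any segment file; otherwise the ~450 GB of space observed here stays pinned by open FDs.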


Suggested mitigations (mempalace-side, pending upstream fix)

  1. Pre-flight disk check in migrate and repair rebuild: abort if free space < 10× palace size.
  2. Live size guard during HNSW writes: if link_lists.bin exceeds N× data_level0.bin, stop and restore backup.
  3. Post-op integrity check: open the new collection, confirm count() matches source and index_metadata.pickle unpickles, before reporting success.
  4. Apply the hnsw:batch_size / hnsw:sync_threshold overrides from PR #346 ("prevent HNSW index bloat from resize+persist cycles") at the collection-creation sites in migrate.py and repair.py, not only miner.py.
  5. Fix watchdog termination in repair.py: after pkill -9, re-pgrep and retry before rm-ing any segment file (open-but-deleted FDs block reclaim).
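Mitigations 1 and 2 could be combined into a single guard around the rebuild. A context-manager sketch under assumed names (HnswWriteGuard and its thresholds are illustrative, not mempalace API) does the pre-flight free-space check and then polls the on-disk ratio in the background:

```python
import os
import shutil
import threading

class HnswWriteGuard:
    """Pre-flight free-space check plus a background poll that trips if
    link_lists.bin outgrows data_level0.bin by more than `ratio`.
    Sketch of mitigations 1 and 2; all names and thresholds illustrative."""

    def __init__(self, seg_dir: str, palace_bytes: int,
                 headroom: int = 10, ratio: int = 50, poll: float = 5.0):
        free = shutil.disk_usage(seg_dir).free
        if free < headroom * palace_bytes:
            raise RuntimeError(
                f"pre-flight abort: {free} B free < {headroom}x palace size")
        self.seg_dir, self.ratio, self.poll = seg_dir, ratio, poll
        self.tripped = threading.Event()
        self._stop = threading.Event()
        self._t = threading.Thread(target=self._watch, daemon=True)

    def _size(self, name: str) -> int:
        path = os.path.join(self.seg_dir, name)
        return os.stat(path).st_blocks * 512 if os.path.exists(path) else 0

    def _watch(self):
        while not self._stop.wait(self.poll):
            links = self._size("link_lists.bin")
            level0 = max(self._size("data_level0.bin"), 1)
            if links > self.ratio * level0:
                self.tripped.set()  # caller should stop and restore backup
                return

    def __enter__(self):
        self._t.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._t.join()
```

A rebuild loop would check guard.tripped between upsert batches and restore the pre-migration backup when it is set, instead of writing until the disk fills.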

Current workaround

Restored pre-migration backup, pinned chromadb==0.6.3 in pyproject.toml, reinstalled via uv tool install --force, re-applied the macOS CPU-embedding-provider patch (abandoned branch fix/cpu-embedding-provider-macos). Palace opens, count() == 199540. Will not run migrate or repair rebuild until this is resolved.

Labels: bug, performance, storage