fix: guard against data loss in repair, migrate, and CLI rebuild#935
fix: guard against data loss in repair, migrate, and CLI rebuild#935igorls merged 3 commits intoMemPalace:developfrom
Conversation
- repair.py: wrap upsert loop in try/except; restore from backup on failure instead of leaving a partially rebuilt collection - migrate.py: replace non-atomic rmtree+move with rename-aside swap so a crash between the two calls does not destroy both copies - cli.py: use offset += len(batch["ids"]) with empty-batch guard instead of fixed offset += batch_size to prevent skipping drawers
- repair.py: define backup_path before the conditional block so it is always in scope when the except handler references it - migrate.py: restore old palace from .old if both os.rename and shutil.move fail during the swap step
mvalentsev
left a comment
There was a problem hiding this comment.
Finding 1 (repair crash safety) overlaps with open PR #239 (rusel95), which rewrites repair to a migrate-to-fresh-palace strategy: reads all data via get(), writes to a temp palace, verifies counts match, then atomically swaps directories. Original palace preserved as .backup with auto-restore on failure.
Findings 2 and 3 look clean to me.
|
Correction to my earlier review: Finding 2 (migrate swap): PR #890 touches the same swap code path in migrate.py, adding _replace_palace_dir() with retry logic. Different focus (Windows handle cleanup) but related. Finding 3 (cli.py offset): PRs #239 and #632 both rewrite the repair code path entirely (migrate-to-fresh strategy and nuke-rebuild respectively). If either lands, the extraction loop with the offset bug no longer exists. |
|
Hi, In migrate(), if the cross-device fallback (shutil.move) partially creates the destination directory and then fails, the rollback os.rename(stale_path, palace_path) will also fail because palace_path already exists, leaving a partial migrated palace at the target and the original only at .old. Severity: action required | Category: reliability How to fix: Harden rollback and cleanup Agent prompt to fix - you can give this to your LLM of choice:
Found by Qodo code review |
shutil.move() can partially create palace_path before raising, which would trip a bare os.replace(stale_path, palace_path) rollback (dest exists). - Switch the primary swap to os.replace so same-filesystem moves stay atomic - Branch on errno.EXDEV before falling back to shutil.move, so real errors (permissions, EIO) surface instead of silently attempting a slow copy - Extract rollback into _restore_stale_palace which clears any partial destination and, if the restore itself fails, logs both stale_path and palace_path so the operator can recover by hand Adds three regression tests covering clean rollback, partial-copy cleanup, and logged failure on rollback-failure. Flagged by the Qodo reviewer on MemPalace#935.
|
Pushed 659cb81 addressing both pieces of review feedback. @qodo-ai-reviewer — rollback hardening:
@mvalentsev — overlap notes:
Also split the parent #934 into six child issues (#1151–#1156; finding 6 withdrawn). Filed corresponding child references: #1151 / #1152 / #1153. Question for maintainers (@bensig): bensig asked on #934 that each finding get its own PR. This PR bundles three fixes under the "crash-safety" theme in three independent files. Happy to split into three per-child-issue PRs if preferred — say the word and I'll carve this up. |
shutil.move() can partially create palace_path before raising, which would trip a bare os.replace(stale_path, palace_path) rollback (dest exists). - Switch the primary swap to os.replace so same-filesystem moves stay atomic - Branch on errno.EXDEV before falling back to shutil.move, so real errors (permissions, EIO) surface instead of silently attempting a slow copy - Extract rollback into _restore_stale_palace which clears any partial destination and, if the restore itself fails, logs both stale_path and palace_path so the operator can recover by hand Adds three regression tests covering clean rollback, partial-copy cleanup, and logged failure on rollback-failure. Flagged by the Qodo reviewer on MemPalace#935.
Summary
Three data-integrity fixes honoring CLAUDE.md's design principle: "A crash mid-operation must leave the existing palace untouched."
Refs: #934 (parent) — specifically #1151, #1152, #1153.
What changed
mempalace/repair.py— restore from backup on rebuild failure (#1151)rebuild_indexcallsdelete_collectionthen upserts drawers in a loop. If any upsert fails partway (OOM, ChromaDB error), the rest of the drawers are gone. The backup copy exists but is never consulted.mempalace/migrate.py— atomic palace swap (#1152)shutil.rmtree(palace_path)followed byshutil.move(...)has a window where both copies are gone. Replaced with a rename-aside pattern. The second commit in this PR (659cb81) hardens the cross-device fallback per @qodo-ai-reviewer's feedback:os.replace(same-filesystem atomic)errno.EXDEVgates theshutil.movefallback so unrelatedOSErrors surface instead of getting masked_restore_stale_palaceclears any partial destination before the rollbackos.replace, and logs both paths if restore itself failsmempalace/cli.py— defensive batch extraction offset (#1153)cmd_repairextraction usedoffset += batch_size(fixed 5000) regardless of actual batch length. Changed tooffset += len(batch["ids"])with an empty-batch guard, matching the pattern atexporter.py:70.Test plan
ruff check mempalace/— passedruff format --check mempalace/— passedpytest tests/test_migrate.py— 5 passed (including 3 new regression tests covering the rollback hardening)Overlap with open PRs
Per @mvalentsev's review:
repair.pyandcli.pyhunks here become moot. Until then, they prevent data loss on the current code path.Split question
@bensig asked on #934 that each finding get its own PR. This bundles three under the "crash-safety" theme in three independent files. Happy to split into three per-issue PRs if preferred.