Skip to content

Commit dea0742

Browse files
committed
Bugfix: index rename no longer fails when the destination name is already taken
`MoveEntryV2` used to run a bare UPDATE that hit the UNIQUE `(parent_id, name_folded)` constraint whenever a different row already occupied the destination (rename-with-overwrite, or a concurrent upsert racing ahead of the move message). The moved entry then stayed stuck at its old location, the same id failed repeatedly with `MoveEntryV2 update failed ... UNIQUE constraint failed` warns, and only per-navigation verification eventually healed the drift. - `handle_move_entry_v2` now resolves the destination first and, when a different entry owns it, deletes that row (subtree-aware, with full delta propagation via the existing `handle_delete_entry_by_id` / `handle_delete_subtree_by_id`) before moving. The on-disk truth after a rename is that the moved entry owns the destination name, so this preserves the moved entry's `id` and `dir_stats` instead of waiting for the verifier. - TDD'd: `move_entry_v2_destination_collision_replaces_conflicting_file` and `..._replaces_conflicting_dir_subtree` reproduce the collision and pin the replace semantics. - Also demoted the `stall_probe::sqlite_busy` busy-handler logging: per-attempt lines go to debug (brief contention from WAL checkpoints and long-lived readers is routine and was the noisiest warn in a normal dev session); attempt >= 20 (>100ms lock wait, a genuine stall signal) stays warn.
1 parent a412e59 commit dea0742

2 files changed

Lines changed: 227 additions & 4 deletions

File tree

apps/desktop/src-tauri/src/indexing/CLAUDE.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ Design history is in git (former `docs/specs/drive-indexing/`).
2020
- **store.rs** -- SQLite schema (integer-keyed entries with `name_folded` column on all platforms, `inode` column for hardlink dedup, `dir_stats` by entry_id, `meta`), `platform_case` collation, read queries, DB open/migrate. `resolve_component` always queries by `(parent_id, name_folded)` using the `idx_parent_name_folded` composite **UNIQUE** index. On Linux/Windows, `normalize_for_comparison()` is the identity function, so `name_folded = name` and the index behaves identically to a `(parent_id, name)` index. Schema version check: mismatch triggers drop+rebuild. `has_sized_entry_for_inode()` checks if another entry with the same inode already has non-NULL sizes; `find_entry_by_inode()` returns the first row with a given inode (used by the live event loop's rename pre-pass). Both path-keyed (backward compat) and integer-keyed APIs.
2121
- **metadata.rs** -- `MetadataSnapshot` struct and `extract_metadata()` function. Single location for all platform-specific metadata extraction (logical/physical size, mtime, inode, nlink). Used by scanner, reconciler, verifier, and event_loop. Symlinks get `None` everywhere. Files get sizes + inode + nlink. Directories get inode but no sizes/nlink. The inode is what the live event loop's rename pre-pass matches against to detect dir renames in place.
2222
- **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
23-
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. `MoveEntryV2 { entry_id, new_parent_id, new_name }` updates an entry's `(parent_id, name, name_folded)` in place, preserving its `id` and (for directories) `dir_stats`. Same-parent renames don't change ancestor totals; cross-parent moves subtract the entry's contribution from the old ancestor chain and add it to the new one (and recompute the OR-aggregated `recursive_has_symlinks` flag on both chains). The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. On `ComputePartialAggregates { hot_paths }` (mid-scan), `handle_compute_partial_aggregates` borrows the same maps **read-only** (no clear, no mutation, no generation bump), no-ops on empty maps with no SQL fallback, delegates the math to `aggregator::compute_partial_aggregates`, writes a depth-capped (`PARTIAL_AGG_MAX_DEPTH = 3`) subset of `dir_stats` rows, and emits `index-dir-updated { paths: ["/"] }` when an `AppHandle` is present. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). `IndexWriter::try_send` is a non-blocking send (`Ok(true)` enqueued / `Ok(false)` channel full, dropped / `Err` writer gone) with `queue_depth()` accessor over the channel-depth atomic; the bump/undo accounting lives in the extracted `try_send_with_depth` free function (undoes the bump on both `Full` and `Disconnected` so the depth never drifts). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance. The `idx_parent_name_folded` composite index uses binary collation and stays present during scans.
23+
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. `MoveEntryV2 { entry_id, new_parent_id, new_name }` updates an entry's `(parent_id, name, name_folded)` in place, preserving its `id` and (for directories) `dir_stats`. If a different entry already occupies the destination `(parent_id, name_folded)` (rename-with-overwrite, or a concurrent upsert racing ahead of the move message), the handler deletes the conflicting row first (subtree-aware, with delta propagation) so the move never fails the UNIQUE constraint; the on-disk truth after a rename is that the moved entry owns the destination name. Same-parent renames don't change ancestor totals; cross-parent moves subtract the entry's contribution from the old ancestor chain and add it to the new one (and recompute the OR-aggregated `recursive_has_symlinks` flag on both chains). The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. On `ComputePartialAggregates { hot_paths }` (mid-scan), `handle_compute_partial_aggregates` borrows the same maps **read-only** (no clear, no mutation, no generation bump), no-ops on empty maps with no SQL fallback, delegates the math to `aggregator::compute_partial_aggregates`, writes a depth-capped (`PARTIAL_AGG_MAX_DEPTH = 3`) subset of `dir_stats` rows, and emits `index-dir-updated { paths: ["/"] }` when an `AppHandle` is present. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). `IndexWriter::try_send` is a non-blocking send (`Ok(true)` enqueued / `Ok(false)` channel full, dropped / `Err` writer gone) with `queue_depth()` accessor over the channel-depth atomic; the bump/undo accounting lives in the extracted `try_send_with_depth` free function (undoes the bump on both `Full` and `Disconnected` so the depth never drifts). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance. The `idx_parent_name_folded` composite index uses binary collation and stays present during scans.
2424
- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs, with IDs allocated from the shared `Arc<AtomicI64>` counter owned by `IndexWriter`. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`), the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). E2E scan restriction: when `CMDR_E2E_START_PATH` is set, `should_exclude` restricts scanning to only the fixture path, its children, and ancestors. Everything else is excluded (critical for Docker E2E performance). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`). Hardlink inode dedup: files with `nlink > 1` are tracked in a `HashSet<u64>` by inode; only the first link's size is counted, subsequent links get `size = None`. Files with `nlink == 1` (vast majority) skip the set entirely. All files store `inode` in `EntryRow.inode` (from `MetadataExt::ino()` on Unix, `None` on non-Unix). Directories and symlinks get `inode: None`.
2525
- **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. The composite indexes use binary collation so there's no per-scan index rebuild phase. `compute_partial_aggregates` is the mid-scan variant: it derives the dir list and parent relations from the borrowed accumulator maps (no SQL `load_all_directory_ids` scan), computes each dir's depth from the scan root via a memoized walk (`depth(ROOT_ID) = 0` is the explicit base case; unreachable dirs get `usize::MAX` so the depth cap never writes them), reuses the same `topological_sort_bottom_up` + `compute_bottom_up` over **all** dirs, and writes only dirs at `depth ≤ max_depth` plus each resolvable hot-path dir and its direct children. `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
2626
- **watcher.rs** -- Drive-level filesystem watcher. macOS: FSEvents via `cmdr-fsevent-stream` with event IDs and `sinceWhen` replay. Linux: `notify` crate (inotify backend) with recursive watching and synthetic event counter. Other platforms: stub. `supports_event_replay()` lets callers branch on whether journal replay is available.
@@ -257,6 +257,10 @@ for unknown paths in Xs [total], sample: <path>`. Error reports retain the
257257
existence-of-drift signal. Live in `reconciler.rs::unknown_path_skips`. To get the
258258
per-event detail back: `RUST_LOG=cmdr_lib::indexing::reconciler=trace,cmdr_lib::indexing::writer=trace,debug`.
259259

260+
The writer's SQLite busy retry logger (`stall_probe::sqlite_busy` in `writer.rs::spawn`) logs
261+
per-attempt at DEBUG: brief contention is routine (WAL checkpoints, long-lived readers). It
262+
escalates to WARN at attempt >= 20 (>100ms of lock wait), which signals a genuine stall.
263+
260264
## Gotchas
261265

262266
**INSERT OR REPLACE on a populated DB is catastrophically slow**: The `platform_case` collation (NFD + case fold on macOS) runs for every B-tree comparison during unique index lookups. On an empty DB a full scan takes ~2.5 min; on a populated DB with 5.5M entries the same scan takes ~30 min because each `INSERT OR REPLACE` triggers ~20 collation calls to traverse the B-tree. `start_scan()` truncates `entries` and `dir_stats` via `TruncateData` + `flush_blocking()` before every scan to avoid this. Additionally, without truncation, old rows accumulate as orphaned subtrees (3-4x DB bloat per scan cycle) because `INSERT OR REPLACE` only deduplicates at the root level.

0 commit comments

Comments
 (0)