Indexing: reserve the UNIQUE-conflict WARN for the case that's actually actionable
The index scan logged a WARN for every batch that skipped a row on the `(parent_id, name_folded)` UNIQUE constraint, even for a single skip. But a few skips per scan are expected dedup (one dir reachable by two walk paths via a firmlink/symlink, or case/NFD sibling pairs on case-sensitive or cross-OS-synced trees), not anything to act on. That turned the WARN into noise and trained the eye to ignore it, defeating the point of WARN.
Recalibrate so WARN means "do something":
- Per-batch skips drop to DEBUG (keeping the 3-row sample for diagnosis under `RUST_LOG=cmdr_lib::indexing::writer=debug`).
- `AccumulatorMaps` gains an `entries_skipped` tally; `handle_compute_all_aggregates` summarizes it once per scan via the new pure `classify_skip_severity`: silent when nothing skipped, DEBUG for sparse dedup, and WARN only when the skip ratio looks like two writers racing on one DB (≥50 skips AND >1% of the scan's rows). That racing case is the constraint's whole reason for being (a 1.83 TB ghost size was traced to it), and it's the one a reader should investigate.
The summary uses a ratio rather than a per-batch absolute count because the racing signature is a large *fraction* of rows skipped, sustained across the scan, whereas a giant directory of genuine collisions could falsely trip an absolute per-batch threshold. The absolute floor keeps a tiny tree with a couple of sibling collisions from warning.
Normal scans now log nothing here.
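The classification described above can be sketched as a small pure function. This is a minimal sketch, not the code from the commit: the `SkipSeverity` enum, the parameter names, and the choice of `entries_inserted + entries_skipped` as the denominator for "the scan's rows" are assumptions made for illustration. Only the thresholds (at least 50 skips and more than 1% of rows) come from the message.

```rust
/// Hypothetical severity levels for the scan-wide skip summary.
#[derive(Debug, PartialEq, Eq)]
enum SkipSeverity {
    Silent,
    Debug,
    Warn,
}

/// Absolute floor from the commit message: fewer skips than this never warn.
const MIN_SKIPS_FOR_WARN: u64 = 50;

/// Pure classifier: no logging, no I/O, so it is trivial to unit test.
/// Assumption: the scan's row count is inserted + skipped.
fn classify_skip_severity(entries_inserted: u64, entries_skipped: u64) -> SkipSeverity {
    if entries_skipped == 0 {
        return SkipSeverity::Silent;
    }
    let total = entries_inserted.saturating_add(entries_skipped);
    // "More than 1% of rows" in integer math: skipped * 100 > total.
    let above_ratio = entries_skipped.saturating_mul(100) > total;
    if entries_skipped >= MIN_SKIPS_FOR_WARN && above_ratio {
        SkipSeverity::Warn
    } else {
        SkipSeverity::Debug
    }
}
```

Under these assumptions, a million-row scan with 60 skips stays at DEBUG (the ratio is far below 1%), while a thousand-row scan with the same 60 skips warns. Keeping the function free of side effects lets the caller decide how to log each level.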
apps/desktop/src-tauri/src/indexing/DETAILS.md: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ The key UX win: showing directory sizes in file listings. Design history is in g
- **store.rs** -- SQLite schema (integer-keyed entries with `name_folded` column on all platforms, `inode` column for hardlink dedup, `dir_stats` by entry_id, `meta`), `platform_case` collation, read queries, DB open/migrate. `resolve_component` always queries by `(parent_id, name_folded)` using the `idx_parent_name_folded` composite **UNIQUE** index. On Linux/Windows, `normalize_for_comparison()` is the identity function, so `name_folded = name` and the index behaves identically to a `(parent_id, name)` index. Schema version check: mismatch triggers drop+rebuild. `has_sized_entry_for_inode()` checks if another entry with the same inode already has non-NULL sizes; `find_entry_by_inode()` returns the first row with a given inode (used by the live event loop's rename pre-pass). Both path-keyed (backward compat) and integer-keyed APIs.
- **metadata.rs** -- `MetadataSnapshot` struct and `extract_metadata()` function. Single location for all platform-specific metadata extraction (logical/physical size, mtime, inode, nlink). Used by scanner, reconciler, verifier, and event_loop. Symlinks get `None` everywhere. Files get sizes + inode + nlink. Directories get inode but no sizes/nlink. The inode is what the live event loop's rename pre-pass matches against to detect dir renames in place.
- **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
-
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. `MoveEntryV2 { entry_id, new_parent_id, new_name }` updates an entry's `(parent_id, name, name_folded)` in place, preserving its `id` and (for directories) `dir_stats`. If a different entry already occupies the destination `(parent_id, name_folded)` (rename-with-overwrite, or a concurrent upsert racing ahead of the move message), the handler deletes the conflicting row first (subtree-aware, with delta propagation) so the move never fails the UNIQUE constraint; the on-disk truth after a rename is that the moved entry owns the destination name. Same-parent renames don't change ancestor totals; cross-parent moves subtract the entry's contribution from the old ancestor chain and add it to the new one (and recompute the OR-aggregated `recursive_has_symlinks` flag on both chains). The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. 
This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. On `ComputePartialAggregates { hot_paths }` (mid-scan), `handle_compute_partial_aggregates` borrows the same maps **read-only** (no clear, no mutation, no generation bump), no-ops on empty maps with no SQL fallback, delegates the math to `aggregator::compute_partial_aggregates`, writes a depth-capped (`PARTIAL_AGG_MAX_DEPTH = 3`) subset of `dir_stats` rows, and emits `index-dir-updated { paths: ["/"] }` when an `AppHandle` is present. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). `IndexWriter::try_send` is a non-blocking send (`Ok(true)` enqueued / `Ok(false)` channel full, dropped / `Err` writer gone) with `queue_depth()` accessor over the channel-depth atomic; the bump/undo accounting lives in the extracted `try_send_with_depth` free function (undoes the bump on both `Full` and `Disconnected` so the depth never drifts). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance. The `idx_parent_name_folded` composite index uses binary collation and stays present during scans.
+
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. `MoveEntryV2 { entry_id, new_parent_id, new_name }` updates an entry's `(parent_id, name, name_folded)` in place, preserving its `id` and (for directories) `dir_stats`. If a different entry already occupies the destination `(parent_id, name_folded)` (rename-with-overwrite, or a concurrent upsert racing ahead of the move message), the handler deletes the conflicting row first (subtree-aware, with delta propagation) so the move never fails the UNIQUE constraint; the on-disk truth after a rename is that the moved entry owns the destination name. Same-parent renames don't change ancestor totals; cross-parent moves subtract the entry's contribution from the old ancestor chain and add it to the new one (and recompute the OR-aggregated `recursive_has_symlinks` flag on both chains). The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. 
This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + `entries_inserted` and `entries_skipped` counters), cleared on `TruncateData`. A per-batch `INSERT OR IGNORE` UNIQUE-conflict skip is logged at DEBUG only (with a 3-row sample) and tallied into `entries_skipped`; `handle_compute_all_aggregates` summarizes the scan-wide tally once via `classify_skip_severity` (none → silent, sparse dedup → DEBUG, racing-writer ratio (≥50 skips and >1% of rows) → WARN), so normal scans log nothing and only the actionable double-write case warns. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. On `ComputePartialAggregates { hot_paths }` (mid-scan), `handle_compute_partial_aggregates` borrows the same maps **read-only** (no clear, no mutation, no generation bump), no-ops on empty maps with no SQL fallback, delegates the math to `aggregator::compute_partial_aggregates`, writes a depth-capped (`PARTIAL_AGG_MAX_DEPTH = 3`) subset of `dir_stats` rows, and emits `index-dir-updated { paths: ["/"] }` when an `AppHandle` is present. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). `IndexWriter::try_send` is a non-blocking send (`Ok(true)` enqueued / `Ok(false)` channel full, dropped / `Err` writer gone) with `queue_depth()` accessor over the channel-depth atomic; the bump/undo accounting lives in the extracted `try_send_with_depth` free function (undoes the bump on both `Full` and `Disconnected` so the depth never drifts). 
Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance. The `idx_parent_name_folded` composite index uses binary collation and stays present during scans.
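The bump/undo accounting attributed to `try_send_with_depth` above could look roughly like the following. This is a sketch under assumptions, not the project's code: the signature, the `WriterGone` error type, and the use of `AtomicUsize` for the depth counter are hypothetical. Only the three outcomes and the rule that the bump is undone on both `Full` and `Disconnected` come from the text. The channel calls are from the standard library (`std::sync::mpsc`).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical error for "the writer thread is gone".
#[derive(Debug, PartialEq, Eq)]
struct WriterGone;

/// Non-blocking send that keeps a queue-depth counter in step with the channel.
/// Ok(true) = enqueued, Ok(false) = channel full (message dropped), Err = writer gone.
fn try_send_with_depth<T>(
    tx: &SyncSender<T>,
    depth: &AtomicUsize,
    msg: T,
) -> Result<bool, WriterGone> {
    // Bump first so the counter never under-reports a message in flight.
    depth.fetch_add(1, Ordering::Relaxed);
    match tx.try_send(msg) {
        Ok(()) => Ok(true),
        Err(TrySendError::Full(_)) => {
            // Nothing was enqueued: undo the bump so the depth does not drift.
            depth.fetch_sub(1, Ordering::Relaxed);
            Ok(false)
        }
        Err(TrySendError::Disconnected(_)) => {
            // Same undo on the disconnected path.
            depth.fetch_sub(1, Ordering::Relaxed);
            Err(WriterGone)
        }
    }
}
```

In this sketch the receiving side would be responsible for decrementing the counter when it dequeues a message. Undoing the bump on both failure paths is what keeps the counter equal to the number of messages actually queued.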
- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs, with IDs allocated from the shared `Arc<AtomicI64>` counter owned by `IndexWriter`. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`), the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). E2E scan restriction: when `CMDR_E2E_START_PATH` is set, `should_exclude` restricts scanning to only the fixture path, its children, and ancestors. Everything else is excluded (critical for Docker E2E performance). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`). Hardlink inode dedup: files with `nlink > 1` are tracked in a `HashSet<u64>` by inode; only the first link's size is counted, subsequent links get `size = None`. Files with `nlink == 1` (vast majority) skip the set entirely. All files store `inode` in `EntryRow.inode` (from `MetadataExt::ino()` on Unix, `None` on non-Unix). Directories and symlinks get `inode: None`.
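The hardlink inode dedup described for the scanner can be illustrated with a short sketch. The function name `counted_size` and its signature are hypothetical. The behavior it models is from the text: files with `nlink == 1` bypass the set, the first link seen for an inode keeps its size, and later links get `None`.

```rust
use std::collections::HashSet;

/// Returns the size to attribute to this file, or None if another hardlink
/// to the same inode has already been counted.
fn counted_size(seen_inodes: &mut HashSet<u64>, inode: u64, nlink: u64, size: u64) -> Option<u64> {
    // Fast path: a file with a single link can never be double counted,
    // so it never touches the set.
    if nlink <= 1 {
        return Some(size);
    }
    // HashSet::insert returns true only the first time the inode is seen.
    if seen_inodes.insert(inode) {
        Some(size)
    } else {
        None
    }
}
```

Skipping the set for single-link files keeps memory proportional to the number of hardlinked inodes rather than to the total file count.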
- **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. The composite indexes use binary collation so there's no per-scan index rebuild phase. `compute_partial_aggregates` is the mid-scan variant: it derives the dir list and parent relations from the borrowed accumulator maps (no SQL `load_all_directory_ids` scan), computes each dir's depth from the scan root via a memoized walk (`depth(ROOT_ID) = 0` is the explicit base case; unreachable dirs get `usize::MAX` so the depth cap never writes them), reuses the same `topological_sort_bottom_up` + `compute_bottom_up` over **all** dirs, and writes only dirs at `depth ≤ max_depth` plus each resolvable hot-path dir and its direct children. `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
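The memoized depth walk mentioned for `compute_partial_aggregates` might be sketched as follows. This is an illustration under assumptions: the `depth_of` name, the `parent_of` map shape, and the cycle guard are hypothetical. The base case `depth(ROOT_ID) = 0`, the value `ROOT_ID = 1`, and `usize::MAX` for unreachable dirs come from the text.

```rust
use std::collections::HashMap;

const ROOT_ID: i64 = 1;

/// Depth of `id` below the scan root, memoized across calls.
/// `parent_of` maps a directory id to its parent directory id.
fn depth_of(id: i64, parent_of: &HashMap<i64, i64>, memo: &mut HashMap<i64, usize>) -> usize {
    let mut chain: Vec<i64> = Vec::new();
    let mut cur = id;
    // Walk up until we hit the root, an already-memoized dir, or a dead end.
    let base = loop {
        if cur == ROOT_ID {
            break 0;
        }
        if let Some(&d) = memo.get(&cur) {
            break d;
        }
        chain.push(cur);
        match parent_of.get(&cur) {
            // The length guard stops a (hypothetical) parent cycle.
            Some(&p) if chain.len() <= parent_of.len() => cur = p,
            // No parent recorded: everything on this chain is unreachable.
            _ => break usize::MAX,
        }
    };
    // Unwind from the dir nearest the base; saturating_add keeps MAX at MAX.
    let mut depth = base;
    for &dir in chain.iter().rev() {
        depth = depth.saturating_add(1);
        memo.insert(dir, depth);
    }
    depth
}
```

Because unreachable dirs resolve to `usize::MAX`, a filter such as `depth <= max_depth` excludes them without a separate reachability check. The walk is iterative here so that very deep trees do not risk stack overflow; a recursive version would be equivalent.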
- **watcher.rs** -- Drive-level filesystem watcher. macOS: FSEvents via `cmdr-fsevent-stream` with event IDs and `sinceWhen` replay. Linux: `notify` crate (inotify backend) with recursive watching and synthetic event counter. Other platforms: stub. `supports_event_replay()` lets callers branch on whether journal replay is available.