Skip to content

Commit 44fecd6

Browse files
committed
Indexing: Make replays crazy fast!
Added `name_folded` for O(log n) lookups. `resolve_component` was doing O(n) sibling scans after the v5 composite index removal. Add a pre-computed `name_folded` column (macOS only) storing `normalize_for_comparison(name)`, indexed as `(parent_id, name_folded)` with binary collation. On Linux/Windows, index `(parent_id, name)` directly since `platform_case` is binary there. - Schema bumped to v6 (forces DB rebuild, it's a cache) - `resolve_component` now does a direct B-tree lookup on both platforms - `insert_entry_v2` and `insert_entries_v2_batch` compute `name_folded` at insert time on macOS - Old `idx_parent(parent_id)` replaced by the composite indexes - `rename_entry` (test-only) also updates `name_folded` - Three new macOS-specific tests: case-insensitive resolution, `name_folded` population on single and batch insert
1 parent e8762be commit 44fecd6

2 files changed

Lines changed: 183 additions & 34 deletions

File tree

apps/desktop/src-tauri/src/indexing/CLAUDE.md

Lines changed: 13 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -12,11 +12,11 @@ Full design: `docs/specs/drive-indexing/plan.md`
1212
- **enrichment.rs** -- `ReadPool` (lock-free thread-local read connections for enrichment and verification), `enrich_entries_with_index()` (called every `get_file_range`). Integer-keyed fast path: resolve parent dir once → batch-fetch child dir stats by ID → match by name. Falls back to individual path resolution for edge cases.
1313
- **event_loop.rs** -- `run_live_event_loop` (real-time FSEvents/inotify processing after scan completes), `run_replay_event_loop` (cold-start journal replay with two-phase approach), `run_background_verification` (post-replay bidirectional readdir diff), `merge_fs_events` (deduplication with flag priority), `process_live_batch`. All bounded-buffer constants live here.
1414
- **events.rs** -- Tauri event payload structs (`IndexScanStartedEvent`, `IndexScanProgressEvent`, `IndexScanCompleteEvent`, `IndexDirUpdatedEvent`, `IndexReplayProgressEvent`), `RescanReason` enum, `emit_rescan_notification()`, IPC response types (`IndexStatusResponse`, `IndexDebugStatusResponse`).
15-
- **store.rs** -- SQLite schema v5 (integer-keyed entries, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_path`/`resolve_component` query all children by `parent_id` and match names in Rust via `platform_case_compare` (no SQL collation in the lookup). Schema version check: mismatch triggers drop+rebuild. Both path-keyed (backward compat) and integer-keyed APIs.
15+
- **store.rs** -- SQLite schema v6 (integer-keyed entries with `name_folded` column on macOS, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. Both path-keyed (backward compat) and integer-keyed APIs.
1616
- **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
17-
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` initializes a zero-valued `dir_stats` row when inserting a NEW directory, so enrichment always has a row (subsequent `PropagateDeltaById` calls update it incrementally). Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — `idx_parent` is a simple integer index that stays present during scans.
17+
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` initializes a zero-valued `dir_stats` row when inserting a NEW directory, so enrichment always has a row (subsequent `PropagateDeltaById` calls update it incrementally). Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — the composite indexes (`idx_parent_name_folded` on macOS, `idx_parent_name` on Linux) use binary collation and stay present during scans.
1818
- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters (macOS system paths, Linux virtual filesystems). Physical sizes (`st_blocks * 512`).
19-
- **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. (The former `RebuildingIndex` phase was removed when the composite `idx_parent_name` index was replaced with a simple `idx_parent` integer index.) `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
19+
- **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. (The former `RebuildingIndex` phase was removed when the composite `idx_parent_name` index with `platform_case` collation was replaced — now uses binary-collation composite indexes that don't need rebuilding.) `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
2020
- **watcher.rs** -- Drive-level filesystem watcher. macOS: FSEvents via `cmdr-fsevent-stream` with event IDs and `sinceWhen` replay. Linux: `notify` crate (inotify backend) with recursive watching and synthetic event counter. Other platforms: stub. `supports_event_replay()` lets callers branch on whether journal replay is available.
2121
- **reconciler.rs** -- Buffers FSEvents during scan (capped at 500K events; overflow sets `buffer_overflow` flag forcing full rescan), replays after scan completes using event IDs to skip stale events. Processes live events for file creates/removes/modifies using integer-keyed write messages (`UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`). Resolves filesystem paths to entry IDs via `store::resolve_path()` using a read connection passed by callers. Key functions (`process_fs_event`, `emit_dir_updated`) are `pub(super)` so `mod.rs` can call them directly during cold-start replay. `reconcile_subtree()` handles MustScanSubDirs by diffing filesystem vs DB directory-by-directory instead of delete-then-reinsert, making it safe to interrupt at any point.
2222
- **firmlinks.rs** -- Parses `/usr/share/firmlinks`, builds prefix map, normalizes paths. Converts `/System/Volumes/Data/Users/foo` to `/Users/foo`.
@@ -75,14 +75,17 @@ All writes go through a dedicated `std::thread` via a bounded `sync_channel` (20
7575

7676
Reads happen on separate WAL connections (any thread). A `ReadPool` provides thread-local read connections for enrichment and verification without contending on the `INDEXING` state-machine mutex.
7777

78-
### SQLite schema (v5: integer-keyed, simple parent index)
78+
### SQLite schema (v6: integer-keyed, platform-conditional composite index)
7979

8080
One DB per volume. **Dev and prod use separate directories** (see AGENTS.md § Debugging):
8181
- **Prod**: `~/Library/Application Support/com.veszelovszki.cmdr/index-{volume_id}.db`
8282
- **Dev**: `~/Library/Application Support/com.veszelovszki.cmdr-dev/index-{volume_id}.db`
8383

8484
Three tables:
85-
- `entries` (id INTEGER PK, parent_id, name COLLATE platform_case, is_directory, is_symlink, size, modified_at) with index `idx_parent(parent_id)`. Root sentinel: id=1, parent_id=0, name="".
85+
- `entries` (id INTEGER PK, parent_id, name COLLATE platform_case, [name_folded on macOS], is_directory, is_symlink, size, modified_at). Root sentinel: id=1, parent_id=0, name="".
86+
- **macOS**: has a `name_folded TEXT NOT NULL` column storing `normalize_for_comparison(name)` (NFD + case fold). Index: `idx_parent_name_folded ON entries (parent_id, name_folded)`.
87+
- **Linux/Windows**: no `name_folded` column. Index: `idx_parent_name ON entries (parent_id, name)`.
88+
- The old `idx_parent(parent_id)` from v5 is removed; the composite indexes replace it.
8689
- `dir_stats` (entry_id INTEGER PK, recursive_size, recursive_file_count, recursive_dir_count)
8790
- `meta` (key TEXT PK, value TEXT) WITHOUT ROWID
8891

@@ -91,7 +94,8 @@ WAL mode, 16 MB page cache, `auto_vacuum = INCREMENTAL` (free pages reclaimed vi
9194
History of changes:
9295
- **Schema v3**: Bumped from v2 to force DB rebuild after fixing orphan entry bug. Scanner, writer, aggregator, reconciler, enrichment, and IPC commands all fully migrated to integer keys. Enrichment uses integer-keyed fast path: resolve parent once → batch child dir stats by ID. Reconciler sends integer-keyed messages exclusively. Old path-keyed `WriteMessage` variants and backward-compat shims (`ScannedEntry`, `DirStats`) still exist for post-replay verification — cleanup in milestone 6.
9396
- **Schema v4**: Bumped from v3 to enable `auto_vacuum = INCREMENTAL` (requires DB rebuild since the pragma must be set before table creation).
94-
- **Schema v5**: Replaced composite `UNIQUE INDEX idx_parent_name(parent_id, name)` with simple `INDEX idx_parent(parent_id)`. The composite index with `platform_case` collation was extremely slow to build (~25 min for 5.1M entries). A simple integer index needs no drop/recreate dance during scans. The `platform_case` collation on the `name` column definition stays but `resolve_path`/`resolve_component` now query all children by `parent_id` and match names in Rust via `platform_case_compare`.
97+
- **Schema v5**: Replaced composite `UNIQUE INDEX idx_parent_name(parent_id, name)` with simple `INDEX idx_parent(parent_id)`. The composite index with `platform_case` collation was extremely slow to build (~25 min for 5.1M entries). A simple integer index needs no drop/recreate dance during scans.
98+
- **Schema v6**: Added `name_folded` column (macOS only) storing pre-computed `normalize_for_comparison(name)`. Replaced `idx_parent` with platform-conditional composite indexes: `idx_parent_name_folded(parent_id, name_folded)` on macOS, `idx_parent_name(parent_id, name)` on Linux/Windows. `resolve_component` now queries the index directly instead of fetching all children and matching in Rust.
9599

96100
## How to test
97101

@@ -125,6 +129,8 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
125129

126130
**In-memory accumulation eliminates aggregation SQL queries**: During a full scan, the writer thread accumulates two HashMaps in `AccumulatorMaps` as `InsertEntriesV2` batches arrive: `direct_stats` (parent_id -> file size/count/dir count) and `child_dirs` (parent_id -> child dir IDs). When `ComputeAllAggregates` fires, these maps are passed to `compute_all_aggregates_with_maps()`, skipping the two expensive full-table-scan SQL queries (`bulk_get_children_stats_by_id` and `bulk_get_child_dir_ids`) that previously dominated aggregation time (~70%). Maps are cleared on `TruncateData` and after aggregation completes. Falls back to SQL queries if maps are empty.
127131

132+
**Pre-computed `name_folded` instead of SQL collation in the index (macOS)**: The old composite index `idx_parent_name(parent_id, name)` with `platform_case` collation took ~25 min to build for 5.1M entries because every B-tree comparison invoked NFD + case fold. The v5 workaround (simple `idx_parent` + match in Rust) required fetching all children per parent. `name_folded` stores the pre-computed `normalize_for_comparison(name)` at insert time, so the composite index uses binary collation and builds in seconds. `resolve_component` gets O(log n) lookups via a single indexed query.
133+
128134
**Subtree aggregation uses scoped queries**: `scoped_get_children_stats_by_id` and `scoped_get_child_dir_ids` in `aggregator.rs` use recursive CTEs scoped to the target subtree, not full-table scans. This keeps subtree aggregation O(subtree_size) regardless of total DB size.
129135

130136
**Bounded buffers prevent OOM**: All buffers have capacity limits. FSEvents channel: 32K batches (bounded `try_send` in cmdr-fsevent-stream; overflow sets atomic flag, triggers rescan). Reconciler buffer: 500K events (overflow triggers full rescan). Writer channel: 20K messages (bounded `sync_channel`, backpressure). Replay `affected_paths`: 50K entries (overflow emits full refresh). Replay `pending_rescans`: 1K entries (overflow triggers full rescan). Replay event count: 1M events max (overflow falls back to full scan). Memory watchdog: warns at 8 GB, stops indexing at 16 GB. The index is a disposable cache, so dropping events and rescanning is always safe.
@@ -175,6 +181,6 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
175181

176182
**Reconciler must delete old subtree on dir-to-file type changes**: When `reconcile_subtree` matches a filesystem entry to a DB entry by name, it must check if `is_directory` changed. If a directory became a file, `DeleteSubtreeById` must be sent before `UpsertEntryV2`. Without this, `INSERT OR REPLACE` keeps the same row ID (same `parent_id + name`), and the old directory's children become logical orphans — entries parented by a file.
177183

178-
**Scanner's `insert_entries_v2_batch` uses plain `INSERT`**: With the old `idx_parent_name` unique index, `INSERT OR REPLACE` would silently delete the old row and insert a new one with a new ID, orphaning all children. That unique index is gone (replaced by `idx_parent`), and the only unique constraint is the integer PK (`id`). Since `ScanContext` assigns unique IDs and the table is truncated before full scans (or descendants deleted before subtree scans), PK conflicts shouldn't occur. The batch insert uses plain `INSERT` to reflect this.
184+
**Scanner's `insert_entries_v2_batch` uses plain `INSERT`**: With the old `idx_parent_name` unique index, `INSERT OR REPLACE` would silently delete the old row and insert a new one with a new ID, orphaning all children. That unique index is gone (replaced by `idx_parent_name_folded` on macOS / `idx_parent_name` on Linux), and the only unique constraint is the integer PK (`id`). Since `ScanContext` assigns unique IDs and the table is truncated before full scans (or descendants deleted before subtree scans), PK conflicts shouldn't occur. The batch insert uses plain `INSERT` to reflect this.
179185

180186
**IndexWriter exposes `db_path()`**: The scanner needs the DB path to open a temporary connection for `ScanContext::new()`. This path is stored on the `IndexWriter` handle and accessible via `db_path()`. The temporary connection is short-lived (only used to read `MAX(id)`).

0 commit comments

Comments
 (0)