Added `name_folded` for O(log n) lookups.
`resolve_component` was doing O(n) sibling scans after the v5 composite index removal. Add a pre-computed `name_folded` column (macOS only) storing `normalize_for_comparison(name)`, indexed as `(parent_id, name_folded)` with binary collation. On Linux/Windows, index `(parent_id, name)` directly since `platform_case` is binary there.
- Schema bumped to v6 (forces a DB rebuild, which is safe because the DB is a cache)
- `resolve_component` now does a direct B-tree lookup on both platforms
- `insert_entry_v2` and `insert_entries_v2_batch` compute `name_folded` at insert time on macOS
- Old `idx_parent(parent_id)` replaced by the composite indexes
- `rename_entry` (test-only) also updates `name_folded`
- Three new macOS-specific tests: case-insensitive resolution, `name_folded` population on single and batch insert
apps/desktop/src-tauri/src/indexing/CLAUDE.md (13 additions, 7 deletions)
@@ -12,11 +12,11 @@ Full design: `docs/specs/drive-indexing/plan.md`
-**enrichment.rs** -- `ReadPool` (lock-free thread-local read connections for enrichment and verification), `enrich_entries_with_index()` (called every `get_file_range`). Integer-keyed fast path: resolve parent dir once → batch-fetch child dir stats by ID → match by name. Falls back to individual path resolution for edge cases.
-**event_loop.rs** -- `run_live_event_loop` (real-time FSEvents/inotify processing after scan completes), `run_replay_event_loop` (cold-start journal replay with two-phase approach), `run_background_verification` (post-replay bidirectional readdir diff), `merge_fs_events` (deduplication with flag priority), `process_live_batch`. All bounded-buffer constants live here.
-
-**store.rs** -- SQLite schema v5 (integer-keyed entries, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_path`/`resolve_component` query all children by `parent_id` and match names in Rust via `platform_case_compare` (no SQL collation in the lookup). Schema version check: mismatch triggers drop+rebuild. Both path-keyed (backward compat) and integer-keyed APIs.
+
-**store.rs** -- SQLite schema v6 (integer-keyed entries with `name_folded` column on macOS, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS it queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. Both path-keyed (backward compat) and integer-keyed APIs.
-**memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
-
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` initializes a zero-valued `dir_stats` row when inserting a NEW directory, so enrichment always has a row (subsequent `PropagateDeltaById` calls update it incrementally). Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — `idx_parent` is a simple integer index that stays present during scans.
+
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` initializes a zero-valued `dir_stats` row when inserting a NEW directory, so enrichment always has a row (subsequent `PropagateDeltaById` calls update it incrementally). Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — the composite indexes (`idx_parent_name_folded` on macOS, `idx_parent_name` on Linux) use binary collation and stay present during scans.
-**scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters (macOS system paths, Linux virtual filesystems). Physical sizes (`st_blocks * 512`).
-
- **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. (The former `RebuildingIndex` phase was removed when the composite `idx_parent_name` index was replaced with a simple `idx_parent` integer index.) `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
+
- **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. (The former `RebuildingIndex` phase was removed when the composite `idx_parent_name` index with `platform_case` collation was replaced — now uses binary-collation composite indexes that don't need rebuilding.) `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
-**watcher.rs** -- Drive-level filesystem watcher. macOS: FSEvents via `cmdr-fsevent-stream` with event IDs and `sinceWhen` replay. Linux: `notify` crate (inotify backend) with recursive watching and synthetic event counter. Other platforms: stub. `supports_event_replay()` lets callers branch on whether journal replay is available.
-**reconciler.rs** -- Buffers FSEvents during scan (capped at 500K events; overflow sets `buffer_overflow` flag forcing full rescan), replays after scan completes using event IDs to skip stale events. Processes live events for file creates/removes/modifies using integer-keyed write messages (`UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`). Resolves filesystem paths to entry IDs via `store::resolve_path()` using a read connection passed by callers. Key functions (`process_fs_event`, `emit_dir_updated`) are `pub(super)` so `mod.rs` can call them directly during cold-start replay. `reconcile_subtree()` handles MustScanSubDirs by diffing filesystem vs DB directory-by-directory instead of delete-then-reinsert, making it safe to interrupt at any point.
@@ -75,14 +75,17 @@ All writes go through a dedicated `std::thread` via a bounded `sync_channel` (20
Reads happen on separate WAL connections (any thread). A `ReadPool` provides thread-local read connections for enrichment and verification without contending on the `INDEXING` state-machine mutex.
-
-`entries` (id INTEGER PK, parent_id, name COLLATE platform_case, is_directory, is_symlink, size, modified_at) with index `idx_parent(parent_id)`. Root sentinel: id=1, parent_id=0, name="".
+
-`entries` (id INTEGER PK, parent_id, name COLLATE platform_case, [name_folded on macOS], is_directory, is_symlink, size, modified_at). Root sentinel: id=1, parent_id=0, name="".
+
-**macOS**: has a `name_folded TEXT NOT NULL` column storing `normalize_for_comparison(name)` (NFD + case fold). Index: `idx_parent_name_folded ON entries (parent_id, name_folded)`.
+
-**Linux/Windows**: no `name_folded` column. Index: `idx_parent_name ON entries (parent_id, name)`.
+
- The old `idx_parent(parent_id)` from v5 is removed; the composite indexes replace it.
-**Schema v3**: Bumped from v2 to force DB rebuild after fixing orphan entry bug. Scanner, writer, aggregator, reconciler, enrichment, and IPC commands all fully migrated to integer keys. Enrichment uses integer-keyed fast path: resolve parent once → batch child dir stats by ID. Reconciler sends integer-keyed messages exclusively. Old path-keyed `WriteMessage` variants and backward-compat shims (`ScannedEntry`, `DirStats`) still exist for post-replay verification — cleanup in milestone 6.
-**Schema v4**: Bumped from v3 to enable `auto_vacuum = INCREMENTAL` (requires DB rebuild since the pragma must be set before table creation).
-
-**Schema v5**: Replaced composite `UNIQUE INDEX idx_parent_name(parent_id, name)` with simple `INDEX idx_parent(parent_id)`. The composite index with `platform_case` collation was extremely slow to build (~25 min for 5.1M entries). A simple integer index needs no drop/recreate dance during scans. The `platform_case` collation on the `name` column definition stays but `resolve_path`/`resolve_component` now query all children by `parent_id` and match names in Rust via `platform_case_compare`.
+
-**Schema v5**: Replaced composite `UNIQUE INDEX idx_parent_name(parent_id, name)` with simple `INDEX idx_parent(parent_id)`. The composite index with `platform_case` collation was extremely slow to build (~25 min for 5.1M entries). A simple integer index needs no drop/recreate dance during scans.
+
-**Schema v6**: Added `name_folded` column (macOS only) storing pre-computed `normalize_for_comparison(name)`. Replaced `idx_parent` with platform-conditional composite indexes: `idx_parent_name_folded(parent_id, name_folded)` on macOS, `idx_parent_name(parent_id, name)` on Linux/Windows. `resolve_component` now queries the index directly instead of fetching all children and matching in Rust.
## How to test
@@ -125,6 +129,8 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
**In-memory accumulation eliminates aggregation SQL queries**: During a full scan, the writer thread accumulates two HashMaps in `AccumulatorMaps` as `InsertEntriesV2` batches arrive: `direct_stats` (parent_id -> file size/count/dir count) and `child_dirs` (parent_id -> child dir IDs). When `ComputeAllAggregates` fires, these maps are passed to `compute_all_aggregates_with_maps()`, skipping the two expensive full-table-scan SQL queries (`bulk_get_children_stats_by_id` and `bulk_get_child_dir_ids`) that previously dominated aggregation time (~70%). Maps are cleared on `TruncateData` and after aggregation completes. Falls back to SQL queries if maps are empty.
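The accumulate-then-aggregate flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the `Stats` field names, the `record` helper, and the `aggregate` function are hypothetical, and whether recursive directory counts include nested directories is also an assumption. Only the two map names and their `parent_id` keying come from the text.

```rust
use std::collections::HashMap;

/// Hypothetical per-directory counters (field names are assumptions).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Stats {
    size: u64,
    files: u64,
    dirs: u64,
}

/// Sketch of the two maps the writer accumulates while batches arrive.
#[derive(Default)]
struct AccumulatorMaps {
    /// parent_id -> stats of the parent's direct children only.
    direct_stats: HashMap<i64, Stats>,
    /// parent_id -> IDs of the parent's direct child directories.
    child_dirs: HashMap<i64, Vec<i64>>,
}

impl AccumulatorMaps {
    /// Record one inserted entry (hypothetical helper).
    fn record(&mut self, id: i64, parent_id: i64, is_directory: bool, size: u64) {
        let stats = self.direct_stats.entry(parent_id).or_default();
        if is_directory {
            stats.dirs += 1;
            self.child_dirs.entry(parent_id).or_default().push(id);
        } else {
            stats.files += 1;
            stats.size += size;
        }
    }
}

/// Bottom-up totals computed from the maps alone, with no SQL queries.
fn aggregate(maps: &AccumulatorMaps, root: i64) -> HashMap<i64, Stats> {
    // Pre-order walk: every directory is listed before its descendants.
    let mut order = Vec::new();
    let mut stack = vec![root];
    while let Some(dir) = stack.pop() {
        order.push(dir);
        if let Some(children) = maps.child_dirs.get(&dir) {
            stack.extend(children.iter().copied());
        }
    }

    // Reversed pre-order visits children before parents, so each child's
    // total is already available when its parent is processed.
    let mut totals: HashMap<i64, Stats> = HashMap::new();
    for &dir in order.iter().rev() {
        let mut total = maps.direct_stats.get(&dir).copied().unwrap_or_default();
        if let Some(children) = maps.child_dirs.get(&dir) {
            for child in children {
                let sub = totals[child];
                total.size += sub.size;
                total.files += sub.files;
                total.dirs += sub.dirs;
            }
        }
        totals.insert(dir, total);
    }
    totals
}
```

Because the maps are filled as a side effect of the inserts that happen anyway, the aggregation step needs no extra pass over the `entries` table.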
+
**Pre-computed `name_folded` instead of SQL collation in the index (macOS)**: The old composite index `idx_parent_name(parent_id, name)` with `platform_case` collation took ~25 min to build for 5.1M entries because every B-tree comparison invoked NFD + case fold. The v5 workaround (simple `idx_parent` + match in Rust) required fetching all children per parent. `name_folded` stores the pre-computed `normalize_for_comparison(name)` at insert time, so the composite index uses binary collation and builds in seconds. `resolve_component` gets O(log n) lookups via a single indexed query.
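The idea of folding once at insert time and then probing a binary-ordered index can be sketched in memory as below. This is a simplified analogue, not the SQLite code: `fold_name` only lowercases (the real `normalize_for_comparison` is described as NFD plus case fold, which the standard library cannot do), and `FoldedIndex` is a hypothetical stand-in for the `(parent_id, name_folded)` index. Unlike the real index, a map key is unique, so a second insert with the same folded name would overwrite the first.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for `normalize_for_comparison`. Assumption: lowercase
/// only; the real function also applies NFD normalization.
fn fold_name(name: &str) -> String {
    name.to_lowercase()
}

/// Hypothetical in-memory analogue of the `(parent_id, name_folded)` index.
/// `BTreeMap` orders `String` keys bytewise, like a binary-collation B-tree.
struct FoldedIndex {
    by_parent_and_name: BTreeMap<(i64, String), i64>,
}

impl FoldedIndex {
    fn new() -> Self {
        Self {
            by_parent_and_name: BTreeMap::new(),
        }
    }

    /// Fold once at insert time, so key comparisons stay plain byte compares.
    fn insert(&mut self, parent_id: i64, name: &str, entry_id: i64) {
        self.by_parent_and_name
            .insert((parent_id, fold_name(name)), entry_id);
    }

    /// O(log n) lookup: fold the query once, then do a single keyed probe
    /// instead of scanning every sibling under `parent_id`.
    fn resolve_component(&self, parent_id: i64, name: &str) -> Option<i64> {
        self.by_parent_and_name
            .get(&(parent_id, fold_name(name)))
            .copied()
    }
}
```

The expensive normalization runs once per insert and once per query, rather than once per B-tree comparison, which is the trade the paragraph above describes.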
**Subtree aggregation uses scoped queries**: `scoped_get_children_stats_by_id` and `scoped_get_child_dir_ids` in `aggregator.rs` use recursive CTEs scoped to the target subtree, not full-table scans. This keeps subtree aggregation O(subtree_size) regardless of total DB size.
**Bounded buffers prevent OOM**: All buffers have capacity limits. FSEvents channel: 32K batches (bounded `try_send` in cmdr-fsevent-stream; overflow sets atomic flag, triggers rescan). Reconciler buffer: 500K events (overflow triggers full rescan). Writer channel: 20K messages (bounded `sync_channel`, backpressure). Replay `affected_paths`: 50K entries (overflow emits full refresh). Replay `pending_rescans`: 1K entries (overflow triggers full rescan). Replay event count: 1M events max (overflow falls back to full scan). Memory watchdog: warns at 8 GB, stops indexing at 16 GB. The index is a disposable cache, so dropping events and rescanning is always safe.
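The "drop and flag" pattern for a bounded buffer might look like the sketch below, using only the standard library. `BoundedEventSender` and `offer` are hypothetical names; the real FSEvents channel lives in `cmdr-fsevent-stream` and may be structured differently. What the sketch shares with the text is a bounded channel, a non-blocking `try_send`, and an atomic flag set on overflow so the consumer knows to rescan.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Hypothetical producer handle for a bounded event buffer.
struct BoundedEventSender<T> {
    tx: SyncSender<T>,
    overflowed: Arc<AtomicBool>,
}

impl<T> BoundedEventSender<T> {
    /// Returns the sender, the receiving end, and the shared overflow flag.
    fn new(capacity: usize) -> (Self, Receiver<T>, Arc<AtomicBool>) {
        let (tx, rx) = sync_channel(capacity);
        let overflowed = Arc::new(AtomicBool::new(false));
        let sender = Self {
            tx,
            overflowed: Arc::clone(&overflowed),
        };
        (sender, rx, overflowed)
    }

    /// Never blocks. When the buffer is full the event is dropped and the
    /// flag is set; the consumer is expected to respond with a rescan.
    fn offer(&self, event: T) -> bool {
        match self.tx.try_send(event) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) => {
                self.overflowed.store(true, Ordering::Release);
                false
            }
            Err(TrySendError::Disconnected(_)) => false,
        }
    }
}
```

Dropping is acceptable here only because the index is a disposable cache; the writer channel takes the opposite approach and blocks the sender (`send` on a `sync_channel`) to apply backpressure.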
@@ -175,6 +181,6 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
**Reconciler must delete old subtree on dir-to-file type changes**: When `reconcile_subtree` matches a filesystem entry to a DB entry by name, it must check if `is_directory` changed. If a directory became a file, `DeleteSubtreeById` must be sent before `UpsertEntryV2`. Without this, `INSERT OR REPLACE` keeps the same row ID (same `parent_id + name`), and the old directory's children become logical orphans — entries parented by a file.
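The ordering rule above can be illustrated with a small planning function. Everything here is a sketch: the variant names mirror the writer messages mentioned in the text, but their payloads, the `DbEntry` struct, and `plan_writes` are assumptions made for the example.

```rust
/// Illustrative subset of writer messages (payload shapes are assumptions).
#[derive(Debug, PartialEq)]
enum WriteMessage {
    DeleteSubtreeById { id: i64 },
    UpsertEntryV2 {
        parent_id: i64,
        name: String,
        is_directory: bool,
    },
}

/// Hypothetical view of the DB row matched by name.
struct DbEntry {
    id: i64,
    is_directory: bool,
}

/// Decide which messages to send for one filesystem entry.
fn plan_writes(
    parent_id: i64,
    name: &str,
    fs_is_directory: bool,
    db_entry: Option<&DbEntry>,
) -> Vec<WriteMessage> {
    let mut messages = Vec::new();
    if let Some(existing) = db_entry {
        // A directory that became a file: remove its old subtree first,
        // otherwise the children would end up parented by a file.
        if existing.is_directory && !fs_is_directory {
            messages.push(WriteMessage::DeleteSubtreeById { id: existing.id });
        }
    }
    messages.push(WriteMessage::UpsertEntryV2 {
        parent_id,
        name: name.to_string(),
        is_directory: fs_is_directory,
    });
    messages
}
```

The delete must precede the upsert in the returned vector because the writer processes messages in order on a single thread.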
-
**Scanner's `insert_entries_v2_batch` uses plain `INSERT`**: With the old `idx_parent_name` unique index, `INSERT OR REPLACE` would silently delete the old row and insert a new one with a new ID, orphaning all children. That unique index is gone (replaced by `idx_parent`), and the only unique constraint is the integer PK (`id`). Since `ScanContext` assigns unique IDs and the table is truncated before full scans (or descendants deleted before subtree scans), PK conflicts shouldn't occur. The batch insert uses plain `INSERT` to reflect this.
+
**Scanner's `insert_entries_v2_batch` uses plain `INSERT`**: With the old `idx_parent_name` unique index, `INSERT OR REPLACE` would silently delete the old row and insert a new one with a new ID, orphaning all children. That unique index is gone (replaced by `idx_parent_name_folded` on macOS / `idx_parent_name` on Linux), and the only unique constraint is the integer PK (`id`). Since `ScanContext` assigns unique IDs and the table is truncated before full scans (or descendants deleted before subtree scans), PK conflicts shouldn't occur. The batch insert uses plain `INSERT` to reflect this.
**IndexWriter exposes `db_path()`**: The scanner needs the DB path to open a temporary connection for `ScanContext::new()`. This path is stored on the `IndexWriter` handle and accessible via `db_path()`. The temporary connection is short-lived (only used to read `MAX(id)`).