vdavid
diff --git a/‎apps/desktop/src-tauri/src/indexing/CLAUDE.md‎
Lines changed: 9 additions & 4 deletions b/‎apps/desktop/src-tauri/src/indexing/CLAUDE.md‎
Lines changed: 9 additions & 4 deletions
@@ -15,7 +15,7 @@ Design history is in git (former `docs/specs/drive-indexing/`).
 - **expected_totals.rs** -- `expected_totals_for_sources()` returns the index-derived `(file count, byte total)` for a set of source paths so write operations (copy/move/delete) can render a real scan-phase progress bar before the foolproof re-scan completes. Per source: resolve_path → get_entry_by_id → if dir use `dir_stats`, if file use the entry's `logical_size`. Returns `None` if *any* source isn't covered by the index (no pool, no entry, no `dir_stats`, no `logical_size`). Partial totals would let the progress bar overshoot 100%. Uses the same `ReadPool` as enrichment for lock-free reads. Used by `scan_preview.rs` and `scan.rs` in `write_operations/`.
 - **event_loop.rs** -- `run_live_event_loop` (real-time FSEvents/inotify processing after scan completes), `run_replay_event_loop` (cold-start journal replay with two-phase approach), `run_background_verification` (post-replay bidirectional readdir diff), `merge_fs_events` (deduplication with flag priority), `process_live_batch` (three-phase: directory creations first sorted by depth + flush, then `detect_renames_by_inode` rename pre-pass + flush, then remaining events). The rename pre-pass turns `item_renamed` events whose new path's inode already lives in the DB at a different `(parent_id, name)` into a single `MoveEntryV2`, preserving the entry's `dir_stats`. All bounded-buffer constants live here.
 - **events.rs** -- Tauri event payload structs (`IndexScanStartedEvent`, `IndexScanProgressEvent`, `IndexScanCompleteEvent`, `IndexDirUpdatedEvent`, `IndexReplayProgressEvent`, `IndexReplayCompleteEvent`), `RescanReason` enum, `emit_rescan_notification()`, IPC response types (`IndexStatusResponse`, `IndexDebugStatusResponse`). Also: `ActivityPhase` enum (Replaying/Scanning/Aggregating/Reconciling/Live/Idle), `PhaseRecord` for the phase timeline, and `DebugStats` (shared atomic counters for the debug window + phase timeline via `set_phase()`/`close_phase_with_stats()`).
-- **store.rs** -- SQLite schema v9 (integer-keyed entries with `name_folded` column on all platforms, `inode` column for hardlink dedup, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` always queries by `(parent_id, name_folded)` using the `idx_parent_name_folded` composite index. On Linux/Windows, `normalize_for_comparison()` is the identity function, so `name_folded = name` and the index behaves identically to a `(parent_id, name)` index. Schema version check: mismatch triggers drop+rebuild. v7 added dual sizes (logical + physical). v8 adds `inode INTEGER` column and `idx_inode` index for hardlink dedup at write time. v9 unifies `name_folded` across all platforms (eliminates platform-conditional SQL). `has_sized_entry_for_inode()` checks if another entry with the same inode already has non-NULL sizes; `find_entry_by_inode()` returns the first row with a given inode (used by the live event loop's rename pre-pass). Both path-keyed (backward compat) and integer-keyed APIs.
+- **store.rs** -- SQLite schema v12 (integer-keyed entries with `name_folded` column on all platforms, `inode` column for hardlink dedup, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` always queries by `(parent_id, name_folded)` using the `idx_parent_name_folded` composite **UNIQUE** index. On Linux/Windows, `normalize_for_comparison()` is the identity function, so `name_folded = name` and the index behaves identically to a `(parent_id, name)` index. Schema version check: mismatch triggers drop+rebuild. v7 added dual sizes (logical + physical). v8 adds `inode INTEGER` column and `idx_inode` index for hardlink dedup at write time. v9 unifies `name_folded` across all platforms (eliminates platform-conditional SQL). v12 reinstates `UNIQUE` on `(parent_id, name_folded)` so the DB physically rejects duplicate rows (the v5 perf reason no longer applies since v6 pre-computes `name_folded`). `has_sized_entry_for_inode()` checks if another entry with the same inode already has non-NULL sizes; `find_entry_by_inode()` returns the first row with a given inode (used by the live event loop's rename pre-pass). Both path-keyed (backward compat) and integer-keyed APIs.
 - **metadata.rs** -- `MetadataSnapshot` struct and `extract_metadata()` function. Single location for all platform-specific metadata extraction (logical/physical size, mtime, inode, nlink). Used by scanner, reconciler, verifier, and event_loop. Symlinks get `None` everywhere. Files get sizes + inode + nlink. Directories get inode but no sizes/nlink. The inode is what the live event loop's rename pre-pass matches against to detect dir renames in place.
 - **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
 - **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `MoveEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. `MoveEntryV2 { entry_id, new_parent_id, new_name }` updates an entry's `(parent_id, name, name_folded)` in place, preserving its `id` and (for directories) `dir_stats`. Same-parent renames don't change ancestor totals; cross-parent moves subtract the entry's contribution from the old ancestor chain and add it to the new one (and recompute the OR-aggregated `recursive_has_symlinks` flag on both chains). The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance. The `idx_parent_name_folded` composite index uses binary collation and stays present during scans.
@@ -81,15 +81,15 @@ All writes go through a dedicated `std::thread` via a bounded `sync_channel` (20
 
 Reads happen on separate WAL connections (any thread). A `ReadPool` provides thread-local read connections for enrichment and verification without contending on the `INDEXING` state-machine mutex.
 
-### SQLite schema (v11: integer-keyed, unified `name_folded` column, inode for hardlink dedup + dir rename detection, `recursive_has_symlinks` flag)
+### SQLite schema (v12: integer-keyed, unified `name_folded` column, inode for hardlink dedup + dir rename detection, `recursive_has_symlinks` flag, UNIQUE `(parent_id, name_folded)`)
 
 One DB per volume. **Dev and prod use separate directories** (see AGENTS.md § Debugging):
 - **Prod**: `~/Library/Application Support/com.veszelovszki.cmdr/index-{volume_id}.db`
 - **Dev**: `~/Library/Application Support/com.veszelovszki.cmdr-dev/index-{volume_id}.db`
 
 Three tables:
 - `entries` (id INTEGER PK, parent_id, name COLLATE platform_case, name_folded, is_directory, is_symlink, logical_size, physical_size, modified_at, inode). Root sentinel: id=1, parent_id=0, name="".
-  - `name_folded TEXT NOT NULL` stores `normalize_for_comparison(name)` on all platforms. On macOS this is NFD + case fold; on Linux/Windows it's the identity (same as `name`). Index: `idx_parent_name_folded ON entries (parent_id, name_folded)`.
+  - `name_folded TEXT NOT NULL` stores `normalize_for_comparison(name)` on all platforms. On macOS this is NFD + case fold; on Linux/Windows it's the identity (same as `name`). Index: `idx_parent_name_folded ON entries (parent_id, name_folded)` (UNIQUE since v12).
   - The old `idx_parent(parent_id)` from v5 is removed; the composite index replaces it.
 - `dir_stats` (entry_id INTEGER PK, recursive_logical_size, recursive_physical_size, recursive_file_count, recursive_dir_count, recursive_has_symlinks)
 - `meta` (key TEXT PK, value TEXT) WITHOUT ROWID
@@ -106,6 +106,7 @@ History of changes:
 - **Schema v9**: Unified `name_folded` column and `idx_parent_name_folded` index across all platforms (previously macOS only). On Linux/Windows, `normalize_for_comparison()` is the identity function, so `name_folded = name`, one extra column with no behavioral change. Eliminates 12 `#[cfg(target_os)]` blocks from store.rs SQL.
 - **Schema v10**: Added `recursive_has_symlinks INTEGER NOT NULL DEFAULT 0` to `dir_stats`. OR-aggregated bottom-up: a directory's flag is `true` iff any direct child is a symlink OR any child directory's stored flag is `true`. Surfaced via `DirStats.recursive_has_symlinks` (IPC) and `FileEntry.recursive_has_symlinks` (enrichment). The frontend renders an inline `(i)` icon next to a folder's size in `SelectionInfo` to explain why a symlink-only folder shows 0 bytes (we follow `du`/Finder's behavior of not double-counting symlinked content). Schema bump triggers a full re-scan on next launch; no online migration. The index is a disposable cache.
 - **Schema v11**: No SQL change. Bumped to force a rebuild so the scanner repopulates `entries.inode` for directories (used to be `None`; see `metadata.rs`). Required for inode-based rename detection in the live event loop's `detect_renames_by_inode` pre-pass to actually do anything on existing user databases; without the bump, every directory in a pre-v11 DB would have `inode IS NULL` and `find_entry_by_inode` would never match.
+- **Schema v12**: Reinstated `UNIQUE` on `(parent_id, name_folded)` (the constraint v5 dropped). v5's perf reason was the `platform_case` collation B-tree cost; v6 sidestepped that by pre-computing `name_folded` at insert time, so the unique index now uses binary collation and builds in seconds. Without the constraint, two concurrent `IndexWriter` instances racing on the same DB (each with its own `Arc<AtomicI64>` ID counter) could insert distinct rows for the same `(parent_id, name)`, which the aggregator then double-counted into `dir_stats` — observed as a 1.83 TB ghost size on the `..` row of a 994 GB volume. The state-machine fix (lock-first `start_indexing`) closes the trigger; this schema constraint is the safety net.
 
 ## How to test
 
@@ -169,6 +170,10 @@ function is shared so the two gate sites can't drift apart.
 
 **Single-writer thread, not connection pooling**: SQLite write concurrency is limited by its single-writer design. Instead of fighting it with `BUSY_TIMEOUT` and retries, one dedicated thread owns the write connection. Eliminates contention entirely.
 
+**Lock-first `start_indexing`, atomic phase reservation**: `start_indexing` opens a temporary `IndexStore` and atomically claims the `Disabled -> Initializing(store)` transition via `try_reserve_initializing_phase()` BEFORE constructing the heavy `IndexManager` (which spawns the writer thread). If the phase is already `Initializing`, `Running`, or `ShuttingDown`, the call is a no-op. The previous design did the inline assignment `*guard = Initializing { store }` AFTER `IndexManager::new`, so two near-simultaneous calls both spawned writer threads — each with its own `Arc<AtomicI64>` ID counter and `AccumulatorMaps` — racing on the same DB. The lock-first guard makes second-call no-op behaviour the contract; the v12 UNIQUE index is the safety net if the contract is ever bypassed.
+
+**`accumulator.accumulate` runs only after the DB commit succeeds**: `handle_insert_entries_v2` previously updated `AccumulatorMaps` before calling `IndexStore::insert_entries_v2_batch`. The batch is wrapped in a SQLite savepoint that rolls back on any conflict (now including UNIQUE on `(parent_id, name_folded)`), so a failed batch left the accumulator claiming bytes that never reached the DB; `compute_all_aggregates_with_maps` then wrote inflated `dir_stats`. The fix is a two-line reorder: insert first, accumulate on `Ok`. With the lock-first guard preventing duplicate writers, the failure path is rare in practice, but the contract — "in-memory state never claims more than the DB has" — should hold under any future failure mode.
+
 **Index enrichment at read time, not cache time**: `recursive_size` fields are populated on every `get_file_range` call via a batch SQLite read from `dir_stats`. This avoids stale data and keeps enrichment consistent with the latest DB state. The cost is microseconds per page on a WAL connection.
 
 **Enrichment uses integer-keyed batch lookup**: Instead of N individual `resolve_path()` calls (one per directory in the listing), `enrich_entries_with_index` resolves the parent directory once, queries `list_child_dir_ids_and_names(parent_id)` for all child dir IDs, then `get_dir_stats_batch_by_ids()`. Two indexed queries total instead of N. Falls back to individual path resolution for edge cases (for example, mixed-parent entries).
@@ -262,7 +267,7 @@ per-event detail back: `RUST_LOG=cmdr_lib::indexing::reconciler=trace,cmdr_lib::
 
 **Reconciler must delete old subtree on dir-to-file type changes**: When `reconcile_subtree` matches a filesystem entry to a DB entry by name, it must check if `is_directory` changed. If a directory became a file, `DeleteSubtreeById` must be sent before `UpsertEntryV2`. Without this, `INSERT OR REPLACE` keeps the same row ID (same `parent_id + name`), and the old directory's children become logical orphans: entries parented by a file.
 
-**Scanner's `insert_entries_v2_batch` uses plain `INSERT`**: With the old `idx_parent_name` unique index, `INSERT OR REPLACE` would silently delete the old row and insert a new one with a new ID, orphaning all children. That unique index is gone (replaced by `idx_parent_name_folded` on macOS / `idx_parent_name` on Linux), and the only unique constraint is the integer PK (`id`). IDs are allocated from a shared `Arc<AtomicI64>` counter owned by `IndexWriter`, and the table is truncated before full scans (or descendants deleted before subtree scans), so PK conflicts shouldn't occur. The batch insert uses plain `INSERT` to reflect this.
+**Scanner's `insert_entries_v2_batch` uses plain `INSERT`**: `INSERT OR REPLACE` would silently delete the old row and insert a new one with a new ID, orphaning all children. The table has two unique constraints: PK on `id`, and (since v12) the composite UNIQUE `(parent_id, name_folded)`. IDs are allocated from a shared `Arc<AtomicI64>` counter owned by `IndexWriter`, and the table is truncated before full scans (or descendants deleted before subtree scans), so under a single live writer neither conflict can occur. The plain `INSERT` reflects the invariant. If a conflict ever does fire (e.g. someone defeats the lock-first guard), the batch savepoint rolls back and `handle_insert_entries_v2` skips the accumulator update, so the in-memory aggregation state stays consistent with the DB.
 
 **IndexWriter owns the shared `next_id` counter**: All ID allocation goes through an `Arc<AtomicI64>` owned by `IndexWriter`. The scanner's `ScanContext` atomically increments it via `alloc_id()`, and the writer thread bumps it via `fetch_max` after `UpsertEntryV2` inserts (which let SQLite auto-assign). `TruncateData` resets it to 2. This eliminates the race where `ScanContext` previously read a stale `MAX(id)` from a read connection while the writer had uncommitted inserts in its channel. `IndexWriter` still exposes `db_path()`. The scanner opens a temporary connection for `resolve_path` (subtree scans) and `ensure_root_sentinel` (volume scans), but no longer for ID allocation.