Indexing: Fix orphaned entries & missing dir sizes

vdavid · vdavid · commit 323ae866827f · 2026-03-05T11:08:29.000+01:00
`INSERT OR REPLACE` in `insert_entries_v2_batch` silently deleted old entries and re-inserted them with new IDs during subtree rescans, orphaning all children (their `parent_id` pointed to the deleted old ID). This caused unbounded DB growth (5.5M → 21M entries) and left every entry created after the initial full scan without a `dir_stats` row.

- Subtree scans now send `DeleteDescendantsById` before inserting, ensuring a clean slate with no orphans
- Subtree aggregation queries now use recursive CTEs scoped to the target subtree instead of full-table scans
- Schema version bumped to v3 to force DB rebuild on existing installs
- Added debug logging in `ScanContext::new` for path resolution failures
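
The orphaning mechanism can be reproduced with a small in-memory model. This is only a sketch: `Table`, `insert_or_replace`, `delete_descendants`, and the other names are hypothetical stand-ins for the `entries` table and the writer messages, not the project's actual code, and a `HashMap` replaces SQLite. It assumes, as described above, that a conflict on `(parent_id, name)` removes the old row and assigns a fresh ID.

```rust
use std::collections::HashMap;

const ROOT_ID: i64 = 1;

/// Hypothetical stand-in for the `entries` table: id -> (parent_id, name).
struct Table {
    rows: HashMap<i64, (i64, String)>,
    next_id: i64,
}

impl Table {
    fn new() -> Self {
        let mut rows = HashMap::new();
        rows.insert(ROOT_ID, (0, String::new())); // root sentinel
        Table { rows, next_id: ROOT_ID + 1 }
    }

    fn children_of(&self, parent_id: i64) -> Vec<i64> {
        self.rows.iter().filter(|(_, row)| row.0 == parent_id).map(|(&id, _)| id).collect()
    }

    /// Models `INSERT OR REPLACE` on a unique (parent_id, name) index:
    /// a conflicting row is dropped and the new row gets a new ID.
    fn insert_or_replace(&mut self, parent_id: i64, name: &str) -> i64 {
        let conflict = self
            .rows
            .iter()
            .find(|(_, row)| row.0 == parent_id && row.1 == name)
            .map(|(&id, _)| id);
        if let Some(old_id) = conflict {
            self.rows.remove(&old_id);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.rows.insert(id, (parent_id, name.to_string()));
        id
    }

    /// Models `DeleteDescendantsById`: removes everything below `root_id`, keeps `root_id`.
    fn delete_descendants(&mut self, root_id: i64) {
        let mut frontier = vec![root_id];
        while let Some(parent) = frontier.pop() {
            for id in self.children_of(parent) {
                self.rows.remove(&id);
                frontier.push(id);
            }
        }
    }

    /// Rows whose parent no longer exists (the root sentinel is exempt).
    fn orphan_count(&self) -> usize {
        self.rows
            .iter()
            .filter(|(id, row)| **id != ROOT_ID && !self.rows.contains_key(&row.0))
            .count()
    }
}

/// Initial scan of /a/b/c; returns the table and the ID of "a".
fn seed() -> (Table, i64) {
    let mut t = Table::new();
    let a = t.insert_or_replace(ROOT_ID, "a");
    let b = t.insert_or_replace(a, "b");
    t.insert_or_replace(b, "c");
    (t, a)
}

/// Rescan of subtree "a" without deleting first: returns (row count, orphan count).
fn naive_rescan() -> (usize, usize) {
    let (mut t, a) = seed();
    let b = t.insert_or_replace(a, "b"); // "b" gets a new ID; the old "c" keeps the old parent_id
    t.insert_or_replace(b, "c");
    (t.rows.len(), t.orphan_count())
}

/// Same rescan, preceded by the descendant delete.
fn rescan_with_delete() -> (usize, usize) {
    let (mut t, a) = seed();
    t.delete_descendants(a);
    let b = t.insert_or_replace(a, "b");
    t.insert_or_replace(b, "c");
    (t.rows.len(), t.orphan_count())
}

fn main() {
    println!("naive rescan:       {:?}", naive_rescan());
    println!("rescan with delete: {:?}", rescan_with_delete());
}
```

In this model the naive rescan leaves one extra row behind per replaced directory's child, which is the same shape as the growth reported above; deleting descendants first keeps the row count stable.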
diff --git a/apps/desktop/src-tauri/src/indexing/CLAUDE.md b/apps/desktop/src-tauri/src/indexing/CLAUDE.md
@@ -61,7 +61,7 @@ All writes go through a dedicated `std::thread` via an unbounded mpsc channel. T
 
 Reads happen on separate WAL connections (any thread). The global read-only store (`GLOBAL_INDEX_STORE`) provides enrichment without passing `AppHandle` through the listing pipeline.
 
-### SQLite schema (v2: integer-keyed)
+### SQLite schema (v3: integer-keyed)
 
 One DB per volume: `~/Library/Application Support/com.veszelovszki.cmdr/index-{volume_id}.db`
 
@@ -72,7 +72,7 @@ Three tables:
 
 WAL mode, 64 MB page cache. Custom `platform_case` collation registered on every connection: case-insensitive + NFD normalization on macOS, binary on Linux. **Opening the DB with the sqlite3 CLI will fail** on queries touching the name column (the collation isn't registered).
 
-**Migration in progress**: Schema bumped from v1 to v2. Milestones 1-5 complete. Scanner, writer, aggregator, reconciler, enrichment, and IPC commands all fully migrated to integer keys. `IndexManager` owns a `PathResolver` for LRU-cached path→ID resolution in IPC commands (`get_dir_stats`, `get_dir_stats_batch`). Enrichment uses integer-keyed fast path: resolve parent once → batch child dir stats by ID. Reconciler sends integer-keyed messages exclusively. Old path-keyed `WriteMessage` variants and backward-compat shims (`ScannedEntry`, `DirStats`) still exist for post-replay verification — cleanup in milestone 6. All 848 tests pass.
+**Schema v3**: Bumped from v2 to force DB rebuild after fixing orphan entry bug. Scanner, writer, aggregator, reconciler, enrichment, and IPC commands all fully migrated to integer keys. `IndexManager` owns a `PathResolver` for LRU-cached path→ID resolution in IPC commands (`get_dir_stats`, `get_dir_stats_batch`). Enrichment uses integer-keyed fast path: resolve parent once → batch child dir stats by ID. Reconciler sends integer-keyed messages exclusively. Old path-keyed `WriteMessage` variants and backward-compat shims (`ScannedEntry`, `DirStats`) still exist for post-replay verification — cleanup in milestone 6.
 
 ## How to test
 
@@ -103,6 +103,10 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 **Physical sizes (`st_blocks * 512`)**: More meaningful for disk usage than logical size. May overcount ~10-20% for APFS clones (shared blocks). Volume usage bar uses `statfs()` for true totals.
 
+**Subtree rescans delete descendants first**: `scan_subtree` sends `DeleteDescendantsById(root_id)` to the writer before inserting fresh entries. This prevents orphaned entries that previously caused DB bloat (4x) and missing dir_stats. The root entry is preserved (its existing ID is reused by `ScanContext`). The delete and subsequent inserts are serialized through the single writer channel, so no race conditions. `ComputeSubtreeAggregates` runs after the scan to recompute stats.
+
+**Subtree aggregation uses scoped queries**: `scoped_get_children_stats_by_id` and `scoped_get_child_dir_ids` in `aggregator.rs` use recursive CTEs scoped to the target subtree, not full-table scans. This keeps subtree aggregation O(subtree_size) regardless of total DB size.
+
 **Disposable cache pattern**: The index DB is a cache, not a source of truth. Any corruption or error triggers delete+rebuild. No user-facing errors for DB issues.
 
 **cmdr-fsevent-stream fork (macOS only)**: Our fork of `fsevent-stream` (v0.3.0) provides direct access to FSEvents event IDs, `sinceWhen` replay, and `MustScanSubDirs` flags. Only used on macOS. On Linux, the `notify` crate (inotify backend) provides recursive directory watching with `RecursiveMode::Recursive`.
@@ -137,6 +141,8 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 **Reconciler holds a read connection**: `process_fs_event`, `replay`, and `process_live_event` all require a `&Connection` parameter for path-to-ID resolution. Callers (event loops in mod.rs) open a read connection via `IndexStore::open_write_connection(writer.db_path())` at loop start and pass it through. This is a WAL-mode connection so it doesn't block the writer. The `IndexManager` also owns a `PathResolver` with LRU cache, used by IPC commands (`get_dir_stats`, `get_dir_stats_batch`) for cached resolution. The event loops don't use the `PathResolver` yet because they run in separate async tasks -- could be migrated in a future optimization pass.
 
-**ScanContext maps scan root to ROOT_ID**: Both `scan_volume` and `scan_subtree` create a `ScanContext` that maps the scan root directory to `ROOT_ID` (1). This means all top-level entries under any scan root get `parent_id = ROOT_ID` in the DB. For subtree scans, this means the scan root itself isn't stored as an entry — only its children are. The `ScanContext` opens a temporary read connection to the DB to fetch `next_id` via `get_next_id()`.
+**ScanContext maps scan root to ROOT_ID**: Both `scan_volume` and `scan_subtree` create a `ScanContext` that maps the scan root directory to `ROOT_ID` (1). This means all top-level entries under any scan root get `parent_id = ROOT_ID` in the DB. For subtree scans, the root is resolved to its existing entry ID (not ROOT_ID), and `DeleteDescendantsById` is sent before the scan starts. The `ScanContext` opens a temporary read connection to the DB to fetch `next_id` via `get_next_id()`.
+
+**Never use `INSERT OR REPLACE` on entries without deleting descendants first**: `INSERT OR REPLACE` on the `idx_parent_name` unique index silently deletes the old row and inserts a new one with a new ID. This orphans all children (their `parent_id` points to the deleted old ID) and orphans the old `dir_stats` row. The scanner's `insert_entries_v2_batch` still uses `INSERT OR REPLACE` as a safety net, but it's always preceded by `DeleteDescendantsById` for subtree scans, so no conflicts should occur in practice.
 
 **IndexWriter exposes `db_path()`**: The scanner needs the DB path to open a temporary connection for `ScanContext::new()`. This path is stored on the `IndexWriter` handle and accessible via `db_path()`. The temporary connection is short-lived (only used to read `MAX(id)`).
diff --git a/apps/desktop/src-tauri/src/indexing/aggregator.rs b/apps/desktop/src-tauri/src/indexing/aggregator.rs
@@ -127,16 +127,15 @@ pub fn compute_subtree_aggregates(conn: &Connection, root: &str) -> Result<u64,
     let dir_count = dir_entries.len();
     log::debug!("Subtree aggregation: starting bottom-up computation for {dir_count} directories under {root}");
 
-    // Load direct children stats scoped to this subtree
-    let dir_id_set: std::collections::HashSet<i64> = dir_entries.iter().map(|&(id, _)| id).collect();
-    let direct_stats = scoped_get_children_stats_by_id(conn, &dir_id_set)?;
+    // Load direct children stats scoped to this subtree via recursive CTE
+    let direct_stats = scoped_get_children_stats_by_id(conn, root_id)?;
     log::debug!(
         "Subtree aggregation: loaded stats for {} parent IDs in {:.1}ms",
         direct_stats.len(),
         start.elapsed().as_secs_f64() * 1000.0,
     );
 
-    let child_dirs_map = scoped_get_child_dir_ids(conn, &dir_id_set)?;
+    let child_dirs_map = scoped_get_child_dir_ids(conn, root_id)?;
     log::debug!(
         "Subtree aggregation: loaded child dirs for {} parent IDs in {:.1}ms",
         child_dirs_map.len(),
@@ -325,35 +324,70 @@ fn bulk_get_child_dir_ids(conn: &Connection) -> Result<HashMap<i64, Vec<i64>>, I
     Ok(map)
 }
 
-/// Load direct children stats scoped to a set of directory IDs.
+/// Load direct children stats scoped to a subtree via recursive CTE.
 ///
 /// Returns a map: `parent_id -> (total_file_size, file_count, dir_count)`.
-/// Only includes results where `parent_id` is in the provided set.
+/// Only includes entries whose parent is within the subtree rooted at `root_id`.
 fn scoped_get_children_stats_by_id(
     conn: &Connection,
-    dir_ids: &std::collections::HashSet<i64>,
+    root_id: i64,
 ) -> Result<HashMap<i64, (u64, u64, u64)>, IndexStoreError> {
-    // Use bulk query and filter in memory (more efficient than N individual queries)
-    let all_stats = bulk_get_children_stats_by_id(conn)?;
-    Ok(all_stats
-        .into_iter()
-        .filter(|(parent_id, _)| dir_ids.contains(parent_id))
-        .collect())
+    let mut stmt = conn.prepare(
+        "WITH RECURSIVE subtree(id) AS (
+            SELECT id FROM entries WHERE id = ?1
+            UNION ALL
+            SELECT e.id FROM entries e JOIN subtree s ON e.parent_id = s.id
+        )
+        SELECT e.parent_id,
+               COALESCE(SUM(CASE WHEN e.is_directory = 0 THEN e.size ELSE 0 END), 0),
+               COALESCE(SUM(CASE WHEN e.is_directory = 0 THEN 1 ELSE 0 END), 0),
+               COALESCE(SUM(CASE WHEN e.is_directory = 1 THEN 1 ELSE 0 END), 0)
+        FROM entries e
+        WHERE e.parent_id IN (SELECT id FROM subtree)
+        GROUP BY e.parent_id",
+    )?;
+    let rows = stmt.query_map(params![root_id], |row| {
+        Ok((
+            row.get::<_, i64>(0)?,
+            row.get::<_, u64>(1)?,
+            row.get::<_, u64>(2)?,
+            row.get::<_, u64>(3)?,
+        ))
+    })?;
+    let mut map = HashMap::new();
+    for row in rows {
+        let (parent_id, size, files, dirs) = row?;
+        map.insert(parent_id, (size, files, dirs));
+    }
+    Ok(map)
 }
 
-/// Load child directory IDs scoped to a set of parent directory IDs.
+/// Load child directory IDs scoped to a subtree via recursive CTE.
 ///
 /// Returns a map: `parent_id -> Vec<child_dir_id>`.
-/// Only includes results where `parent_id` is in the provided set.
+/// Only includes entries whose parent is within the subtree rooted at `root_id`.
 fn scoped_get_child_dir_ids(
     conn: &Connection,
-    dir_ids: &std::collections::HashSet<i64>,
+    root_id: i64,
 ) -> Result<HashMap<i64, Vec<i64>>, IndexStoreError> {
-    let all_children = bulk_get_child_dir_ids(conn)?;
-    Ok(all_children
-        .into_iter()
-        .filter(|(parent_id, _)| dir_ids.contains(parent_id))
-        .collect())
+    let mut stmt = conn.prepare(
+        "WITH RECURSIVE subtree(id) AS (
+            SELECT id FROM entries WHERE id = ?1
+            UNION ALL
+            SELECT e.id FROM entries e JOIN subtree s ON e.parent_id = s.id
+        )
+        SELECT e.parent_id, e.id FROM entries e
+        WHERE e.parent_id IN (SELECT id FROM subtree) AND e.is_directory = 1",
+    )?;
+    let rows = stmt.query_map(params![root_id], |row| {
+        Ok((row.get::<_, i64>(0)?, row.get::<_, i64>(1)?))
+    })?;
+    let mut map: HashMap<i64, Vec<i64>> = HashMap::new();
+    for row in rows {
+        let (parent_id, child_id) = row?;
+        map.entry(parent_id).or_default().push(child_id);
+    }
+    Ok(map)
 }
 
 // ── Tests ────────────────────────────────────────────────────────────
diff --git a/apps/desktop/src-tauri/src/indexing/mod.rs b/apps/desktop/src-tauri/src/indexing/mod.rs
@@ -1467,6 +1467,9 @@ fn verify_affected_dirs(affected_paths: &std::collections::HashSet<String>, writ
             });
 
             if is_dir {
+                log::debug!(
+                    "verify_affected_dirs: new dir on disk: {normalized} (parent_id={parent_id})"
+                );
                 new_dir_paths.push(normalized);
             } else if let Some(sz) = size {
                 // UpsertEntryV2 inserts the entry; propagate its size delta up the
diff --git a/apps/desktop/src-tauri/src/indexing/scanner.rs b/apps/desktop/src-tauri/src/indexing/scanner.rs
@@ -274,6 +274,17 @@ fn run_scan(
         ScanContext::new(&conn, root, is_volume_root).map_err(|e| ScanError::WriterSend(e.to_string()))?
     };
 
+    // For subtree rescans, delete existing descendants first to prevent orphaned entries.
+    // The scan will re-insert fresh children with correct parent-child relationships.
+    // The root entry itself is preserved (ScanContext resolved its existing ID).
+    if !is_volume_root
+        && let Some(&root_id) = scan_ctx.dir_ids.get(root)
+    {
+        writer
+            .send(WriteMessage::DeleteDescendantsById(root_id))
+            .map_err(|e| ScanError::WriterSend(e.to_string()))?;
+    }
+
     let walker = build_walker(root, num_threads, is_volume_root);
 
     for entry_result in walker {
diff --git a/apps/desktop/src-tauri/src/indexing/store.rs b/apps/desktop/src-tauri/src/indexing/store.rs
@@ -19,7 +19,7 @@
 use rusqlite::{Connection, OptionalExtension, params};
 use std::path::{Path, PathBuf};
 
-const SCHEMA_VERSION: &str = "2";
+const SCHEMA_VERSION: &str = "3";
 
 /// Root entry sentinel ID. All top-level entries have `parent_id = ROOT_ID`.
 pub const ROOT_ID: i64 = 1;
@@ -98,9 +98,35 @@ impl ScanContext {
         let root_id = if is_volume_root {
             ROOT_ID
         } else {
-            match resolve_path(conn, &root.to_string_lossy())? {
+            let root_str = root.to_string_lossy();
+            match resolve_path(conn, &root_str)? {
                 Some(id) => id,
                 None => {
+                    // Diagnose which component is missing by walking the path
+                    let stripped = root_str.strip_prefix('/').unwrap_or(&root_str);
+                    let mut current_id = ROOT_ID;
+                    for component in stripped.split('/') {
+                        if component.is_empty() {
+                            continue;
+                        }
+                        match IndexStore::resolve_component(conn, current_id, component) {
+                            Ok(Some(id)) => current_id = id,
+                            Ok(None) => {
+                                log::debug!(
+                                    "ScanContext::new: resolve_path({root_str}) failed at \
+                                     component \"{component}\" (parent_id={current_id})"
+                                );
+                                break;
+                            }
+                            Err(e) => {
+                                log::debug!(
+                                    "ScanContext::new: resolve_path({root_str}) errored at \
+                                     component \"{component}\" (parent_id={current_id}): {e}"
+                                );
+                                break;
+                            }
+                        }
+                    }
                     return Err(IndexStoreError::Sqlite(rusqlite::Error::QueryReturnedNoRows));
                 }
             }
@@ -821,6 +847,33 @@ impl IndexStore {
         Ok(())
     }
 
+    /// Delete all descendants of an entry (but not the entry itself) using recursive CTE.
+    ///
+    /// Used before subtree rescans to prevent orphaned entries. The root entry is kept
+    /// because the scanner's `ScanContext` resolves it by path and uses its existing ID.
+    pub fn delete_descendants_by_id(conn: &Connection, root_id: i64) -> Result<(), IndexStoreError> {
+        // Collect descendant IDs (excluding root) then delete dir_stats and entries
+        conn.execute(
+            "WITH RECURSIVE descendants(id) AS (
+                SELECT id FROM entries WHERE parent_id = ?1
+                UNION ALL
+                SELECT e.id FROM entries e JOIN descendants d ON e.parent_id = d.id
+            )
+            DELETE FROM dir_stats WHERE entry_id IN (SELECT id FROM descendants)",
+            params![root_id],
+        )?;
+        conn.execute(
+            "WITH RECURSIVE descendants(id) AS (
+                SELECT id FROM entries WHERE parent_id = ?1
+                UNION ALL
+                SELECT e.id FROM entries e JOIN descendants d ON e.parent_id = d.id
+            )
+            DELETE FROM entries WHERE id IN (SELECT id FROM descendants)",
+            params![root_id],
+        )?;
+        Ok(())
+    }
+
     /// Delete an entire subtree by root entry ID using recursive CTE.
     ///
     /// No internal transaction: safe to call inside an outer `BEGIN IMMEDIATE`.
@@ -952,7 +1005,7 @@ mod tests {
     fn schema_creation_and_version() {
         let (store, _dir) = open_temp_store();
         let status = store.get_index_status().unwrap();
-        assert_eq!(status.schema_version.as_deref(), Some("2"));
+        assert_eq!(status.schema_version.as_deref(), Some(SCHEMA_VERSION));
     }
 
     #[test]
@@ -1120,7 +1173,7 @@ mod tests {
 
         // Schema version should be re-stamped
         let version = IndexStore::get_meta(&write_conn, "schema_version").unwrap();
-        assert_eq!(version.as_deref(), Some("2"));
+        assert_eq!(version.as_deref(), Some(SCHEMA_VERSION));
 
         // Entries should be gone (except root sentinel)
         let children = store.list_children(ROOT_ID).unwrap();
@@ -1142,7 +1195,7 @@ mod tests {
         // Re-open: should detect mismatch and reset
         let store = IndexStore::open(&db_path).unwrap();
         let status = store.get_index_status().unwrap();
-        assert_eq!(status.schema_version.as_deref(), Some("2"));
+        assert_eq!(status.schema_version.as_deref(), Some(SCHEMA_VERSION));
     }
 
     #[test]
@@ -1156,7 +1209,7 @@ mod tests {
         // open() should recover by deleting and recreating
         let store = IndexStore::open(&db_path).unwrap();
         let status = store.get_index_status().unwrap();
-        assert_eq!(status.schema_version.as_deref(), Some("2"));
+        assert_eq!(status.schema_version.as_deref(), Some(SCHEMA_VERSION));
     }
 
     #[test]
diff --git a/apps/desktop/src-tauri/src/indexing/writer.rs b/apps/desktop/src-tauri/src/indexing/writer.rs
@@ -35,6 +35,9 @@ pub enum WriteMessage {
     DeleteEntryById(i64),
     /// Watcher: delete a subtree (directory removed with all children) by entry ID.
     DeleteSubtreeById(i64),
+    /// Scanner: delete all descendants of an entry before a subtree rescan.
+    /// Prevents orphaned entries when re-scanning an already-indexed subtree.
+    DeleteDescendantsById(i64),
     /// Watcher: incremental delta propagation walking the parent_id chain.
     PropagateDeltaById {
         entry_id: i64,
@@ -182,7 +185,9 @@ impl WriterStats {
             WriteMessage::InsertEntriesV2(_) => self.current.insert_entries += 1,
             WriteMessage::UpsertEntryV2 { .. } => self.current.upsert_entry += 1,
             WriteMessage::DeleteEntryById(_) => self.current.delete_entry += 1,
-            WriteMessage::DeleteSubtreeById(_) => self.current.delete_subtree += 1,
+            WriteMessage::DeleteSubtreeById(_) | WriteMessage::DeleteDescendantsById(_) => {
+                self.current.delete_subtree += 1;
+            }
             WriteMessage::PropagateDeltaById { .. } => self.current.propagate_delta += 1,
             WriteMessage::ComputeAllAggregates | WriteMessage::ComputeSubtreeAggregates { .. } => {
                 self.current.compute_aggregates += 1;
@@ -307,10 +312,17 @@ fn process_message(conn: &rusqlite::Connection, msg: WriteMessage, stats: &Write
                     }
                 }
                 Ok(None) => {
-                    if let Err(e) =
-                        IndexStore::insert_entry_v2(conn, parent_id, &name, is_directory, is_symlink, size, modified_at)
-                    {
-                        log::warn!("Index writer: insert_entry_v2 failed for {name}: {e}");
+                    match IndexStore::insert_entry_v2(
+                        conn, parent_id, &name, is_directory, is_symlink, size, modified_at,
+                    ) {
+                        Ok(new_id) => {
+                            log::debug!(
+                                "Writer: UpsertEntryV2 inserted \"{name}\" (parent_id={parent_id}) → id={new_id}"
+                            );
+                        }
+                        Err(e) => {
+                            log::warn!("Index writer: insert_entry_v2 failed for {name}: {e}");
+                        }
                     }
                 }
                 Err(e) => {
@@ -349,6 +361,13 @@ fn process_message(conn: &rusqlite::Connection, msg: WriteMessage, stats: &Write
                 propagate_delta_by_id(conn, pid, size_delta, file_delta, dir_delta);
             }
         }
+        WriteMessage::DeleteDescendantsById(root_id) => {
+            // No delta propagation: the subtree will be immediately re-scanned and
+            // ComputeSubtreeAggregates will recompute stats for the subtree root.
+            if let Err(e) = IndexStore::delete_descendants_by_id(conn, root_id) {
+                log::warn!("Index writer: delete_descendants_by_id failed for id={root_id}: {e}");
+            }
+        }
         WriteMessage::PropagateDeltaById {
             entry_id,
             size_delta,