SMB tests: PID-unique write dir + stack docs

vdavid · vdavid · commit 7905a4ea451f · 2026-06-05T23:53:40.000+02:00
- `run_concurrent_write_pass` now prefixes its dest dir with the PID (`{TEST_PREFIX_ROOT}{pid}-{ts}-n{n_files}`), mirroring `test_dir_name()`. The old 1-second `ts` granularity was collision-prone across concurrent worktree sessions sharing the same `smb-consumer-maxreadsize` container — the only fixed-path write the concurrency audit flagged.
- Note in the soak test that a manual soak shares the machine-wide stack; a concurrent load that ramps mid-soak can inflate its relative drift ratio.
- Document the shared-stack lease model: holder-id leases, the `manual` sentinel, adopt-or-start, lock-held teardown, and how to force down a lingering stack. Update `smb-servers/README.md` (with a Decision/Why), `scripts/check/CLAUDE.md` (orchestrator's machine-wide leasing), and `docs/tooling/testing.md` (concurrent runs coexist + leaked-stack recovery).
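
The soak-test note distinguishes uniform from ramping contention. A minimal sketch of why a relative drift ratio behaves that way follows. This diff does not show the soak test's actual computation, so `drift_ratio`, the window rule, and the sample timings are all hypothetical; the sketch only assumes "mean of the last 10% of samples divided by the mean of the first 10%".

```rust
/// Mean of a non-empty slice of per-iteration durations (milliseconds).
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Hypothetical drift metric (not the soak test's real code): mean of the
/// last 10% of samples divided by the mean of the first 10%.
/// Returns `None` when there are fewer than 10 samples.
fn drift_ratio(samples_ms: &[f64]) -> Option<f64> {
    let window = samples_ms.len() / 10;
    if window == 0 {
        return None;
    }
    let first = mean(&samples_ms[..window]);
    let last = mean(&samples_ms[samples_ms.len() - window..]);
    Some(last / first)
}

fn main() {
    // Made-up baseline: 100 iterations at a steady 20 ms each.
    let baseline: Vec<f64> = vec![20.0; 100];

    // Uniform contention: every iteration is 3x slower, so both windows
    // scale by the same factor and the ratio stays at 1.
    let uniform: Vec<f64> = baseline.iter().map(|ms| ms * 3.0).collect();

    // Ramping contention: the slowdown grows linearly from 1x to 3x over
    // the run, so the last window is much slower than the first.
    let ramping: Vec<f64> = baseline
        .iter()
        .enumerate()
        .map(|(i, ms)| ms * (1.0 + 2.0 * i as f64 / 99.0))
        .collect();

    println!("baseline: {:?}", drift_ratio(&baseline));
    println!("uniform:  {:?}", drift_ratio(&uniform));
    println!("ramping:  {:?}", drift_ratio(&ramping));
}
```

Under these assumptions the uniform case keeps a ratio of about 1 while the ramping case lands well above 2, which is the false-failure shape the soak-test comment warns about.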
diff --git a/apps/desktop/src-tauri/src/file_system/volume/backends/smb_integration_test.rs b/apps/desktop/src-tauri/src/file_system/volume/backends/smb_integration_test.rs
@@ -1438,7 +1438,12 @@ async fn run_concurrent_write_pass(
         .duration_since(std::time::UNIX_EPOCH)
         .unwrap()
         .as_secs();
-    let unique_prefix = format!("{TEST_PREFIX_ROOT}{ts}-n{n_files}");
+    // Include the PID so two concurrent runs (different worktrees sharing the
+    // same `smb-consumer` container) never target the same dest dir within the
+    // same wall-clock second. Mirrors `test_dir_name()`'s uniqueness recipe;
+    // `ts`'s 1-second granularity alone is collision-prone across sessions.
+    let pid = std::process::id();
+    let unique_prefix = format!("{TEST_PREFIX_ROOT}{pid}-{ts}-n{n_files}");
 
     let dest_dir_abs = mount_path.join(unique_prefix.trim_start_matches('/'));
     let _ = vol.create_directory(&mount_path.join("_test")).await;
diff --git a/apps/desktop/src-tauri/src/file_system/volume/backends/smb_soak_test.rs b/apps/desktop/src-tauri/src/file_system/volume/backends/smb_soak_test.rs
@@ -5,6 +5,13 @@
 //! `CMDR_SOAK_ITERATIONS` or `CMDR_SOAK_DURATION_SECS` set. Declared as a
 //! `#[cfg(test)]` submodule of `smb`; shared helpers come from
 //! `super::smb_test_support`.
+//!
+//! Run a manual soak alone: it shares the machine-wide `smb-consumer` stack
+//! with any other live SMB work (a sibling worktree's integration suite, an
+//! E2E run), and a concurrent load that *ramps* mid-soak can inflate the
+//! drift ratio below into a false failure. The drift assertion is relative
+//! (last-10% vs first-10%), so a *uniform* concurrent slowdown won't trip it,
+//! but a load that grows over the soak's lifetime can.
 
 use super::smb_test_support::*;
 use super::*;
diff --git a/apps/desktop/test/smb-servers/README.md b/apps/desktop/test/smb-servers/README.md
@@ -19,3 +19,58 @@ CI runs the Rust SMB integration tests automatically via the `desktop-rust-integ
 Locally, `./scripts/check.sh --rust` includes the same check.
 
 See [docs/guides/testing/smb-servers.md](../../../../docs/guides/testing/smb-servers.md) for the full documentation.
+
+## Shared stack across worktrees (the lease)
+
+The `smb-consumer` stack is a **machine-wide shared resource**. Every bring-up (this `start.sh`, the check-runner's
+orchestrator, `e2e-linux.sh`) and every teardown routes through a Go lease helper (`scripts/check/smblease`) so
+concurrent sessions in different git worktrees stop tearing each other's containers down. You don't normally interact
+with the lease directly — `start.sh` / `stop.sh` handle it — but here's the model:
+
+- **Holder-id leases.** Each live user writes one file under `/tmp/cmdr-smb-leases/<holder-id>`, guarded by a flock on
+  `/tmp/cmdr-smb.lock`. Bring-up **adopts** an already-serving stack (no compose call) or **reconciles** it via `up -d`;
+  teardown removes the caller's lease and downs the stack **only when zero leases remain**.
+- **The `manual` sentinel.** A bare `./start.sh` registers as the holder-id `manual`. Being non-numeric, it's the one
+  lease the dead-PID sweep never reaps. It exists because `start.sh` exits seconds after the `up`: a PID-keyed lease
+  would be swept on the next acquire and tear the stack down under a live session. Numeric holders (`e2e-linux.sh`'s
+  `$$`, the orchestrator's `check.sh` PID) are long-lived processes, so their leases are swept only once the process dies.
+- **`./stop.sh`** releases the `manual` lease. If another session still holds a lease, the stack stays **up** — running
+  `stop.sh` while a sibling worktree's suite is live no longer kills it.
+
+### Force-down a lingering stack
+
+A forgotten `manual` lease (or a leaked numeric one) keeps the stack up. To reap it when you're sure nothing else needs
+it:
+
+```bash
+# Confirm the state first:
+(cd ../../../../scripts/check && go run ./smb-lease status)
+rm -rf /tmp/cmdr-smb-leases && ./stop.sh   # clear all leases, then down
+```
+
+`contention-check.sh` in this directory is the repeatable acceptance test for the whole mechanism: a dummy holder must
+survive another session's full acquire→run→release cycle, and the stack must down only at zero holders.
+
+### Decision: holder-id leases + adopt-or-start + lock-held teardown
+
+**Why a lease at all.** All worktrees resolve the same `smb-consumer` project on the same fixed host ports, so any one
+session's raw `compose down` (from `stop.sh`, the orchestrator's `Stop`, or `e2e-linux.sh`'s restart path) tore the
+shared stack out from under a live suite in another worktree, producing `Cannot reach smb-consumer-X` cascades —
+observed repeatedly. A second session's `up` with slightly different config could `--force-recreate` the running
+containers mid-run. The lease closes both races.
+
+**Why adopt-or-start, not just `up -d`.** When the stack is already serving the requested config, the helper issues **no
+compose call** — it adopts. That's what prevents the recreate-mid-run failure. A blind `up -d` from a second session
+could disturb healthy containers; adoption never touches them.
+
+**Why the lock is held across the `down`.** Releasing the flock before the `compose down` reopens the exact teardown
+race we're closing: an arriving acquirer would see zero leases, start a fresh `up` while the old `down` is mid-flight,
+and get half-torn-down containers. Acquire → re-verify zero → down → release all happen inside one held lock.
+
+**Why the `manual` sentinel exists.** A naive `<self-pid>` lease breaks every standalone caller: `start.sh` exits
+seconds after its `up`, so its PID is dead by the next acquire and the dead-PID sweep reaps it, downing the stack under
+a live session. The non-numeric `manual` holder-id is never swept, so a forgotten `manual` lease lingers — the
+**benign** direction (a human reaps it with `stop.sh`), never a teardown under a live run. The whole design degrades to
+"leave it UP" on any doubt, never to "tear it down."
+
+See [`scripts/check/smblease`](../../../../scripts/check/smblease/smblease.go) for the full lock/lease/policy model.
diff --git a/docs/tooling/testing.md b/docs/tooling/testing.md
@@ -96,6 +96,14 @@ entirely (mount requires permissions a headless run can't grant); Linux uses GVF
 tests have a known GVFS race in Docker (the `UDisks2VolumeMonitor` warning, see `gio mount` failures); they flake
 ~10-20% of the time. Treated as a pre-existing environmental issue, not the test's fault.
 
+**The stack is shared machine-wide.** Concurrent SMB-touching runs across git worktrees (two `check.sh` invocations, or
+a `check.sh` plus a manual `start.sh`) now coexist: every bring-up and teardown routes through a Go lease helper
+(`scripts/check/smblease`) that refcounts holders and downs the stack only when the last one leaves. So a sibling
+worktree's teardown no longer kills your live suite. If a leaked lease keeps the stack up after everything's idle, check
+state with `(cd scripts/check && go run ./smb-lease status)` and force it down with
+`rm -rf /tmp/cmdr-smb-leases && apps/desktop/test/smb-servers/stop.sh`. See `apps/desktop/test/smb-servers/README.md` §
+"Shared stack across worktrees" for the full model.
+
 ### MCP servers (for ad-hoc exploration during test writing)
 
 When the dev server is running (`pnpm dev` at repo root):
diff --git a/scripts/check/CLAUDE.md b/scripts/check/CLAUDE.md
@@ -109,7 +109,9 @@ dot | dot -Tpng -o checks.png`).
 | `stats.go`            | CSV stats logging (`logCheckStats`): appends one row per check to `~/cmdr-check-log.csv`                                           |
 | `colors.go`           | ANSI color constants                                                                                                               |
 | `utils.go`            | `findRootDir()` (walks up until `apps/desktop/src-tauri/Cargo.toml` is found)                                                      |
-| `smb_orchestrator.go` | Runner-level SMB Docker lifecycle (start once at runner init, stop at exit)                                                        |
+| `smb_orchestrator.go` | Runner-level SMB Docker lifecycle: acquires a machine-wide lease (via `smblease`) at init, releases at exit                        |
+| `smblease/`           | Library: the machine-wide flock + holder-id refcount that makes the shared `smb-consumer` stack safe across worktrees              |
+| `smb-lease/`          | Thin `package main` CLI onto `smblease` (`acquire`/`release`/`reconcile`/`status`) that the bash scripts shell out to              |
 | `freestyle.go`        | All freestyle.sh remote-VM execution logic, including `preferFreestyleRun`                                                         |
 | `checks/`             | One file per check, plus `common.go` (shared utils) and `registry.go` (the `AllChecks` ordered list)                               |
 
@@ -218,19 +220,27 @@ when deps haven't changed. A marker file (`node_modules/.pnpm-install-marker`) s
 each successful install. On the next run, if the mtime matches, install is skipped. The marker lives inside
 `node_modules/` so it's automatically invalidated if `node_modules` is deleted. Always runs in CI (`--ci`).
 
-**Decision**: SMB Docker container lifecycle is owned by a runner-level orchestrator, not per-check. **Why**: Multiple
-checks (`desktop-rust-integration-tests`, `desktop-e2e-linux`) need the shared `smb-consumer` Docker Compose project.
-Before, each owned the lifecycle: start in entry, `defer ./stop.sh` in cleanup. When both ran in parallel under
-`--include-slow`, whichever finished first would tear down containers the other was still using, producing
-`Cannot reach smb-consumer-X` cascades. `SmbOrchestrator` (`scripts/check/smb_orchestrator.go`) lifts lifecycle one
-level up: at runner init, after `selectChecks()` resolves the planned set, the orchestrator brings up the union of
-`NeedsSmb` modes (`SmbModeCore` for integration tests, `SmbModeE2E` for e2e). At runner exit (normal, `--fail-fast`, or
-SIGINT) it tears down once. Checks marked `NeedsSmb` no longer manage their own lifecycle: they assume the containers
-are up and call `waitForSmbContainers` as a cheap mid-run zombie-guard. The smaller scripts (`start.sh`,
-`e2e-linux.sh::start_smb_containers`) keep working standalone for `pnpm test:e2e:linux` invocations outside the check
-runner; under check.sh their start.sh invocation just sees the orchestrator's containers already running and probes are
-idempotent. The SIGINT handler in `main.go` captures the orchestrator via shared variable so a Ctrl+C also triggers
-`./stop.sh` with a banner before exiting 130.
+**Decision**: SMB Docker container lifecycle is owned by a runner-level orchestrator that holds a machine-wide lease,
+not per-check and not per-process. **Why**: Multiple checks (`desktop-rust-integration-tests`, `desktop-e2e-linux`) need
+the shared `smb-consumer` Docker Compose project. Two layers of contention had to be solved:
+
+- _Intra-process_: each check used to own the lifecycle (start in entry, `defer ./stop.sh` in cleanup); two in one run
+  raced each other. `SmbOrchestrator` (`scripts/check/smb_orchestrator.go`) lifts lifecycle one level up — at runner
+  init, after `selectChecks()` resolves the planned set, it brings up the union of `NeedsSmb` modes (`SmbModeCore` for
+  integration tests, `SmbModeE2E` for e2e) once, and tears down once at runner exit. Checks marked `NeedsSmb` assume the
+  containers are up and call `waitForSmbContainers` as a cheap mid-run zombie-guard.
+- _Cross-process / cross-worktree_: two `check.sh` runs (or a `check.sh` plus a manual `start.sh`) in different
+  worktrees have independent orchestrators, so in-process state can't stop them racing the same containers. The
+  orchestrator therefore takes a **machine-wide lease** via the `smblease` library (holder-id = its own `check.sh` PID).
+  `EnsureStarted` calls `smblease.Acquire` (adopt-or-reconcile under a flock); `Stop` calls `smblease.Release` (down
+  only at zero holders, lock held across the down). The orchestrator imports the lib in-process — no subprocess —
+  because it's already Go in the same module.
+
+The standalone scripts (`start.sh`, `e2e-linux.sh::start_smb_containers`) take their **own** leases (`manual` for
+`start.sh`, `$$` for `e2e-linux.sh`), so a manual run alongside a `check.sh` run just registers as a second holder and
+neither tears the other's stack down. The SIGINT handler in `main.go` captures the orchestrator via shared variable so a
+Ctrl+C also releases the lease (with a banner) before exiting 130. See [`smblease/smblease.go`](smblease/smblease.go)
+for the lock/lease/policy model.
 
 **Decision**: cmdr's SMB stack binds a dedicated host-port range (11480+), not smb2's default (10480+). **Why**: cmdr
 runs a _vendored copy_ of smb2's `consumer` compose under its own project name (`smb-consumer`), while smb2's own test
@@ -305,33 +315,17 @@ via `[settings] disable_tools = ["pnpm"]` in `/root/.config/mise/config.toml`.
 **`--only-slow` needs ~20 min timeout.** Slow checks (E2E tests, `rust-tests-linux`) take significantly longer than the
 default checks. When running `--only-slow` via an agent or CI, set the timeout to at least 20 minutes (1,200,000 ms).
 
-**Never run two `./scripts/check.sh` invocations concurrently if either touches SMB.** The `SmbOrchestrator` is scoped
-to one runner process: it starts the `smb-consumer` Docker Compose project at runner init and tears it down at runner
-exit. Two parallel invocations get two orchestrators racing the same containers. The first to finish runs `./stop.sh`
-while the other is still mid-test, producing `Cannot reach smb-consumer-X` cascades. Symptom: a previously green check
-(typically `desktop-e2e-linux` or `desktop-rust-integration-tests`) starts failing several SMB tests with 30 s timeouts
-in the second-to-finish run.
-
-The right way to run two SMB-touching checks together is one invocation with multiple `--check` flags so the same
-orchestrator owns the containers, or sequentially. For example:
-
-```sh
-# Good: one orchestrator, shared SMB stack
-./scripts/check.sh --check desktop-e2e-linux --check desktop-e2e-playwright
-
-# Also fine: sequential
-./scripts/check.sh --check desktop-e2e-linux
-./scripts/check.sh --check desktop-e2e-playwright
-
-# Wrong: two orchestrators racing
-./scripts/check.sh --check desktop-e2e-linux &
-./scripts/check.sh --check desktop-e2e-playwright &
-```
-
-Same applies to running a check.sh invocation alongside a raw `pnpm test:e2e:linux` or
-`apps/desktop/test/smb-servers/start.sh` in another terminal — only one process should own the SMB stack at a time. The
-`e2e-linux.sh` and `start.sh` scripts are safe to run standalone when no `check.sh` is also running, but they don't
-coordinate with each other across processes.
+**Concurrent SMB-touching runs across worktrees now coexist.** Two `./scripts/check.sh` invocations in different
+worktrees (or a `check.sh` alongside a manual `start.sh` / `pnpm test:e2e:linux`) each take a machine-wide `smblease`
+lease and share the same `smb-consumer` stack. Whichever finishes first releases its lease but sees a non-zero refcount,
+so it does **not** down the stack, which keeps serving the other run. The stack downs only when the last holder leaves.
+The old `Cannot reach smb-consumer-X` cascade (one run's teardown killing another's mid-test) is the exact failure the
+lease closes.
+
+A leaked or lingering stack (a forgotten manual `start.sh`, or a numeric holder whose PID got recycled) is the benign
+direction: it stays up until a human reaps it. Check state with `(cd scripts/check && go run ./smb-lease status)`; force
+it down with `rm -rf /tmp/cmdr-smb-leases && apps/desktop/test/smb-servers/stop.sh`. See
+`apps/desktop/test/smb-servers/README.md` § "Shared stack across worktrees" and `smblease/smblease.go`.
 
 ## Dependencies