Skip to content

Commit 7905a4e

Browse files
committed
SMB tests: PID-unique write dir + stack docs
- `run_concurrent_write_pass` now prefixes its dest dir with the PID (`{TEST_PREFIX_ROOT}{pid}-{ts}-n{n_files}`), mirroring `test_dir_name()`. The old 1-second `ts` granularity was collision-prone across concurrent worktree sessions sharing the same `smb-consumer-maxreadsize` container — the only fixed-path write the concurrency audit flagged. - Note in the soak test that a manual soak shares the machine-wide stack; a concurrent load that ramps mid-soak can inflate its relative drift ratio. - Document the shared-stack lease model: holder-id leases, the `manual` sentinel, adopt-or-start, lock-held teardown, and how to force-down a lingering stack. Updated `smb-servers/README.md` (with a Decision/Why), `scripts/check/CLAUDE.md` (orchestrator's machine-wide leasing), and `docs/tooling/testing.md` (concurrent runs coexist + leaked-stack recovery).
1 parent b430723 commit 7905a4e

5 files changed

Lines changed: 111 additions & 42 deletions

File tree

apps/desktop/src-tauri/src/file_system/volume/backends/smb_integration_test.rs

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1438,7 +1438,12 @@ async fn run_concurrent_write_pass(
14381438
.duration_since(std::time::UNIX_EPOCH)
14391439
.unwrap()
14401440
.as_secs();
1441-
let unique_prefix = format!("{TEST_PREFIX_ROOT}{ts}-n{n_files}");
1441+
// Include the PID so two concurrent runs (different worktrees sharing the
1442+
// same `smb-consumer` container) never target the same dest dir within the
1443+
// same wall-clock second. Mirrors `test_dir_name()`'s uniqueness recipe;
1444+
// `ts`'s 1-second granularity alone is collision-prone across sessions.
1445+
let pid = std::process::id();
1446+
let unique_prefix = format!("{TEST_PREFIX_ROOT}{pid}-{ts}-n{n_files}");
14421447

14431448
let dest_dir_abs = mount_path.join(unique_prefix.trim_start_matches('/'));
14441449
let _ = vol.create_directory(&mount_path.join("_test")).await;

apps/desktop/src-tauri/src/file_system/volume/backends/smb_soak_test.rs

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,13 @@
55
//! `CMDR_SOAK_ITERATIONS` or `CMDR_SOAK_DURATION_SECS` set. Declared as a
66
//! `#[cfg(test)]` submodule of `smb`; shared helpers come from
77
//! `super::smb_test_support`.
8+
//!
9+
//! Run a manual soak alone: it shares the machine-wide `smb-consumer` stack
10+
//! with any other live SMB work (a sibling worktree's integration suite, an
11+
//! E2E run), and a concurrent load that *ramps* mid-soak can inflate the
12+
//! drift ratio below into a false failure. The drift assertion is relative
13+
//! (last-10% vs first-10%), so a *uniform* concurrent slowdown won't trip it,
14+
//! but a load that grows over the soak's lifetime can.
815
916
use super::smb_test_support::*;
1017
use super::*;

apps/desktop/test/smb-servers/README.md

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,3 +19,58 @@ CI runs the Rust SMB integration tests automatically via the `desktop-rust-integ
1919
Locally, `./scripts/check.sh --rust` includes the same check.
2020

2121
See [docs/guides/testing/smb-servers.md](../../../../docs/guides/testing/smb-servers.md) for the full documentation.
22+
23+
## Shared stack across worktrees (the lease)
24+
25+
The `smb-consumer` stack is a **machine-wide shared resource**. Every bring-up (this `start.sh`, the check-runner's
26+
orchestrator, `e2e-linux.sh`) and every teardown routes through a Go lease helper (`scripts/check/smblease`) so
27+
concurrent sessions in different git worktrees stop tearing each other's containers down. You don't normally interact
28+
with the lease directly — `start.sh` / `stop.sh` handle it — but here's the model:
29+
30+
- **Holder-id leases.** Each live user writes one file under `/tmp/cmdr-smb-leases/<holder-id>`, guarded by a flock on
31+
`/tmp/cmdr-smb.lock`. Bring-up **adopts** an already-serving stack (no compose call) or **reconciles** it via `up -d`;
32+
teardown removes the caller's lease and downs the stack **only when zero leases remain**.
33+
- **The `manual` sentinel.** A bare `./start.sh` registers as the holder-id `manual`. It's the one lease the dead-PID
34+
sweep never reaps (it's non-numeric), because `start.sh` exits seconds after the `up` — a PID-keyed lease would be
35+
swept on the next acquire and tear the stack down under a live session. Numeric holders (`e2e-linux.sh`'s `$$`, the
36+
orchestrator's `check.sh` PID) are long-lived processes, so their leases are swept when the process dies.
37+
- **`./stop.sh`** releases the `manual` lease. If another session still holds a lease, the stack stays **up** — running
38+
`stop.sh` while a sibling worktree's suite is live no longer kills it.
39+
40+
### Force-down a lingering stack
41+
42+
A forgotten `manual` lease (or a leaked numeric one) keeps the stack up. To reap it when you're sure nothing else needs
43+
it:
44+
45+
```bash
46+
rm -rf /tmp/cmdr-smb-leases && ./stop.sh # clear all leases, then down
47+
# or just confirm the state first:
48+
(cd ../../../../scripts/check && go run ./smb-lease status)
49+
```
50+
51+
`contention-check.sh` in this directory is the repeatable acceptance test for the whole mechanism: a dummy holder must
52+
survive another session's full acquire→run→release cycle, and the stack must down only at zero holders.
53+
54+
### Decision: holder-id leases + adopt-or-start + lock-held teardown
55+
56+
**Why a lease at all.** All worktrees resolve the same `smb-consumer` project on the same fixed host ports, so any one
57+
session's raw `compose down` (from `stop.sh`, the orchestrator's `Stop`, or `e2e-linux.sh`'s restart path) tore the
58+
shared stack out from under a live suite in another worktree, producing `Cannot reach smb-consumer-X` cascades —
59+
observed repeatedly. A second session's `up` with slightly different config could `--force-recreate` the running
60+
containers mid-run. The lease closes both races.
61+
62+
**Why adopt-or-start, not just `up -d`.** When the stack is already serving the requested config, the helper issues **no
63+
compose call** — it adopts. That's what prevents the recreate-mid-run failure. A blind `up -d` from a second session
64+
could disturb healthy containers; adoption never touches them.
65+
66+
**Why the lock is held across the `down`.** Releasing the flock before the `compose down` reopens the exact teardown
67+
race we're closing: an arriving acquirer would see zero leases, start a fresh `up` while the old `down` is mid-flight,
68+
and get half-torn-down containers. Acquire → re-verify zero → down → release all happen inside one held lock.
69+
70+
**Why the `manual` sentinel exists.** A naive `<self-pid>` lease breaks every standalone caller: `start.sh` exits
71+
seconds after its `up`, so its PID is dead by the next acquire and the dead-PID sweep reaps it, downing the stack under
72+
a live session. The non-numeric `manual` holder-id is never swept, so a forgotten `manual` lease lingers — the
73+
**benign** direction (a human reaps it with `stop.sh`), never a teardown under a live run. The whole design degrades to
74+
"leave it UP" on any doubt, never to "tear it down."
75+
76+
See [`scripts/check/smblease`](../../../../scripts/check/smblease/smblease.go) for the full lock/lease/policy model.

docs/tooling/testing.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -96,6 +96,14 @@ entirely (mount requires permissions a headless run can't grant); Linux uses GVF
9696
tests have a known GVFS race in Docker (the `UDisks2VolumeMonitor` warning, see `gio mount` failures); they flake
9797
~10-20% of the time. Treated as a pre-existing environmental issue, not the test's fault.
9898

99+
**The stack is shared machine-wide.** Concurrent SMB-touching runs across git worktrees (two `check.sh` invocations, or
100+
a `check.sh` plus a manual `start.sh`) now coexist: every bring-up and teardown routes through a Go lease helper
101+
(`scripts/check/smblease`) that refcounts holders and downs the stack only when the last one leaves. So a sibling
102+
worktree's teardown no longer kills your live suite. If a leaked lease keeps the stack up after everything's idle, check
103+
state with `(cd scripts/check && go run ./smb-lease status)` and force it down with
104+
`rm -rf /tmp/cmdr-smb-leases && apps/desktop/test/smb-servers/stop.sh`. See `apps/desktop/test/smb-servers/README.md` §
105+
"Shared stack across worktrees" for the full model.
106+
99107
### MCP servers (for ad-hoc exploration during test writing)
100108

101109
When the dev server is running (`pnpm dev` at repo root):

scripts/check/CLAUDE.md

Lines changed: 35 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -109,7 +109,9 @@ dot | dot -Tpng -o checks.png`).
109109
| `stats.go` | CSV stats logging (`logCheckStats`): appends one row per check to `~/cmdr-check-log.csv` |
110110
| `colors.go` | ANSI color constants |
111111
| `utils.go` | `findRootDir()` (walks up until `apps/desktop/src-tauri/Cargo.toml` is found) |
112-
| `smb_orchestrator.go` | Runner-level SMB Docker lifecycle (start once at runner init, stop at exit) |
112+
| `smb_orchestrator.go` | Runner-level SMB Docker lifecycle: acquires a machine-wide lease (via `smblease`) at init, releases at exit |
113+
| `smblease/` | Library: the machine-wide flock + holder-id refcount that makes the shared `smb-consumer` stack safe across worktrees |
114+
| `smb-lease/` | Thin `package main` CLI onto `smblease` (`acquire`/`release`/`reconcile`/`status`) that the bash scripts shell out to |
113115
| `freestyle.go` | All freestyle.sh remote-VM execution logic, including `preferFreestyleRun` |
114116
| `checks/` | One file per check, plus `common.go` (shared utils) and `registry.go` (the `AllChecks` ordered list) |
115117

@@ -218,19 +220,27 @@ when deps haven't changed. A marker file (`node_modules/.pnpm-install-marker`) s
218220
each successful install. On the next run, if the mtime matches, install is skipped. The marker lives inside
219221
`node_modules/` so it's automatically invalidated if `node_modules` is deleted. Always runs in CI (`--ci`).
220222

221-
**Decision**: SMB Docker container lifecycle is owned by a runner-level orchestrator, not per-check. **Why**: Multiple
222-
checks (`desktop-rust-integration-tests`, `desktop-e2e-linux`) need the shared `smb-consumer` Docker Compose project.
223-
Before, each owned the lifecycle: start in entry, `defer ./stop.sh` in cleanup. When both ran in parallel under
224-
`--include-slow`, whichever finished first would tear down containers the other was still using, producing
225-
`Cannot reach smb-consumer-X` cascades. `SmbOrchestrator` (`scripts/check/smb_orchestrator.go`) lifts lifecycle one
226-
level up: at runner init, after `selectChecks()` resolves the planned set, the orchestrator brings up the union of
227-
`NeedsSmb` modes (`SmbModeCore` for integration tests, `SmbModeE2E` for e2e). At runner exit (normal, `--fail-fast`, or
228-
SIGINT) it tears down once. Checks marked `NeedsSmb` no longer manage their own lifecycle: they assume the containers
229-
are up and call `waitForSmbContainers` as a cheap mid-run zombie-guard. The smaller scripts (`start.sh`,
230-
`e2e-linux.sh::start_smb_containers`) keep working standalone for `pnpm test:e2e:linux` invocations outside the check
231-
runner; under check.sh their start.sh invocation just sees the orchestrator's containers already running and probes are
232-
idempotent. The SIGINT handler in `main.go` captures the orchestrator via shared variable so a Ctrl+C also triggers
233-
`./stop.sh` with a banner before exiting 130.
223+
**Decision**: SMB Docker container lifecycle is owned by a runner-level orchestrator that holds a machine-wide lease,
224+
not per-check and not per-process. **Why**: Multiple checks (`desktop-rust-integration-tests`, `desktop-e2e-linux`) need
225+
the shared `smb-consumer` Docker Compose project. Two layers of contention had to be solved:
226+
227+
- _Intra-process_: each check used to own the lifecycle (start in entry, `defer ./stop.sh` in cleanup); two in one run
228+
raced each other. `SmbOrchestrator` (`scripts/check/smb_orchestrator.go`) lifts lifecycle one level up — at runner
229+
init, after `selectChecks()` resolves the planned set, it brings up the union of `NeedsSmb` modes (`SmbModeCore` for
230+
integration tests, `SmbModeE2E` for e2e) once, and tears down once at runner exit. Checks marked `NeedsSmb` assume the
231+
containers are up and call `waitForSmbContainers` as a cheap mid-run zombie-guard.
232+
- _Cross-process / cross-worktree_: two `check.sh` runs (or a `check.sh` plus a manual `start.sh`) in different
233+
worktrees have independent orchestrators, so the in-process map can't stop them racing the same containers. The
234+
orchestrator therefore takes a **machine-wide lease** via the `smblease` library (holder-id = its own `check.sh` PID).
235+
`EnsureStarted` calls `smblease.Acquire` (adopt-or-reconcile under a flock); `Stop` calls `smblease.Release` (down
236+
only at zero holders, lock held across the down). The orchestrator imports the lib in-process — no subprocess —
237+
because it's already Go in the same module.
238+
239+
The standalone scripts (`start.sh`, `e2e-linux.sh::start_smb_containers`) take their **own** leases (`manual` for
240+
`start.sh`, `$$` for `e2e-linux.sh`), so a manual run alongside a `check.sh` run just registers as a second holder and
241+
neither tears the other's stack down. The SIGINT handler in `main.go` captures the orchestrator via shared variable so a
242+
Ctrl+C also releases the lease (with a banner) before exiting 130. See [`smblease/smblease.go`](smblease/smblease.go)
243+
for the lock/lease/policy model.
234244

235245
**Decision**: cmdr's SMB stack binds a dedicated host-port range (11480+), not smb2's default (10480+). **Why**: cmdr
236246
runs a _vendored copy_ of smb2's `consumer` compose under its own project name (`smb-consumer`), while smb2's own test
@@ -305,33 +315,17 @@ via `[settings] disable_tools = ["pnpm"]` in `/root/.config/mise/config.toml`.
305315
**`--only-slow` needs ~20 min timeout.** Slow checks (E2E tests, `rust-tests-linux`) take significantly longer than the
306316
default checks. When running `--only-slow` via an agent or CI, set the timeout to at least 20 minutes (1,200,000 ms).
307317

308-
**Never run two `./scripts/check.sh` invocations concurrently if either touches SMB.** The `SmbOrchestrator` is scoped
309-
to one runner process: it starts the `smb-consumer` Docker Compose project at runner init and tears it down at runner
310-
exit. Two parallel invocations get two orchestrators racing the same containers. The first to finish runs `./stop.sh`
311-
while the other is still mid-test, producing `Cannot reach smb-consumer-X` cascades. Symptom: a previously green check
312-
(typically `desktop-e2e-linux` or `desktop-rust-integration-tests`) starts failing several SMB tests with 30 s timeouts
313-
in the second-to-finish run.
314-
315-
The right way to run two SMB-touching checks together is one invocation with multiple `--check` flags so the same
316-
orchestrator owns the containers, or sequentially. For example:
317-
318-
```sh
319-
# Good: one orchestrator, shared SMB stack
320-
./scripts/check.sh --check desktop-e2e-linux --check desktop-e2e-playwright
321-
322-
# Also fine: sequential
323-
./scripts/check.sh --check desktop-e2e-linux
324-
./scripts/check.sh --check desktop-e2e-playwright
325-
326-
# Wrong: two orchestrators racing
327-
./scripts/check.sh --check desktop-e2e-linux &
328-
./scripts/check.sh --check desktop-e2e-playwright &
329-
```
330-
331-
Same applies to running a check.sh invocation alongside a raw `pnpm test:e2e:linux` or
332-
`apps/desktop/test/smb-servers/start.sh` in another terminal — only one process should own the SMB stack at a time. The
333-
`e2e-linux.sh` and `start.sh` scripts are safe to run standalone when no `check.sh` is also running, but they don't
334-
coordinate with each other across processes.
318+
**Concurrent SMB-touching runs across worktrees now coexist.** Two `./scripts/check.sh` invocations in different
319+
worktrees (or a `check.sh` alongside a manual `start.sh` / `pnpm test:e2e:linux`) each take a machine-wide `smblease`
320+
lease and share the same `smb-consumer` stack. Whichever finishes first releases its lease but sees a non-zero refcount,
321+
so it does **not** down the stack — the other run keeps serving. The stack downs only when the last holder leaves. The
322+
old `Cannot reach smb-consumer-X` cascade (one run's teardown killing another's mid-test) is the exact failure the lease
323+
closes.
324+
325+
A leaked or lingering stack (a forgotten manual `start.sh`, or a numeric holder whose PID got recycled) is the benign
326+
direction: it stays up until a human reaps it. Check state with `(cd scripts/check && go run ./smb-lease status)`; force
327+
it down with `rm -rf /tmp/cmdr-smb-leases && apps/desktop/test/smb-servers/stop.sh`. See
328+
`apps/desktop/test/smb-servers/README.md` § "Shared stack across worktrees" and `smblease/smblease.go`.
335329

336330
## Dependencies
337331

0 commit comments

Comments
 (0)