@@ -109,7 +109,9 @@ dot | dot -Tpng -o checks.png`).
109109 | `stats.go` | CSV stats logging (`logCheckStats`): appends one row per check to `~/cmdr-check-log.csv` |
110110 | `colors.go` | ANSI color constants |
111111 | `utils.go` | `findRootDir()` (walks up until `apps/desktop/src-tauri/Cargo.toml` is found) |
112- | `smb_orchestrator.go` | Runner-level SMB Docker lifecycle (start once at runner init, stop at exit) |
112+ | `smb_orchestrator.go` | Runner-level SMB Docker lifecycle: acquires a machine-wide lease (via `smblease`) at init, releases at exit |
113+ | `smblease/` | Library: the machine-wide flock + holder-id refcount that makes the shared `smb-consumer` stack safe across worktrees |
114+ | `smb-lease/` | Thin `package main` CLI onto `smblease` (`acquire`/`release`/`reconcile`/`status`) that the bash scripts shell out to |
113115 | `freestyle.go` | All freestyle.sh remote-VM execution logic, including `preferFreestyleRun` |
114116 | `checks/` | One file per check, plus `common.go` (shared utils) and `registry.go` (the `AllChecks` ordered list) |
115117
@@ -218,19 +220,27 @@ when deps haven't changed. A marker file (`node_modules/.pnpm-install-marker`) s
218220 each successful install. On the next run, if the mtime matches, install is skipped. The marker lives inside
219221 `node_modules/` so it's automatically invalidated if `node_modules` is deleted. Always runs in CI (`--ci`).
220222
221- **Decision**: SMB Docker container lifecycle is owned by a runner-level orchestrator, not per-check. **Why**: Multiple
222- checks (`desktop-rust-integration-tests`, `desktop-e2e-linux`) need the shared `smb-consumer` Docker Compose project.
223- Before, each owned the lifecycle: start in entry, `defer ./stop.sh` in cleanup. When both ran in parallel under
224- `--include-slow`, whichever finished first would tear down containers the other was still using, producing
225- `Cannot reach smb-consumer-X` cascades. `SmbOrchestrator` (`scripts/check/smb_orchestrator.go`) lifts lifecycle one
226- level up: at runner init, after `selectChecks()` resolves the planned set, the orchestrator brings up the union of
227- `NeedsSmb` modes (`SmbModeCore` for integration tests, `SmbModeE2E` for e2e). At runner exit (normal, `--fail-fast`, or
228- SIGINT) it tears down once. Checks marked `NeedsSmb` no longer manage their own lifecycle: they assume the containers
229- are up and call `waitForSmbContainers` as a cheap mid-run zombie-guard. The smaller scripts (`start.sh`,
230- `e2e-linux.sh::start_smb_containers`) keep working standalone for `pnpm test:e2e:linux` invocations outside the check
231- runner; under check.sh their start.sh invocation just sees the orchestrator's containers already running and probes are
232- idempotent. The SIGINT handler in `main.go` captures the orchestrator via shared variable so a Ctrl+C also triggers
233- `./stop.sh` with a banner before exiting 130.
223+ **Decision**: SMB Docker container lifecycle is owned by a runner-level orchestrator that holds a machine-wide lease,
224+ not per-check and not per-process. **Why**: Multiple checks (`desktop-rust-integration-tests`, `desktop-e2e-linux`) need
225+ the shared `smb-consumer` Docker Compose project. Two layers of contention had to be solved:
226+
227+ - _Intra-process_: each check used to own the lifecycle (start in entry, `defer ./stop.sh` in cleanup); two in one run
228+ raced each other. `SmbOrchestrator` (`scripts/check/smb_orchestrator.go`) lifts lifecycle one level up: at runner
229+ init, after `selectChecks()` resolves the planned set, it brings up the union of `NeedsSmb` modes (`SmbModeCore` for
230+ integration tests, `SmbModeE2E` for e2e) once, and tears down once at runner exit. Checks marked `NeedsSmb` assume the
231+ containers are up and call `waitForSmbContainers` as a cheap mid-run zombie-guard.
232+ - _Cross-process / cross-worktree_: two `check.sh` runs (or a `check.sh` plus a manual `start.sh`) in different
233+ worktrees have independent orchestrators, so the in-process map can't stop them racing the same containers. The
234+ orchestrator therefore takes a **machine-wide lease** via the `smblease` library (holder-id = its own `check.sh` PID).
235+ `EnsureStarted` calls `smblease.Acquire` (adopt-or-reconcile under a flock); `Stop` calls `smblease.Release` (down
236+ only at zero holders, lock held across the down). The orchestrator imports the lib in-process (no subprocess)
237+ because it's already Go in the same module.
238+
239+ The standalone scripts (`start.sh`, `e2e-linux.sh::start_smb_containers`) take their **own** leases (`manual` for
240+ `start.sh`, `$$` for `e2e-linux.sh`), so a manual run alongside a `check.sh` run just registers as a second holder and
241+ neither tears the other's stack down. The SIGINT handler in `main.go` captures the orchestrator via shared variable so a
242+ Ctrl+C also releases the lease (with a banner) before exiting 130. See [`smblease/smblease.go`](smblease/smblease.go)
243+ for the lock/lease/policy model.
234244
235245 **Decision**: cmdr's SMB stack binds a dedicated host-port range (11480+), not smb2's default (10480+). **Why**: cmdr
236246 runs a _vendored copy_ of smb2's `consumer` compose under its own project name (`smb-consumer`), while smb2's own test
@@ -305,33 +315,17 @@ via `[settings] disable_tools = ["pnpm"]` in `/root/.config/mise/config.toml`.
305315 **`--only-slow` needs ~20 min timeout.** Slow checks (E2E tests, `rust-tests-linux`) take significantly longer than the
306316 default checks. When running `--only-slow` via an agent or CI, set the timeout to at least 20 minutes (1,200,000 ms).
307317
308- **Never run two `./scripts/check.sh` invocations concurrently if either touches SMB.** The `SmbOrchestrator` is scoped
309- to one runner process: it starts the `smb-consumer` Docker Compose project at runner init and tears it down at runner
310- exit. Two parallel invocations get two orchestrators racing the same containers. The first to finish runs `./stop.sh`
311- while the other is still mid-test, producing `Cannot reach smb-consumer-X` cascades. Symptom: a previously green check
312- (typically `desktop-e2e-linux` or `desktop-rust-integration-tests`) starts failing several SMB tests with 30 s timeouts
313- in the second-to-finish run.
314-
315- The right way to run two SMB-touching checks together is one invocation with multiple `--check` flags so the same
316- orchestrator owns the containers, or sequentially. For example:
317-
318- ```sh
319- # Good: one orchestrator, shared SMB stack
320- ./scripts/check.sh --check desktop-e2e-linux --check desktop-e2e-playwright
321-
322- # Also fine: sequential
323- ./scripts/check.sh --check desktop-e2e-linux
324- ./scripts/check.sh --check desktop-e2e-playwright
325-
326- # Wrong: two orchestrators racing
327- ./scripts/check.sh --check desktop-e2e-linux &
328- ./scripts/check.sh --check desktop-e2e-playwright &
329- ```
330-
331- Same applies to running a check.sh invocation alongside a raw `pnpm test:e2e:linux` or
332- `apps/desktop/test/smb-servers/start.sh` in another terminal: only one process should own the SMB stack at a time. The
333- `e2e-linux.sh` and `start.sh` scripts are safe to run standalone when no `check.sh` is also running, but they don't
334- coordinate with each other across processes.
318+ **Concurrent SMB-touching runs across worktrees now coexist.** Two `./scripts/check.sh` invocations in different
319+ worktrees (or a `check.sh` alongside a manual `start.sh` / `pnpm test:e2e:linux`) each take a machine-wide `smblease`
320+ lease and share the same `smb-consumer` stack. Whichever finishes first releases its lease but sees a non-zero refcount,
321+ so it does **not** down the stack; the stack keeps serving the other run. The stack downs only when the last holder
322+ leaves. The old `Cannot reach smb-consumer-X` cascade (one run's teardown killing another's mid-test) is the exact
323+ failure the lease prevents.
324+
325+ A leaked or lingering stack (a forgotten manual `start.sh`, or a numeric holder whose PID got recycled) is the benign
326+ direction: it stays up until a human reaps it. Check state with `(cd scripts/check && go run ./smb-lease status)`; force
327+ it down with `rm -rf /tmp/cmdr-smb-leases && apps/desktop/test/smb-servers/stop.sh`. See
328+ `apps/desktop/test/smb-servers/README.md` § "Shared stack across worktrees" and `smblease/smblease.go`.
335329
336330 ## Dependencies
337331