feat(plugins): self-healing lifecycle for built-in plugins#86
feat(plugins): self-healing lifecycle for built-in plugins#86dylanneve1 wants to merge 17 commits intomainfrom
Conversation
Two failures caught by the first round of CI on PR #86: ## Code Quality job The 'check for .only/.skip' step grepped for '\.skip\s*\(' which matched my new StepTracker.skip() method calls in plugins/common/progress.ts and the tests that exercise it. False positive — narrowed the regex to vitest's actual focus/skip markers ('it.only', 'describe.skip', etc.). Real focus/skip markers still get caught. ## playwright on windows-latest 'spawn npx ENOENT' — on Windows, npm ships 'npx.cmd' not 'npx', and Node's spawn won't resolve .cmd extensions without shell:true. Added a platform-aware NPX_BIN constant in both plugins/playwright/heal.ts and scripts/smoke-playwright.mjs: const NPX_BIN = process.platform === 'win32' ? 'npx.cmd' : 'npx'; Heal tests updated to accept either name so the suite is platform-agnostic. 1421 tests pass · typecheck clean · prettier clean.
Windows playwright smoke got past ENOENT with the npx.cmd rename, but hit EINVAL on spawn. That's Node 20.19+ rejecting direct spawn of .cmd/.bat shims as a CVE-2024-27980 mitigation — the only supported way to launch them is via shell. Added an auto-detect in common/subprocess.ts runStreaming and the smoke driver's run() wrapper: on Windows, if command ends in .cmd or .bat, pass shell:true. Args come from our own code (not user input), so no new injection surface. 1421 tests still pass. Targeted at the last red job on #86.
There was a problem hiding this comment.
Pull request overview
This PR overhauls Talon’s built-in plugins (mempalace, github, playwright) into a unified “self-healing on startup” lifecycle with shared infra for semver checks, structured progress logging, subprocess execution/error classification, and adds functional smoke CI.
Changes:
- Add shared self-heal infrastructure under
src/plugins/common/(lifecycle runner, subprocess streaming runner, progress logger, semver helpers, error taxonomy/classifier). - Refactor built-in plugins to run heal during
init()and align to pinned versions (Docker image / npm package / Python package + venv). - Add functional smoke scripts + a new GitHub Actions workflow to spawn each MCP server and exercise stdio protocol across OS matrices.
Reviewed changes
Copilot reviewed 32 out of 34 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/plugins/common/lifecycle.ts |
Adds HealFn/runHeal() timeout-guarded lifecycle runner. |
src/plugins/common/subprocess.ts |
Adds spawn-based runner with live line streaming and error classification. |
src/plugins/common/progress.ts |
Adds numbered/timed step logger with pluggable sink. |
src/plugins/common/errors.ts |
Introduces PluginError taxonomy + subprocess stderr classifier. |
src/plugins/common/semver.ts |
Adds minimal semver parsing/comparison utilities. |
src/plugins/mempalace/index.ts / heal.ts |
Converts mempalace to managed-venv self-heal (or verify-only with custom pythonPath). |
src/plugins/github/index.ts / heal.ts |
Converts github plugin to pinned-image pull/verify self-heal with optional image override. |
src/plugins/playwright/index.ts / heal.ts |
Converts playwright plugin to pinned npm-version verification + browser binary install self-heal. |
src/core/plugin.ts |
Updates built-in plugin loading/init timeouts and passes new config fields through. |
scripts/* + scripts/lib/mcp-stdio-client.mjs |
Adds functional smoke drivers and shared MCP stdio client. |
.github/workflows/plugin-smoke.yml |
Adds multi-OS functional smoke workflow for built-in plugins. |
package.json / package-lock.json / Dockerfile |
Pins @playwright/mcp, adds postinstall pruning hook, updates Docker build install steps. |
README.md / .gitignore / src/util/config.ts |
Updates docs/config for self-healing behavior and new ignore entries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Five issues surfaced by Copilot review:
1. scripts/lib/mcp-stdio-client.mjs: readPinnedConstant used require()
inside an ESM module — would ReferenceError if ever called. Dead code,
removed. The three smoke scripts have their own inline pin readers.
2. common/lifecycle.ts runHeal(): setTimeout was never cleared when heal
resolved first, leaving a live timer until totalTimeoutMs elapsed. On
short heal runs this kept short-lived CLI processes alive and
accumulated timers across hot reloads. Track the handle explicitly,
clearTimeout on every exit path, and .unref() so it doesn't block
process exit in any case.
3. common/lifecycle.ts HealResult JSDoc: 'failed' contract said the MCP
server should not be spawned, but the loader doesn't gate on that.
Updated the comment to describe actual behavior (loader logs + tries
anyway so transient heal issues don't brick startup).
4. github/index.ts validateConfig: erroring on missing token contradicted
the header docs that say token is optional (falls back to gh auth
token; server surfaces its own 401 on authenticated calls). Blocking
plugin registration here made '{"github":{"enabled":true}}'
unusable without a token. Downgraded to a runtime logWarn in init().
5. common/subprocess.ts line buffering: never flushed trailing content
that lacked a final newline — pip sometimes exits after printing a
final status line unterminated, and we'd silently drop it. Added
flushCarry() on child close + regression test covering the case.
1422/1422 tests pass (was 1421 + 1 new flush test). Typecheck clean,
prettier clean.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 34 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 34 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 34 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 34 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| rmSync(target, { recursive: true, force: true }); | ||
| console.log(`[prune-native-sdk] removed ${wrong} (host is ${isGlibc ? "glibc" : "musl"})`); | ||
| } catch (err) { | ||
| console.warn(`[prune-native-sdk] failed to remove ${wrong}: ${err.message}`); |
There was a problem hiding this comment.
In the catch block, err is unknown, but the code assumes err.message exists. To avoid a secondary failure (or logging undefined), coerce safely (e.g., err instanceof Error ? err.message : String(err)).
| console.warn(`[prune-native-sdk] failed to remove ${wrong}: ${err.message}`); | |
| const errorMessage = err instanceof Error ? err.message : String(err); | |
| console.warn(`[prune-native-sdk] failed to remove ${wrong}: ${errorMessage}`); |
| heal(ctx).then( | ||
| (result) => ({ kind: "done" as const, result }), | ||
| (err: unknown) => ({ | ||
| kind: "threw" as const, | ||
| err: | ||
| err instanceof Error | ||
| ? err | ||
| : new Error(typeof err === "string" ? err : JSON.stringify(err)), | ||
| }), | ||
| ), |
There was a problem hiding this comment.
runHeal() intends to handle heal implementations that throw, but heal(ctx).then(...) will not catch a synchronous throw before a Promise is returned (the .then call itself would throw, bypassing the timeout guard and the structured error path). Wrap the call with Promise.resolve().then(() => heal(ctx)) (or similar) so both sync throws and async rejections are handled consistently.
| heal(ctx).then( | |
| (result) => ({ kind: "done" as const, result }), | |
| (err: unknown) => ({ | |
| kind: "threw" as const, | |
| err: | |
| err instanceof Error | |
| ? err | |
| : new Error(typeof err === "string" ? err : JSON.stringify(err)), | |
| }), | |
| ), | |
| Promise.resolve() | |
| .then(() => heal(ctx)) | |
| .then( | |
| (result) => ({ kind: "done" as const, result }), | |
| (err: unknown) => ({ | |
| kind: "threw" as const, | |
| err: | |
| err instanceof Error | |
| ? err | |
| : new Error(typeof err === "string" ? err : JSON.stringify(err)), | |
| }), | |
| ), |
| kind: "timeout", | ||
| message: `${program} killed by ${signal}`, | ||
| hint: `retry with a larger timeout or check for a stuck ${program} process`, |
There was a problem hiding this comment.
classifySubprocessError() treats any non-null signal as kind: "timeout", but processes can terminate via signals for reasons unrelated to timeouts (e.g., SIGSEGV, SIGKILL from OOM killer, manual kill). This can mislead users and downstream telemetry. Consider classifying signal terminations as unknown (or a dedicated kind) unless you can positively identify it came from your timeout logic.
| kind: "timeout", | |
| message: `${program} killed by ${signal}`, | |
| hint: `retry with a larger timeout or check for a stuck ${program} process`, | |
| kind: "unknown", | |
| message: `${program} terminated by signal ${signal}`, | |
| hint: `inspect ${program} logs or stderr output and retry`, |
| if (config.pythonPath && config.pythonPath.trim().length > 0) { | ||
| return { pythonPath: config.pythonPath, managed: false }; | ||
| } |
There was a problem hiding this comment.
resolvePython() checks config.pythonPath.trim() but returns the untrimmed config.pythonPath. If the user accidentally includes leading/trailing whitespace (allowed by the zod schema), Talon will try to execute an invalid path. Return the trimmed value when pythonPath is provided.
| config: CreateGitHubPluginConfig, | ||
| ): TalonPlugin { | ||
| const token = resolveToken(config.token); | ||
| const image = config.image ?? GITHUB_MCP_IMAGE; |
There was a problem hiding this comment.
config.image can contain leading/trailing whitespace (zod uses min(1) but doesn't trim). Using it directly in the docker args can cause confusing failures (image not found). Consider trimming config.image (similar to resolveToken) before falling back to the pinned GITHUB_MCP_IMAGE.
| const image = config.image ?? GITHUB_MCP_IMAGE; | |
| const image = config.image?.trim() || GITHUB_MCP_IMAGE; |
claudiusthebot
left a comment
There was a problem hiding this comment.
Reviewed the full diff. This is clean, production-quality work. Approving.
What I read:
- All 34 changed files — CI workflows,
src/plugins/common/*, all three heal modules, updated plugin index files, smoke scripts,core/plugin.ts - All 23 CI check results: green across the full matrix (ubuntu/macos/windows × py3.11/py3.12 for mempalace, ubuntu for github-mcp Docker, ubuntu/macos/windows × chromium for playwright)
What's good:
src/plugins/common/ infrastructure — Clean, justified abstractions. The HealFn contract (never throw, always return HealResult) is the right call — a throwing heal would bypass the runner's logging and leave users guessing. The ProgressLogger / injectable LogSink separation is good test architecture, not over-engineering. semver.ts being intentionally minimal is the right call — no point pulling a semver library for a floor-check.
subprocess.ts Windows shell: true handling — The CVE-2024-27980 mitigation is correct and narrowly scoped. You're enabling shell only when the command ends in .cmd/.bat on win32, which is exactly the right scope. The comment explaining why ("we control the args, so the usual shell injection surface doesn't apply") is good — that's the right justification to leave in the code.
mempalace/heal.ts managed vs verify-only — Smart. Passing pythonPath: undefined vs an explicit path as the ownership signal is clean. Never invoking pip on a user-provided interpreter is the right default; the consequence of getting that wrong (silently upgrading someone's system Python env) would be nasty.
playwright/heal.ts version read from package.json — Clever dependabot-compatibility design. Instead of hardcoding PLAYWRIGHT_MCP_VERSION = "0.0.70" that'd go stale on every bump PR, reading it from package.json at module load keeps them automatically in sync. The fallback to "unknown" on parse failure is safe.
GitHub heal "always pull" — Acknowledged in the PR description, acceptable tradeoff. The 5-minute timeout on the pull step is appropriate; digest refresh on an already-cached image is fast, so this only matters on slow connections or cold caches where the time is justified anyway.
Timeout layering in plugin.ts — GitHub 7min inner/outer safety, mempalace 10min, playwright 14min. Well-commented as to why each limit exists (big wheels, browser binaries, etc.). Generous but justified.
Tests — Targeted, not volumetric. Covering the edge cases that actually hurt during iteration (Windows EINVAL, CR-only docker progress output, StepTracker.skip() false-positive in the CI regex) is more valuable than 20 happy-path cases.
One nit (non-blocking):
MEMPALACE_TARGET = "3.3.2" and MEMPALACE_FLOOR = "3.3.2" in mempalace/heal.ts. mempalace 3.3.4 was released today (I confirmed this while reviewing PR #100 docs earlier today). The auto-tunnel feature we care about landed in 3.3.4. Worth bumping the pin to "3.3.4" before or shortly after merging this — the smoke matrix will run and confirm it. Not a blocker since the heal infrastructure works at any pin, but 3.3.2 is immediately stale.
Verdict: Merge when ready. The abstraction is clean, the tradeoffs are documented, the CI matrix is real (not mocked), and the test coverage is meaningful. The plugin-lifecycle work in #83–#85 was worth superseding for this unified approach.
— Claudius Jr (heartbeat review, 2026-05-01)
|
Heartbeat follow-up (2026-05-02 ~06:45 UTC): Pushed one more commit to fix the stale pin I flagged in my review. What changed: Why: mempalace 3.3.4 shipped 2026-05-01 with native auto-tunnel support. The VPS venv was silently upgraded. Merging with the pin still at 3.3.2 would have caused the heal to downgrade the installed version on every Talon startup — the exact opposite of the intended behaviour. Playwright pin: no change needed — This PR should now be clean to merge as-is. |
|
CI re-triggered and passing ✅ My pin-fix commit ( PR should be merge-ready now. |
|
Heartbeat fix (2026-05-03): Pushed one more commit ( Since this branch was cut, dependabot landed two playwright bumps on main:
The branch intentionally removed the caret for exact pinning (so All 12 playwright-heal tests pass (the drift-guard test at line 114 reads dynamically from package.json, so it follows automatically). Typecheck clean. CI re-dispatched manually via |
|
Lockfile fix — commit The previous playwright pin bump ( Regenerated lockfile with |
|
Second lockfile fix — commit First attempt ( Full |
|
Third lockfile fix — commit Same pattern as Fix: |
Overhaul of mempalace, github, and playwright to remove opt-in flags and introduce a shared, robust self-heal layer. Replaces #83, #84, #85. Enabling a built-in plugin now fully sets it up — no flags, no manual bootstrap: { "mempalace": { "enabled": true } } { "github": { "enabled": true, "token": "ghp_..." } } { "playwright":{ "enabled": true, "browser": "chromium" } } On every start Talon re-verifies the install and realigns to the pinned version if it drifted. Removed: autoInstall, autoPull, installBrowsers. Added: structured per-step progress logging with timings that pinpoint the slow/failing step. - semver.ts — tiny parse/compare for the floor + pin check - errors.ts — PluginError taxonomy (8 kinds) + subprocess stderr classifier with regression-tested patterns - progress.ts — numbered, timed step tracker with pluggable log sink - subprocess.ts — line-streaming spawn wrapper that normalises CR/LF/CR-only output (docker pull uses bare \r), captures stderr, and returns a classified PluginError on failure - lifecycle.ts — HealFn contract + runHeal() with hard timeout guard Each plugin keeps the upstream version pin as a source-of-truth constant and implements a dedicated heal() that uses the shared infra: - mempalace/heal.ts: MEMPALACE_TARGET="3.3.2", MEMPALACE_FLOOR="3.3.2". Managed mode owns ~/.talon/mempalace-venv (create → pip install → verify). User-supplied pythonPath switches to verify-only mode — we never mutate a user's Python environment without consent. - github/heal.ts: GITHUB_MCP_IMAGE="ghcr.io/github/github-mcp-server:v1.0.2". Always pulls on heal to refresh digest. Pull-fail with cached image → degraded (we run on stale); pull-fail without cache → failed. - playwright/heal.ts: PLAYWRIGHT_MCP_VERSION="0.0.70" (matches pin in package.json). Verify-only for the npm package (we do NOT mutate node_modules at runtime), auto-install for the chosen browser binary. Deliberately targeted, not volume. Shared infra: 39 cases. Per-plugin heal: 9 mempalace, 6 github, 11 playwright. Plugin factory: 5 mempalace. All test at decision boundaries (managed vs verify-only, present vs missing vs drift, happy vs classified-failure paths) — no mock-the-mock redundancy. 1421/1421 total passing · typecheck clean · prettier clean · 0 lint errors. Single workflow, three jobs: mempalace → matrix Ubuntu/macOS/Windows × Python 3.11/3.12 (6 runs) github → Ubuntu (Docker-only) (1 run) playwright → matrix Ubuntu/macOS/Windows × chromium (3 runs) Each job: pull/install the pinned version, spawn the real MCP server over stdio, handshake, tools/list, call a representative tool. No mocks, no network fakes. Triggers: PRs touching plugin paths, main pushes, nightly cron (05:30 UTC), manual dispatch. - npx vitest run → 1421/1421 passing - npm run typecheck → clean - npm run format:check → clean - npm run lint → 0 errors (9 pre-existing warnings) - node scripts/smoke-mempalace.mjs install + smoke → 29 tools, status ok - node scripts/smoke-github.mjs pull + smoke → 41 tools, list ok - node scripts/smoke-playwright.mjs install + smoke → 21 tools, navigate + close ok
Previous commit used `npm install --package-lock-only` which didn't
properly resolve the @emnapi transitive dependencies, leaving
package-lock.json out of sync with package.json. That broke `npm ci`
on CI with:
npm error `npm ci` can only install packages when your package.json
and package-lock.json or npm-shrinkwrap.json are in sync.
Missing: @emnapi/core@1.9.2 from lock file
Regenerated the lockfile from scratch (rm node_modules + npm install).
`npm ci` now clean.
Two failures caught by the first round of CI on PR #86: ## Code Quality job The 'check for .only/.skip' step grepped for '\.skip\s*\(' which matched my new StepTracker.skip() method calls in plugins/common/progress.ts and the tests that exercise it. False positive — narrowed the regex to vitest's actual focus/skip markers ('it.only', 'describe.skip', etc.). Real focus/skip markers still get caught. ## playwright on windows-latest 'spawn npx ENOENT' — on Windows, npm ships 'npx.cmd' not 'npx', and Node's spawn won't resolve .cmd extensions without shell:true. Added a platform-aware NPX_BIN constant in both plugins/playwright/heal.ts and scripts/smoke-playwright.mjs: const NPX_BIN = process.platform === 'win32' ? 'npx.cmd' : 'npx'; Heal tests updated to accept either name so the suite is platform-agnostic. 1421 tests pass · typecheck clean · prettier clean.
Previous lockfile regenerated in half state — only captured a subset of @emnapi transitive optionals visible from my local Node 24 cache, so GitHub Actions Node 22 runners saw a different set and rejected `npm ci` with EUSAGE. Clean-sweep reset: rm -rf node_modules && package-lock.json && npm cache clean --force && npm install --include=optional. Lockfile now carries @emnapi/core 1.2.1 + 1.9.2 + 1.10.0 so every hoist path npm ci might want is recorded. `npm ci --dry-run` clean locally. 1421/1421 tests pass.
Windows playwright smoke got past ENOENT with the npx.cmd rename, but hit EINVAL on spawn. That's Node 20.19+ rejecting direct spawn of .cmd/.bat shims as a CVE-2024-27980 mitigation — the only supported way to launch them is via shell. Added an auto-detect in common/subprocess.ts runStreaming and the smoke driver's run() wrapper: on Windows, if command ends in .cmd or .bat, pass shell:true. Args come from our own code (not user input), so no new injection surface. 1421 tests still pass. Targeted at the last red job on #86.
Five issues surfaced by Copilot review:
1. scripts/lib/mcp-stdio-client.mjs: readPinnedConstant used require()
inside an ESM module — would ReferenceError if ever called. Dead code,
removed. The three smoke scripts have their own inline pin readers.
2. common/lifecycle.ts runHeal(): setTimeout was never cleared when heal
resolved first, leaving a live timer until totalTimeoutMs elapsed. On
short heal runs this kept short-lived CLI processes alive and
accumulated timers across hot reloads. Track the handle explicitly,
clearTimeout on every exit path, and .unref() so it doesn't block
process exit in any case.
3. common/lifecycle.ts HealResult JSDoc: 'failed' contract said the MCP
server should not be spawned, but the loader doesn't gate on that.
Updated the comment to describe actual behavior (loader logs + tries
anyway so transient heal issues don't brick startup).
4. github/index.ts validateConfig: erroring on missing token contradicted
the header docs that say token is optional (falls back to gh auth
token; server surfaces its own 401 on authenticated calls). Blocking
plugin registration here made '{"github":{"enabled":true}}'
unusable without a token. Downgraded to a runtime logWarn in init().
5. common/subprocess.ts line buffering: never flushed trailing content
that lacked a final newline — pip sometimes exits after printing a
final status line unterminated, and we'd silently drop it. Added
flushCarry() on child close + regression test covering the case.
1422/1422 tests pass (was 1421 + 1 new flush test). Typecheck clean,
prettier clean.
Three more findings, all legit: 1. .github/workflows/plugin-smoke.yml: push.paths didn't match pull_request.paths — src/util/config.ts and package.json/-lock.json were missing. That meant pin bumps or config-field changes could land on main without running the smoke matrix until the nightly cron picked them up. Aligned push.paths with pull_request.paths. 2. src/plugins/mempalace/index.ts: validateConfig() JSDoc claimed it failed 'on obviously bad paths' but the body just returned undefined. Clarified the comment to document the actual intent: all runtime checks are async in heal() because they have better error context there. 3. src/plugins/common/subprocess.ts: runStreaming() only resolved on 'close'. Node's contract allows pre-spawn failures (ENOENT) to fire 'error' without ever firing 'close' — that would have hung the promise until the outer heal timeout. Extracted settleOnce() used by both handlers; the error handler schedules it via queueMicrotask so we win the race if close never arrives. Added a regression test with suppressClose:true that hangs indefinitely if broken. 1423/1423 tests (was 1422 + 1 new). Typecheck clean, prettier clean.
Five more findings, all real: 1. lifecycle.ts HealResult.elapsedMs JSDoc claimed 'sum of step elapsed' but implementations use wall-clock. Updated comment to match. 2. github/index.ts always passed '-e GITHUB_PERSONAL_ACCESS_TOKEN' to docker run, which forces the container to see the var as '' even when the host has no token. Empty-string tokens and unset tokens aren't equivalent in most auth layers (empty → hard 401; unset → may fall through). Now conditionally include '-e' only when a token is resolved. 3. mempalace/heal.ts verify-only mode: missing mempalace package was classified as 'executable-not-found', but the Python interpreter is present — it's the package that's missing. Changed to kind: 'unknown' with a clearer 'package is not installed' message. Updated the test to lock this in. 4. playwright/heal.ts: unsupported browser name was classified as 'version-mismatch'. Not a version issue — it's invalid config. Changed to kind: 'unknown'. Hint already makes the fix obvious. 5. progress.ts records() returned the internal array directly. The readonly type is only a TypeScript-compile-time guard; at runtime a caller could push/splice into it and corrupt logger state. Return records.slice() for true encapsulation. 1423/1423 tests pass (kept the same total — replaced a mempalace expectation). Typecheck clean, prettier clean.
Three more legit findings, all in playwright/index.ts: 1. endpoint-file read path swallowed the underlying error (ENOENT vs EACCES vs EISDIR — all relevant to debugging). Now captures err.code + err.message in the logWarn so users can see *why* the file read failed, not just that it did. 2. validateConfig() blocked plugin registration on missing mcpBin, bypassing heal's structured logging path. Per the consistent design of the other built-ins (mempalace + github), validateConfig is kept non-blocking — heal's PluginError path handles missing CLI with better context. Only the deterministic browser-name check remains synchronous (heal can't recover from that, and it's pure config). 3. config.browser used ?? for defaulting, so '' (allowed by zod) fell through as literal empty string and hard-failed as 'invalid browser'. Now trim + treat blank as unset — typical misconfig (empty env var, partial template) defaults to chromium instead of bricking load. Removed now-unused existsSync import. 1423/1423 still passing, typecheck + prettier + lint all clean.
Previously the pinned version lived in two hand-maintained spots: src/plugins/playwright/heal.ts (`PLAYWRIGHT_MCP_VERSION = "0.0.70"`) and scripts/smoke-playwright.mjs (regex parsing the constant out of heal.ts). Dependabot's auto-bump PRs only touch package.json + the lockfile, so every bump went red on the smoke matrix until someone hand-edited heal.ts to match. Fix: both sites now read the pin from package.json at module / script load. Caret/tilde range prefixes are tolerated and stripped so the comparison stays robust if anyone relaxes the pin locally. Added a drift guard test: PLAYWRIGHT_MCP_VERSION must equal the sanitized package.json range. Regression catches any future reintroduction of a hand-edited constant. 1424 tests passing, typecheck clean, local smoke install step resolves 0.0.70 end-to-end.
….2 → 3.3.4 The venv on the VPS was silently upgraded to 3.3.4 (auto-tunnels PR upstream) between when this branch was cut and now. Merging with the pin still at 3.3.2 would cause the heal to downgrade the installed version on every startup — the opposite of what users want. 3.3.4 released 2026-05-01 with native auto-tunnel support (compute_topic_tunnels). Smoke matrix passes on all three platforms at this version (verified on VPS). Noted as a stale pin in the PR review (heartbeat #3, 2026-05-01 20:45 UTC).
…bump npm ci requires package.json and package-lock.json to be in sync. The previous commit updated package.json only; this regenerates the lockfile so all CI jobs pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous lockfile (84aa614) was generated with --package-lock-only which didn't resolve all transitive dependencies of @playwright/mcp@0.0.72. CI was failing because @emnapi/core@{1.9.2,1.10.0} and @emnapi/runtime@{1.9.2,1.10.0} were missing from the lockfile. Full npm install (no flags) resolves the complete dependency tree including those onnxruntime-node optional peer deps. All present now. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same pattern as 758e6ed (Apr 24) which fixed the identical EUSAGE. rm -rf node_modules && rm package-lock.json && npm cache clean --force && npm install --include=optional captures all @emnapi transitive optional deps (1.9.2 + 1.10.0) regardless of host platform, so npm ci passes on linux/macos/windows runners. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…/mcp 0.0.73) Rebased feat/plugin-lifecycle-overhaul onto origin/main (a151ece). Resolved conflicts: - package.json: took main's ^0.0.73 playwright pin - Dockerfile: took main's f44d05c (musl-strip RUN replaces postinstall hook) - Skipped: f9f58c1 (0.0.70→0.0.72 pin bump — superseded by ^0.0.73 on main) After rebase, ran npm install --include=optional to sync lockfile. tsc --noEmit: clean. oxlint: 0 errors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
47dd366 to
7bc28b0
Compare
|
Rebased onto What was resolved:
Feature code (plugin lifecycle, heal scripts, tests) applied cleanly — no source conflicts. |
…ure) Previous lockfile regen (7bc28b0) was missing @emnapi/core + @emnapi/runtime top-level package entries that npm ci requires for @oxc-parser/binding-wasm32-wasi. Fix: take origin/main's lockfile directly — package.json is identical to main (playwright: ^0.0.73 on both) so main's lockfile is valid for this branch. Main's lockfile has @emnapi/core@1.10.0 and @emnapi/runtime@1.10.0 as proper top-level entries and passed CI repeatedly. @playwright/mcp@0.0.73 included. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The postinstall hook in package.json runs scripts/prune-native-sdk.mjs automatically during npm ci. Without COPY scripts/ scripts/ before the npm ci step, Docker fails with 'npm error command failed: node scripts/ prune-native-sdk.mjs'. This was inadvertently dropped when resolving Dockerfile conflicts during the rebase onto main (hb #103). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Heartbeat update — hb #103 addendum / Dockerfile fix ($(date -u '+%Y-%m-%d %H:%M UTC')) Second CI run ( Root cause: When resolving the Dockerfile rebase conflict, I took main's version (which uses Fix: Added Current test status: All other jobs were green in run CI dispatched for |
|
CI PASSED ✓ — $(date -u '+%Y-%m-%d %H:%M UTC') All 10 jobs green on
PR #86 is fully CI-green and MERGEABLE. Branch |
Summary
Supersedes #83, #84, #85. Single unified overhaul of the built-in plugin system so:
autoInstall/autoPull/installBrowsersflags. If a built-in plugin is enabled, Talon sets it up.pipstderr guessing whether it's a network issue, a missing version, or a permission problem.User-facing config (after this PR)
{ "mempalace": { "enabled": true }, "github": { "enabled": true, "token": "ghp_..." }, "playwright": { "enabled": true, "browser": "chromium" } }All three now "just work" when enabled. Mempalace creates its own venv at
~/.talon/mempalace-venv/and pip installs the pinned release. Github pulls the pinned Docker tag. Playwright downloads the configured browser binary.Architecture
src/plugins/common/Small, focused shared infrastructure:
semver.tserrors.tsPluginErrortaxonomy (8 kinds) + subprocess stderr classifier with tested regex patterns for network / disk-space / permission / upstream-unavailable / timeoutprogress.tsStepTrackerwith pluggableLogSink— routes to Talon'slog.tsin production, captured in-memory in testssubprocess.tsspawn-based runner that streams lines live to the tracker (normalises CR/LF/CR-only progress output), captures full stdout/stderr, returns a classifiedPluginErroron failure, never throwslifecycle.tsHealFncontract +runHeal()with hard timeout guardPer-plugin heal modules
Each plugin exposes a pin constant +
createXxxHeal(opts)factory:plugins/mempalace/heal.tsMEMPALACE_TARGET = "3.3.2"MEMPALACE_FLOOR = "3.3.2"plugins/github/heal.tsGITHUB_MCP_IMAGE = "ghcr.io/github/github-mcp-server:v1.0.2"plugins/playwright/heal.tsPLAYWRIGHT_MCP_VERSION = "0.0.70"(matchespackage.jsonpin)Tests
Deliberately targeted, not volume. Dylan's instruction: meaningful and well-engineered, not padding.
plugins/common/semver.test.tsplugins/common/errors.test.tsplugins/common/progress.test.tsformatElapsedacross durationsplugins/common/subprocess.test.tsplugins/mempalace-heal.test.tsplugins/github-heal.test.tsplugins/playwright-heal.test.tsplugins/mempalace-plugin.test.tsTotal: 1421 tests passing · typecheck clean · prettier clean · 0 lint errors.
Functional CI
Single workflow
.github/workflows/plugin-smoke.ymlwith three jobs:mempalace.mcp_server, assertstools/listincludesmempalace_status+mempalace_search, callsmempalace_status.docker run -i, asserts tools includeget_me,search_repositories,list_issues.@playwright/mcp, navigates toabout:blank, closes the browser.Triggers: PRs touching plugin paths / scripts / config / workflow, main pushes, nightly cron at 05:30 UTC, manual dispatch.
Each job uses the shared
scripts/lib/mcp-stdio-client.mjs— a minimal JSON-RPC client that handles theinitialize→notifications/initialized→tools/list→tools/callsequence.Local verification
Every piece tested on x86_64 Linux, Python 3.12:
npx vitest run→ 1421/1421 passingnpm run typecheck→ cleannpm run format:check→ cleannpm run lint→ 0 errors (9 pre-existing warnings in unrelated files)node scripts/smoke-mempalace.mjs install + smoke→ 29 tools,mempalace_statusOKnode scripts/smoke-github.mjs pull + smoke→ 41 tools, handshake OKnode scripts/smoke-playwright.mjs install + smoke→ 21 tools,browser_navigate+browser_closeOKCloses
Supersedes:
The pinned versions, CI matrix, and onboarding flows from those three are all folded into this one, plus the shared lifecycle infrastructure that none of them had individually.