Skip to content

Reap orphaned servers when their owning editor dies#495

Merged
dsarno merged 5 commits into
mainfrom
server-orphan-reaper
May 30, 2026
Merged

Reap orphaned servers when their owning editor dies#495
dsarno merged 5 commits into
mainfrom
server-orphan-reaper

Conversation

@dsarno

@dsarno dsarno commented May 30, 2026

Copy link
Copy Markdown
Contributor

Problem

The plugin spawns the MCP server as a detached child (OS.create_process) so it survives a plugin reload and can be re-adopted. The only teardown is the editor's clean _exit_tree → stop_server. A crash or hard-kill skips that, leaving the server squatting on ports 8000/9500 until a human or the next session's port reconciliation kills it — which is how a fresh session inherits a stale older-version server (the exact thing seen during the audit-PR smoke).

Fix

Tie the server's life to its owner:

  • When the plugin auto-spawns the server it sets GODOT_AI_OWNER_PID to the editor's PID (server_lifecycle.gd).
  • The server runs a lightweight watchdog (orphan_reaper.watch_owner) that polls that PID and, once the owner editor is gone and no editor session is connected, shuts itself down via SIGTERM (the graceful path uvicorn already handles).
  • The no-sessions guard preserves adoption: a different editor that adopted the server holds a live WebSocket session, so the watchdog leaves it running. That adopter's own clean exit (or later crash, by which point sessions are again zero) reaps it.

Why an env var, not a --owner-pid flag, for the plugin→server channel: an older server in a staggered user-mode upgrade would reject an unknown argparse flag and fail to spawn, whereas it silently ignores an unknown env var. The --owner-pid CLI flag still exists for tests/manual use; main() prefers it and falls back to the env var.

Servers started without an owner (CI's ci-start-server, manual --reload dev, uvx one-shots) never arm the watchdog and behave exactly as before.

Verification

  • pytest: 1106 passed — 6 new orphan_reaper unit tests (reap-when-dead, no-reap-while-alive, no-reap-when-adopted, reap-after-adopter-leaves, pid_alive), owner-pid plumbing tests (flag / env / malformed-env), and updated the 3 create_server stubs for the new kwarg.
  • Live end-to-end on Godot 4.6.3: plugin spawned the worktree server; log showed Orphan reaper armed for owner editor pid <N>; kill -9 of the editor (crash simulation, bypassing _exit_tree) reaped the server in ~1s and freed both ports.
  • 0 GDScript parse errors on Godot 4.6.3 and 4.3.

🤖 Generated with Claude Code

The plugin spawns the MCP server as a detached child (OS.create_process) so it
survives a plugin reload and can be re-adopted. The only teardown is the
editor's clean _exit_tree -> stop_server; a crash or hard-kill skips that and
leaves the server squatting on ports 8000/9500 until a human or the next
session's port reconciliation kills it. That's how a fresh session inherits a
stale older-version server.

Tie the server's life to its owner: when the plugin auto-spawns it, it now sets
GODOT_AI_OWNER_PID to the editor's PID. The server runs a lightweight watchdog
(orphan_reaper.watch_owner) that polls that PID and, once the owner editor is
gone AND no editor session is connected, shuts itself down via SIGTERM (the
same graceful path uvicorn already handles). The "no sessions" guard preserves
adoption: a different editor that adopted the server holds a live WebSocket
session, so the watchdog leaves it running.

Why an env var, not a --owner-pid flag, for the plugin->server channel: an
older server in a staggered user-mode upgrade would reject an unknown argparse
flag and fail to spawn, whereas it silently ignores an unknown env var. The
CLI flag still exists for tests/manual use; main() prefers it and falls back to
the env var.

Servers started without an owner (CI's ci-start-server, manual --reload dev,
uvx one-shots) never arm the watchdog and behave exactly as before.

Verified:
- pytest: 1106 passed (6 new orphan_reaper unit tests + owner-pid plumbing
  tests; updated 3 create_server stubs for the new kwarg).
- Live end-to-end on Godot 4.6.3: plugin spawned the worktree server, log
  showed "Orphan reaper armed for owner editor pid <N>"; SIGKILL of the editor
  (crash sim) reaped the server in ~1s and freed both ports.
- 0 GDScript parse errors on Godot 4.6.3 and 4.3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@codecov

codecov Bot commented May 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 71.62162% with 21 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/godot_ai/orphan_reaper.py 74.00% 13 Missing ⚠️
src/godot_ai/server.py 38.46% 8 Missing ⚠️

📢 Thoughts on this report? Let us know!

The reaper's liveness probe used os.kill(owner_pid, 0). On POSIX that's a
harmless existence check, but on Windows os.kill has no signal-0 concept — any
non-CTRL signal calls TerminateProcess. So on Windows the first poll would have
*killed the owner editor* it was meant to check (then crashed the reaper task
on the next poll with an uncaught OSError). CI never caught this because
ci-start-server starts the server without an owner pid, so the reaper is never
armed in CI on any platform; the macOS live smoke was the only place it ran.

Fix, conservatively:
- should_arm_reaper() disables the reaper on Windows (sys.platform win*). The
  plugin also skips setting GODOT_AI_OWNER_PID on Windows. Windows keeps its
  pre-change behavior: clean editor shutdown still stops the server; a hard-
  crash orphan lingers until the next session's port reconciliation (status
  quo, no regression). Tracked as a follow-up to validate a Windows liveness
  path on real Windows before enabling.
- pid_alive() no longer uses os.kill on Windows: it opens a SYNCHRONIZE handle
  and asks WaitForSingleObject (never TerminateProcess). Defensive only today
  since the reaper is gated off there, but correct if ever enabled / called.
- pid_alive() POSIX path also catches generic OSError -> treat as alive
  (conservative: never reap on an ambiguous errno).

Unit tests cover should_arm_reaper across posix/windows + invalid pids.
macOS live reap re-verified after the gate (server reaped ~4s, ports freed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dsarno

dsarno commented May 30, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up (cfa0846): found and fixed a Windows landmine while smoke-testing harder.

The liveness probe used os.kill(owner_pid, 0). POSIX: harmless existence check. Windows: os.kill has no signal-0 — any non-CTRL signal calls TerminateProcess, so the first poll would have killed the owner editor it was checking (then crashed the reaper task on the next poll). CI never surfaced it because ci-start-server starts the server with no owner pid, so the reaper is never armed in CI on any platform — my macOS live smoke was the only place it ran.

Fix is conservative: the reaper is now gated off on Windows (both should_arm_reaper() in Python and the plugin skipping GODOT_AI_OWNER_PID). Windows keeps its prior behavior — clean editor shutdown still stops the server; a hard-crash orphan lingers until the next session's port reconciliation, exactly as before. pid_alive() was also made non-destructive on Windows (OpenProcess + WaitForSingleObject, never TerminateProcess) as defense-in-depth. Re-verified the macOS reap still works after the gate.

Coverage honesty: the armed reaper has only run live on macOS. Linux is POSIX-identical (os.kill(pid,0)) so low-risk but not live-exercised; Windows is intentionally disabled pending validation on real Windows.

The reaper had no CI coverage at all (ci-start-server never passes an owner
pid, so the plugin-armed path never runs in CI). Close that — and the
"armed reaper only ever ran on macOS" gap — with an integration test that
spawns a real `python -m godot_ai --owner-pid` server, kills a throwaway owner
process, and asserts the server self-terminates and frees its port.

This exercises the genuine POSIX primitives (os.kill(pid,0) liveness + SIGTERM
self-shutdown → uvicorn graceful drain) on every OS pytest runs on — crucially
the ubuntu runners (Python 3.11 + 3.13), giving real Linux coverage of the
armed reaper. Skipped on Windows, where the reaper is disabled.

Adds GODOT_AI_REAPER_POLL_SECONDS so the test drives a sub-second reap instead
of waiting the production 5s; malformed/absent falls back to the default.

Passes locally on macOS (Python 3.13) in ~2s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ties the Python MCP server lifecycle to the Godot editor that spawned it, so detached servers don’t linger on ports after an editor crash/hard-kill and accidentally get adopted by later sessions.

Changes:

  • Added owner-PID plumbing from the Godot plugin → server process (via GODOT_AI_OWNER_PID env and a --owner-pid CLI override).
  • Introduced an orphan_reaper watchdog that self-terminates the server when the owning editor PID is gone and no editor sessions are connected.
  • Added unit + integration tests covering reaper behavior and CLI/env owner-pid parsing.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/unit/test_runtime_info.py Updates server stub to accept new owner_pid kwarg.
tests/unit/test_orphan_reaper.py Unit-tests should_arm_reaper, pid_alive, and watch_owner behavior.
tests/unit/test_cli_reload.py Tests owner-pid flag/env parsing and forwarding into create_server(...).
tests/integration/test_orphan_reaper_subprocess.py End-to-end subprocess test verifying real self-termination after owner death (non-Windows).
src/godot_ai/server.py Arms/cancels the reaper task during ASGI lifespan when owner_pid is provided.
src/godot_ai/orphan_reaper.py New watchdog implementation (poll interval, pid liveness, graceful SIGTERM).
src/godot_ai/init.py Adds --owner-pid and env fallback; forwards into create_server(...).
plugin/addons/godot_ai/utils/server_lifecycle.gd Sets GODOT_AI_OWNER_PID when spawning the detached server (non-Windows).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/godot_ai/orphan_reaper.py Outdated
Comment on lines +106 to +117
import ctypes

SYNCHRONIZE = 0x00100000
WAIT_TIMEOUT = 0x00000102 # still running
kernel32 = ctypes.windll.kernel32 # type: ignore[attr-defined]
handle = kernel32.OpenProcess(SYNCHRONIZE, False, pid)
if not handle:
return False # no such process (or no access) — treat as gone
try:
return kernel32.WaitForSingleObject(handle, 0) == WAIT_TIMEOUT
finally:
kernel32.CloseHandle(handle)
Comment on lines +524 to +528
## Tell the spawned server which editor owns it so it can self-reap if we
## die without a clean stop_server (crash / hard-kill). Passed via env, not
## a CLI flag, so an older server (staggered user-mode upgrade) silently
## ignores an unknown var instead of failing argparse. Left set, not
## restored: it's this editor's own constant PID, so it's correct for every
Comment thread src/godot_ai/orphan_reaper.py Outdated
Comment on lines +106 to +117
import ctypes

SYNCHRONIZE = 0x00100000
WAIT_TIMEOUT = 0x00000102 # still running
kernel32 = ctypes.windll.kernel32 # type: ignore[attr-defined]
handle = kernel32.OpenProcess(SYNCHRONIZE, False, pid)
if not handle:
return False # no such process (or no access) — treat as gone
try:
return kernel32.WaitForSingleObject(handle, 0) == WAIT_TIMEOUT
finally:
kernel32.CloseHandle(handle)
… docs, test flake fixes

Addresses review findings (self-review + codex) on the reaper:

1. Adoption hand-off race (correctness): watch_owner now requires the
   "owner dead AND zero sessions" condition to hold on TWO samples grace_seconds
   apart (default: one poll interval). A transient zero-session blip — an
   adopter editor's WebSocket reconnecting across a plugin reload / GC pause —
   no longer triggers a wrong reap that would SIGTERM the server out from under
   the live adopter. Cost: a genuine orphan reaps in ~poll+grace instead of
   ~poll. New unit test test_grace_recheck_prevents_reap_on_transient_zero.

2. Owner-env leak (codex blocking finding): GODOT_AI_OWNER_PID was left set in
   the editor's process env, so a later non-reload godot-ai subprocess could
   inherit it and wrongly arm a reaper. Now scoped tightly around each spawn
   (set -> create_process -> unset), via a shared _set_owner_pid_env() helper,
   in BOTH start_server and the refresh-retry path (which previously relied on
   the leftover env). Verified live: the spawned server inherits the var, the
   editor env does not retain it, and the reaper still arms + reaps.

3. pid_alive docstring (honesty): corrected the "only delays a reap" claim — a
   recycled owner PID can cause a permanent missed reap (bare-pid check, no
   identity proof). Documented as a known limitation that degrades to the prior
   port-reconciliation behavior, not an unbounded leak.

4. Integration test flake: dedup http/ws ports (two _free_port() calls could
   collide) and poll for port release after exit instead of asserting once
   (TIME_WAIT).

pytest 1111 passed; 0 GDScript parse errors on 4.6.3 + 4.3; live reap re-verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dsarno

dsarno commented May 30, 2026

Copy link
Copy Markdown
Contributor Author

Review fixes (6648027) — addresses codex's blocking finding + a high-effort self-review.

  • Owner-env leak (codex blocking): GODOT_AI_OWNER_PID is no longer left set in the editor process env. It's now scoped tightly around each OS.create_process (set → spawn → unset) via a shared _set_owner_pid_env() helper, in both start_server and the refresh-retry path (the latter previously relied on the leftover env — fragile). Verified live: the spawned server inherits the var, the editor env does not retain it, and the reaper still arms + reaps.
  • Adoption hand-off race (self-review, correctness): the reaper now requires "owner dead AND zero sessions" to hold on two samples a grace window apart, so a transient zero-session blip (an adopter's WebSocket reconnecting across a plugin reload) can't SIGTERM the server out from under the live adopter. New unit test covers it.
  • pid_alive docstring: corrected the "only delays a reap" claim — a recycled owner PID can cause a permanent missed reap (documented as degrading to the prior port-reconciliation behavior, not an unbounded leak).
  • Integration-test flake: dedup the two ephemeral ports and poll for port release after exit (TIME_WAIT).

pytest 1111 passed; 0 GDScript parse errors on 4.6.3 + 4.3; reap re-verified live on macOS (server inherits env, editor doesn't, kill→reap in ~poll+grace).

Copilot flagged _pid_alive_windows for using ctypes.windll.kernel32 without
argtypes/restype (HANDLE truncation on 64-bit Windows) and treating access-
denied as "dead" (opposite of the POSIX conservative behavior). Since the
reaper is disabled on Windows (should_arm_reaper) the function never ran in
production and was untested — both this review and the prior one flagged it as
dead code. Rather than ship/maintain an unexercised process-control probe,
remove it and make pid_alive raise NotImplementedError on Windows. That keeps
the destructive os.kill(pid,0) path unreachable on Windows and forces a future
Windows enablement to add a real, access-denied-conservative liveness check.
New unit test asserts the raise.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dsarno dsarno merged commit 5f7ce91 into main May 30, 2026
15 checks passed
@dsarno dsarno deleted the server-orphan-reaper branch May 30, 2026 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants