Reap orphaned servers when their owning editor dies#495
Conversation
The plugin spawns the MCP server as a detached child (OS.create_process) so it survives a plugin reload and can be re-adopted. The only teardown is the editor's clean _exit_tree -> stop_server; a crash or hard-kill skips that and leaves the server squatting on ports 8000/9500 until a human or the next session's port reconciliation kills it. That's how a fresh session inherits a stale older-version server. Tie the server's life to its owner: when the plugin auto-spawns it, it now sets GODOT_AI_OWNER_PID to the editor's PID. The server runs a lightweight watchdog (orphan_reaper.watch_owner) that polls that PID and, once the owner editor is gone AND no editor session is connected, shuts itself down via SIGTERM (the same graceful path uvicorn already handles). The "no sessions" guard preserves adoption: a different editor that adopted the server holds a live WebSocket session, so the watchdog leaves it running. Why an env var, not a --owner-pid flag, for the plugin->server channel: an older server in a staggered user-mode upgrade would reject an unknown argparse flag and fail to spawn, whereas it silently ignores an unknown env var. The CLI flag still exists for tests/manual use; main() prefers it and falls back to the env var. Servers started without an owner (CI's ci-start-server, manual --reload dev, uvx one-shots) never arm the watchdog and behave exactly as before. Verified: - pytest: 1106 passed (6 new orphan_reaper unit tests + owner-pid plumbing tests; updated 3 create_server stubs for the new kwarg). - Live end-to-end on Godot 4.6.3: plugin spawned the worktree server, log showed "Orphan reaper armed for owner editor pid <N>"; SIGKILL of the editor (crash sim) reaped the server in ~1s and freed both ports. - 0 GDScript parse errors on Godot 4.6.3 and 4.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
The reaper's liveness probe used os.kill(owner_pid, 0). On POSIX that's a harmless existence check, but on Windows os.kill has no signal-0 concept — any non-CTRL signal calls TerminateProcess. So on Windows the first poll would have *killed the owner editor* it was meant to check (then crashed the reaper task on the next poll with an uncaught OSError). CI never caught this because ci-start-server starts the server without an owner pid, so the reaper is never armed in CI on any platform; the macOS live smoke was the only place it ran. Fix, conservatively: - should_arm_reaper() disables the reaper on Windows (sys.platform win*). The plugin also skips setting GODOT_AI_OWNER_PID on Windows. Windows keeps its pre-change behavior: clean editor shutdown still stops the server; a hard- crash orphan lingers until the next session's port reconciliation (status quo, no regression). Tracked as a follow-up to validate a Windows liveness path on real Windows before enabling. - pid_alive() no longer uses os.kill on Windows: it opens a SYNCHRONIZE handle and asks WaitForSingleObject (never TerminateProcess). Defensive only today since the reaper is gated off there, but correct if ever enabled / called. - pid_alive() POSIX path also catches generic OSError -> treat as alive (conservative: never reap on an ambiguous errno). Unit tests cover should_arm_reaper across posix/windows + invalid pids. macOS live reap re-verified after the gate (server reaped ~4s, ports freed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Follow-up (cfa0846): found and fixed a Windows landmine while smoke-testing harder. The liveness probe used Fix is conservative: the reaper is now gated off on Windows (both Coverage honesty: the armed reaper has only run live on macOS. Linux is POSIX-identical ( |
The reaper had no CI coverage at all (ci-start-server never passes an owner pid, so the plugin-armed path never runs in CI). Close that — and the "armed reaper only ever ran on macOS" gap — with an integration test that spawns a real `python -m godot_ai --owner-pid` server, kills a throwaway owner process, and asserts the server self-terminates and frees its port. This exercises the genuine POSIX primitives (os.kill(pid,0) liveness + SIGTERM self-shutdown → uvicorn graceful drain) on every OS pytest runs on — crucially the ubuntu runners (Python 3.11 + 3.13), giving real Linux coverage of the armed reaper. Skipped on Windows, where the reaper is disabled. Adds GODOT_AI_REAPER_POLL_SECONDS so the test drives a sub-second reap instead of waiting the production 5s; malformed/absent falls back to the default. Passes locally on macOS (Python 3.13) in ~2s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR ties the Python MCP server lifecycle to the Godot editor that spawned it, so detached servers don’t linger on ports after an editor crash/hard-kill and accidentally get adopted by later sessions.
Changes:
- Added owner-PID plumbing from the Godot plugin → server process (via
GODOT_AI_OWNER_PIDenv and a--owner-pidCLI override). - Introduced an
orphan_reaperwatchdog that self-terminates the server when the owning editor PID is gone and no editor sessions are connected. - Added unit + integration tests covering reaper behavior and CLI/env owner-pid parsing.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_runtime_info.py | Updates server stub to accept new owner_pid kwarg. |
| tests/unit/test_orphan_reaper.py | Unit-tests should_arm_reaper, pid_alive, and watch_owner behavior. |
| tests/unit/test_cli_reload.py | Tests owner-pid flag/env parsing and forwarding into create_server(...). |
| tests/integration/test_orphan_reaper_subprocess.py | End-to-end subprocess test verifying real self-termination after owner death (non-Windows). |
| src/godot_ai/server.py | Arms/cancels the reaper task during ASGI lifespan when owner_pid is provided. |
| src/godot_ai/orphan_reaper.py | New watchdog implementation (poll interval, pid liveness, graceful SIGTERM). |
| src/godot_ai/init.py | Adds --owner-pid and env fallback; forwards into create_server(...). |
| plugin/addons/godot_ai/utils/server_lifecycle.gd | Sets GODOT_AI_OWNER_PID when spawning the detached server (non-Windows). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| import ctypes | ||
|
|
||
| SYNCHRONIZE = 0x00100000 | ||
| WAIT_TIMEOUT = 0x00000102 # still running | ||
| kernel32 = ctypes.windll.kernel32 # type: ignore[attr-defined] | ||
| handle = kernel32.OpenProcess(SYNCHRONIZE, False, pid) | ||
| if not handle: | ||
| return False # no such process (or no access) — treat as gone | ||
| try: | ||
| return kernel32.WaitForSingleObject(handle, 0) == WAIT_TIMEOUT | ||
| finally: | ||
| kernel32.CloseHandle(handle) |
| ## Tell the spawned server which editor owns it so it can self-reap if we | ||
| ## die without a clean stop_server (crash / hard-kill). Passed via env, not | ||
| ## a CLI flag, so an older server (staggered user-mode upgrade) silently | ||
| ## ignores an unknown var instead of failing argparse. Left set, not | ||
| ## restored: it's this editor's own constant PID, so it's correct for every |
| import ctypes | ||
|
|
||
| SYNCHRONIZE = 0x00100000 | ||
| WAIT_TIMEOUT = 0x00000102 # still running | ||
| kernel32 = ctypes.windll.kernel32 # type: ignore[attr-defined] | ||
| handle = kernel32.OpenProcess(SYNCHRONIZE, False, pid) | ||
| if not handle: | ||
| return False # no such process (or no access) — treat as gone | ||
| try: | ||
| return kernel32.WaitForSingleObject(handle, 0) == WAIT_TIMEOUT | ||
| finally: | ||
| kernel32.CloseHandle(handle) |
… docs, test flake fixes Addresses review findings (self-review + codex) on the reaper: 1. Adoption hand-off race (correctness): watch_owner now requires the "owner dead AND zero sessions" condition to hold on TWO samples grace_seconds apart (default: one poll interval). A transient zero-session blip — an adopter editor's WebSocket reconnecting across a plugin reload / GC pause — no longer triggers a wrong reap that would SIGTERM the server out from under the live adopter. Cost: a genuine orphan reaps in ~poll+grace instead of ~poll. New unit test test_grace_recheck_prevents_reap_on_transient_zero. 2. Owner-env leak (codex blocking finding): GODOT_AI_OWNER_PID was left set in the editor's process env, so a later non-reload godot-ai subprocess could inherit it and wrongly arm a reaper. Now scoped tightly around each spawn (set -> create_process -> unset), via a shared _set_owner_pid_env() helper, in BOTH start_server and the refresh-retry path (which previously relied on the leftover env). Verified live: the spawned server inherits the var, the editor env does not retain it, and the reaper still arms + reaps. 3. pid_alive docstring (honesty): corrected the "only delays a reap" claim — a recycled owner PID can cause a permanent missed reap (bare-pid check, no identity proof). Documented as a known limitation that degrades to the prior port-reconciliation behavior, not an unbounded leak. 4. Integration test flake: dedup http/ws ports (two _free_port() calls could collide) and poll for port release after exit instead of asserting once (TIME_WAIT). pytest 1111 passed; 0 GDScript parse errors on 4.6.3 + 4.3; live reap re-verified. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Review fixes (6648027) — addresses codex's blocking finding + a high-effort self-review.
pytest 1111 passed; 0 GDScript parse errors on 4.6.3 + 4.3; reap re-verified live on macOS (server inherits env, editor doesn't, kill→reap in ~poll+grace). |
Copilot flagged _pid_alive_windows for using ctypes.windll.kernel32 without argtypes/restype (HANDLE truncation on 64-bit Windows) and treating access- denied as "dead" (opposite of the POSIX conservative behavior). Since the reaper is disabled on Windows (should_arm_reaper) the function never ran in production and was untested — both this review and the prior one flagged it as dead code. Rather than ship/maintain an unexercised process-control probe, remove it and make pid_alive raise NotImplementedError on Windows. That keeps the destructive os.kill(pid,0) path unreachable on Windows and forces a future Windows enablement to add a real, access-denied-conservative liveness check. New unit test asserts the raise. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
The plugin spawns the MCP server as a detached child (
OS.create_process) so it survives a plugin reload and can be re-adopted. The only teardown is the editor's clean_exit_tree → stop_server. A crash or hard-kill skips that, leaving the server squatting on ports 8000/9500 until a human or the next session's port reconciliation kills it — which is how a fresh session inherits a stale older-version server (the exact thing seen during the audit-PR smoke).Fix
Tie the server's life to its owner:
GODOT_AI_OWNER_PIDto the editor's PID (server_lifecycle.gd).orphan_reaper.watch_owner) that polls that PID and, once the owner editor is gone and no editor session is connected, shuts itself down viaSIGTERM(the graceful path uvicorn already handles).Why an env var, not a
--owner-pidflag, for the plugin→server channel: an older server in a staggered user-mode upgrade would reject an unknown argparse flag and fail to spawn, whereas it silently ignores an unknown env var. The--owner-pidCLI flag still exists for tests/manual use;main()prefers it and falls back to the env var.Servers started without an owner (CI's
ci-start-server, manual--reloaddev,uvxone-shots) never arm the watchdog and behave exactly as before.Verification
orphan_reaperunit tests (reap-when-dead, no-reap-while-alive, no-reap-when-adopted, reap-after-adopter-leaves,pid_alive), owner-pid plumbing tests (flag / env / malformed-env), and updated the 3create_serverstubs for the new kwarg.Orphan reaper armed for owner editor pid <N>;kill -9of the editor (crash simulation, bypassing_exit_tree) reaped the server in ~1s and freed both ports.🤖 Generated with Claude Code