Skip to content

Commit be10171

Browse files
dsarnoclaude
andauthored
[Audit v2 #353] 30s watchdog on filesystem_changed during update reload (#381)
* [Audit v2 #353] 30s watchdog on filesystem_changed during update reload Pre-fix, `_start_filesystem_scan` connected `filesystem_changed` (CONNECT_ONE_SHOT) and called `fs.scan()`, then sat in `_waiting_for_scan = true` forever if the signal never fired (slow disk, NFS, AV holding the just-extracted addon files open). The update completed mechanically — files on disk, plugin disabled — but the dock stayed disabled because `_finish_scan_wait` was the only path back to `_enable_new_plugin`. User had to kill and restart Godot. Add a 30-second `Timer` armed alongside the signal connect: - `_arm_scan_watchdog`: lazy-creates a one-shot `Timer` child node on first use, reuses it on subsequent arms (the runner does two filesystem scans per update — new files then existing files). - `_on_scan_watchdog_timeout`: if the wait is still open, push a warning, disconnect the `filesystem_changed` listener so a delayed signal can't double-call, then dispatch `_finish_scan_wait`. `_finish_scan_wait`'s existing `if not _waiting_for_scan: return` guard makes it idempotent against the signal-arrives-just-before-watchdog race. - `_finish_scan_wait` now also stops the still-running timer on the happy path, so a queued late `timeout` can't fire after a successful scan. Worst case after the watchdog fires: the new files aren't visible on the first frame the new plugin enables, but they get picked up on the next scan. That's strictly better than the prior dock- permanently-disabled state. Tests in `test_project/tests/test_update_reload_runner.gd`: - `test_watchdog_timeout_proceeds_when_signal_never_fires` — the deadlock case the issue describes. - `test_watchdog_no_op_when_signal_already_settled` — late-Timer race after the signal won. - `test_finish_scan_wait_stops_armed_watchdog` — happy path cancels the Timer. - `test_watchdog_timer_reused_across_arms` — second scan in the same update doesn't leak a second Timer child. Tests fire `_on_scan_watchdog_timeout` and `_finish_scan_wait` directly rather than waiting for the wall-clock Timer, keeping runs deterministic and sub-second. Closes #353 * Fix CI: remove add_child(runner) from watchdog tests McpTestSuite extends RefCounted (not Node), so add_child(runner) errors at test runtime — Linux/macOS/Windows all failed with the same root cause. The tests fire _on_scan_watchdog_timeout() directly instead of relying on the Timer counting down, so the runner doesn't actually need to be in a SceneTree; the Timer just needs to be a child of the runner (works regardless of where the runner sits). * Redesign per Copilot review: sticky bypass after watchdog + tests in tree Two issues from Copilot's review on PR #381 + the second CI failure: (A) Cross-scan signal race [Copilot]: scan #1 watchdog'd, scan #2 then arms a fresh `filesystem_changed` listener. A delayed emission from scan #1 fires on whichever listener is currently connected to the shared signal — Godot can't tag emissions with their source scan — so scan #2's listener falsely settles before its actual filesystem scan completed, re-enabling the plugin against a potentially incomplete on-disk install. Generation-counter + Callable.bind doesn't help because a single emission still fires every connected listener (CONNECT_ONE_SHOT only auto-disconnects after firing). (B) Second CI failure: `Timer.start()` requires the Timer to be inside a SceneTree. The previous fix removed `add_child(runner)` from tests but didn't parent the runner anywhere, so when `_arm_scan_watchdog` called `add_child(timer)` and `timer.start()` inside the runner, the timer wasn't in any tree. Fix: - Add a sticky `_scan_timed_out` flag set by `_on_scan_watchdog_timeout`. - `_start_filesystem_scan` checks the flag at the top and bypasses the connect + `fs.scan()` path entirely, falling straight through to `call_deferred(deferred_step)`. With no listener armed, a delayed emission from the timed-out scan has nothing to satisfy. Godot's normal background scan catches up after the plugin re-enables. - Tests now parent the runner via `(Engine.get_main_loop() as SceneTree).root.add_child(runner)` so `Timer.start()` works. Cleanup via a `_free_runner` helper (remove_child + free). - New regression test `test_subsequent_scan_after_watchdog_bypasses_ listener_arm` explicitly drives scan #1 → watchdog → scan #2 and asserts scan #2 does NOT set `_waiting_for_scan = true` (i.e. no listener armed for a delayed scan-#1 emission to satisfy). - Existing watchdog test `test_watchdog_no_op_when_signal_already_settled` also asserts `_scan_timed_out` stays false on the late-Timer-after- successful-signal path — guards against the watchdog poisoning subsequent scans when no actual deadlock happened. --------- Co-authored-by: Claude <noreply@anthropic.com>
1 parent dfb8e02 commit be10171

2 files changed

Lines changed: 226 additions & 0 deletions

File tree

plugin/addons/godot_ai/update_reload_runner.gd

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,23 @@ var _next_step := ""
3535
var _frames_remaining := 0
3636
var _waiting_for_scan := false
3737
var _scan_next_step := ""
38+
## Watchdog for `_start_filesystem_scan`: if Godot's `filesystem_changed`
39+
## signal never fires (slow disk, NFS, AV holding the just-extracted addon
40+
## files open), the runner used to hang in `_waiting_for_scan = true`
41+
## forever and the dock stayed disabled. After this timeout we disconnect
42+
## the signal and proceed anyway — worst case the new files aren't visible
43+
## on the first frame, but they get picked up on the next scan. See
44+
## audit-v2 finding #9 (issue #353). Untyped to match the codebase's
45+
## defensive pattern for state that survives `fs.scan()` during update.
46+
const SCAN_WATCHDOG_SECS := 30.0
47+
var _scan_watchdog_timer = null
48+
## Sticky flag set by `_on_scan_watchdog_timeout`. Subsequent
49+
## `_start_filesystem_scan` calls in the same update bypass connect+scan
50+
## so a delayed `filesystem_changed` emission from the timed-out scan
51+
## can't fire on a freshly-armed listener for the next scan and falsely
52+
## settle it before that scan actually completed. See PR #381 review for
53+
## the cross-scan race this guards against.
54+
var _scan_timed_out := false
3855
## Keep Array fields untyped: this runner survives fs.scan() during update,
3956
## and typed Variant storage is part of the hot-reload crash class.
4057
var _new_file_paths = []
@@ -121,13 +138,68 @@ func _start_filesystem_scan(next_step: String = "_enable_new_plugin") -> void:
121138
call_deferred(deferred_step)
122139
return
123140

141+
## Bypass: a previous scan in this update already watchdog'd, so the
142+
## editor's filesystem is unresponsive. Re-arming a `filesystem_changed`
143+
## listener now would race with a delayed emission from the timed-out
144+
## scan: that single emission would fire whichever listener is currently
145+
## connected to the shared signal, falsely settling this scan before it
146+
## actually completed. Skip the wait; Godot's normal background scan
147+
## catches up after the plugin re-enables. See PR #381 review.
148+
if _scan_timed_out:
149+
push_warning(
150+
"MCP | skipping filesystem_changed wait after previous timeout (next_step=%s)"
151+
% deferred_step
152+
)
153+
call_deferred(deferred_step)
154+
return
155+
124156
_waiting_for_scan = true
125157
_scan_next_step = deferred_step
126158
if not fs.filesystem_changed.is_connected(_on_filesystem_changed):
127159
fs.filesystem_changed.connect(_on_filesystem_changed, CONNECT_ONE_SHOT)
160+
_arm_scan_watchdog()
128161
fs.scan()
129162

130163

164+
func _arm_scan_watchdog() -> void:
165+
if _scan_watchdog_timer == null:
166+
_scan_watchdog_timer = Timer.new()
167+
_scan_watchdog_timer.one_shot = true
168+
_scan_watchdog_timer.timeout.connect(_on_scan_watchdog_timeout)
169+
add_child(_scan_watchdog_timer)
170+
_scan_watchdog_timer.start(SCAN_WATCHDOG_SECS)
171+
172+
173+
func _stop_scan_watchdog() -> void:
174+
if _scan_watchdog_timer != null:
175+
_scan_watchdog_timer.stop()
176+
177+
178+
func _on_scan_watchdog_timeout() -> void:
179+
## Signal didn't fire within SCAN_WATCHDOG_SECS — most likely the
180+
## filesystem scan is blocked behind a slow disk / NFS / AV scanner
181+
## still reading the just-extracted addon files.
182+
## Set the sticky `_scan_timed_out` flag so any subsequent
183+
## `_start_filesystem_scan` in this update skips its connect+scan
184+
## (otherwise a delayed emission from this scan would falsely settle
185+
## the next scan's listener — see PR #381 review).
186+
## Disconnect the current listener too, so this scan's listener can't
187+
## double-call `_finish_scan_wait` if the signal arrives quickly after
188+
## the timeout fires. `_finish_scan_wait` is idempotent on
189+
## `_waiting_for_scan == false`.
190+
if not _waiting_for_scan:
191+
return
192+
push_warning(
193+
"MCP | filesystem_changed didn't fire within %ds; proceeding without scan confirmation"
194+
% int(SCAN_WATCHDOG_SECS)
195+
)
196+
_scan_timed_out = true
197+
var fs := EditorInterface.get_resource_filesystem()
198+
if fs != null and fs.filesystem_changed.is_connected(_on_filesystem_changed):
199+
fs.filesystem_changed.disconnect(_on_filesystem_changed)
200+
_finish_scan_wait()
201+
202+
131203
func _read_update_manifest() -> bool:
132204
var zip_path := ProjectSettings.globalize_path(_zip_path)
133205
var install_base := ProjectSettings.globalize_path(INSTALL_BASE_PATH)
@@ -392,6 +464,7 @@ func _finish_scan_wait() -> void:
392464
if not _waiting_for_scan:
393465
return
394466
_waiting_for_scan = false
467+
_stop_scan_watchdog()
395468
var next_step := _scan_next_step
396469
_scan_next_step = ""
397470
set_process(false)

test_project/tests/test_update_reload_runner.gd

Lines changed: 153 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -478,3 +478,156 @@ func test_install_zip_paths_rolls_back_when_mid_loop_write_fails() -> void:
478478
# (rare but possible) doesn't re-attempt rollback against stale records.
479479
assert_eq(runner._paths_written.size(), 0, "_paths_written cleared after rollback")
480480
runner.free()
481+
482+
483+
# ----- _arm_scan_watchdog / _on_scan_watchdog_timeout (audit-v2 #9) -----
484+
485+
486+
func _scene_root() -> Node:
487+
# `Timer.start()` requires the timer (and thus its parent) to be
488+
# inside a SceneTree. `McpTestSuite` extends `RefCounted` so it can't
489+
# act as a parent itself — parent the runner under the editor's
490+
# SceneTree root for the duration of the test, then remove + free.
491+
var tree := Engine.get_main_loop() as SceneTree
492+
return tree.root if tree != null else null
493+
494+
495+
func _new_runner_in_tree():
496+
var runner = _new_runner()
497+
var root := _scene_root()
498+
assert_true(root != null, "test setup: SceneTree root must exist")
499+
root.add_child(runner)
500+
return runner
501+
502+
503+
func _free_runner(runner) -> void:
504+
if runner.get_parent() != null:
505+
runner.get_parent().remove_child(runner)
506+
runner.free()
507+
508+
509+
func _arm_scan_state(runner) -> void:
510+
# Mirror what `_start_filesystem_scan` would do, minus the actual
511+
# `EditorInterface.get_resource_filesystem().scan()` call. We can't
512+
# trigger Godot's filesystem_changed signal from a test, so we drive
513+
# the state machine directly.
514+
runner._waiting_for_scan = true
515+
runner._scan_next_step = "_enable_new_plugin"
516+
runner._arm_scan_watchdog()
517+
518+
519+
func test_watchdog_timeout_proceeds_when_signal_never_fires() -> void:
520+
## Pre-fix, if `filesystem_changed` deadlocked the runner sat in
521+
## `_waiting_for_scan = true` forever. The watchdog must clear that
522+
## flag and dispatch `_scan_next_step` so the rest of the update
523+
## sequence can finish.
524+
var runner = _new_runner_in_tree()
525+
_arm_scan_state(runner)
526+
assert_true(runner._waiting_for_scan, "scan wait armed by precondition")
527+
assert_true(runner._scan_watchdog_timer != null, "watchdog timer node exists once armed")
528+
529+
runner._on_scan_watchdog_timeout()
530+
531+
assert_false(runner._waiting_for_scan, "watchdog cleared the wait flag")
532+
assert_eq(runner._scan_next_step, "", "next-step token consumed by _finish_scan_wait")
533+
assert_true(runner._scan_timed_out, "watchdog must set the sticky bypass flag")
534+
_free_runner(runner)
535+
536+
537+
func test_watchdog_no_op_when_signal_already_settled() -> void:
538+
## Race: signal fires, `_finish_scan_wait` cleans up, then the watchdog
539+
## Timer fires anyway (Godot's Timer can have queued timeout). Calling
540+
## `_on_scan_watchdog_timeout` after a settled scan must be a no-op —
541+
## otherwise it would double-dispatch `_scan_next_step` AND it must
542+
## NOT poison subsequent scans by setting `_scan_timed_out = true`.
543+
var runner = _new_runner_in_tree()
544+
_arm_scan_state(runner)
545+
546+
# Simulate the happy path: signal arrived, _finish_scan_wait ran.
547+
runner._finish_scan_wait()
548+
assert_false(runner._waiting_for_scan, "wait already cleared by signal handler")
549+
550+
# Now the late watchdog timeout fires. Must not flip _waiting_for_scan
551+
# back to true, must not set _scan_timed_out (the scan succeeded).
552+
runner._on_scan_watchdog_timeout()
553+
assert_false(runner._waiting_for_scan, "watchdog stays no-op after settled wait")
554+
assert_false(
555+
runner._scan_timed_out,
556+
"watchdog no-op must not poison _scan_timed_out — the scan actually succeeded",
557+
)
558+
_free_runner(runner)
559+
560+
561+
func test_finish_scan_wait_stops_armed_watchdog() -> void:
562+
## Happy path: filesystem_changed signal arrives; `_finish_scan_wait`
563+
## must stop the still-running Timer so it doesn't fire later and
564+
## attempt a second cleanup. Verify by inspecting the Timer state.
565+
var runner = _new_runner_in_tree()
566+
_arm_scan_state(runner)
567+
assert_false(runner._scan_watchdog_timer.is_stopped(), "timer running after arm")
568+
569+
runner._finish_scan_wait()
570+
571+
assert_true(
572+
runner._scan_watchdog_timer.is_stopped(),
573+
"finish_scan_wait must stop the watchdog so it can't fire later",
574+
)
575+
_free_runner(runner)
576+
577+
578+
func test_watchdog_timer_reused_across_arms() -> void:
579+
## `_arm_scan_watchdog` lazy-creates the Timer on first use and reuses
580+
## it on subsequent arms. The runner makes two filesystem scans during
581+
## a single update (new files, then existing files), so the second arm
582+
## must not leak a second Timer child.
583+
var runner = _new_runner_in_tree()
584+
_arm_scan_state(runner)
585+
var first_timer = runner._scan_watchdog_timer
586+
runner._finish_scan_wait()
587+
588+
_arm_scan_state(runner)
589+
var second_timer = runner._scan_watchdog_timer
590+
591+
assert_true(first_timer == second_timer, "timer reused, not recreated")
592+
runner._finish_scan_wait()
593+
_free_runner(runner)
594+
595+
596+
func test_subsequent_scan_after_watchdog_bypasses_listener_arm() -> void:
597+
## Cross-scan race regression (PR #381 review): if scan #1 watchdog'd,
598+
## scan #2 must NOT arm a fresh `filesystem_changed` listener.
599+
##
600+
## Why: a delayed `filesystem_changed` emission from scan #1 fires on
601+
## any listener currently connected to the shared signal — Godot can't
602+
## tag emissions with their source scan. So if scan #2 has armed a
603+
## fresh listener by the time scan #1's emission finally arrives, that
604+
## listener fires and falsely settles scan #2 before its actual
605+
## filesystem scan completed — re-enabling the plugin against a
606+
## potentially-incomplete on-disk install.
607+
##
608+
## Fix: the watchdog sets the sticky `_scan_timed_out` flag, and
609+
## `_start_filesystem_scan` checks it at the top and bypasses the
610+
## connect+scan path entirely. The pending plugin re-enable still
611+
## happens via `call_deferred(next_step)` — Godot's normal background
612+
## scan catches up after the plugin re-enables.
613+
var runner = _new_runner_in_tree()
614+
615+
# Scan #1: arm + watchdog timeout.
616+
_arm_scan_state(runner)
617+
runner._on_scan_watchdog_timeout()
618+
assert_true(runner._scan_timed_out, "watchdog set the sticky flag")
619+
assert_false(runner._waiting_for_scan, "watchdog cleared the wait")
620+
621+
# Scan #2 attempts to start. Pre-fix, this re-armed the listener. Now,
622+
# the bypass path must short-circuit before _waiting_for_scan flips to
623+
# true. We pass a benign deferred step (`set_process`) so the
624+
# `call_deferred` in the bypass branch doesn't re-enable the plugin
625+
# in the test environment.
626+
runner._start_filesystem_scan("set_process")
627+
assert_false(
628+
runner._waiting_for_scan,
629+
"post-watchdog _start_filesystem_scan must NOT arm a new listener — "
630+
+ "no listener means no false-settle from a delayed scan-#1 emission",
631+
)
632+
633+
_free_runner(runner)

0 commit comments

Comments
 (0)