Fix test_shutdown_worker leaving stale callback that breaks later tests #7279
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #7279   +/-  ##
======================================
- Coverage   79.90%   79.90%   -0.00%
======================================
  Files         568      568
  Lines       43984    43984
======================================
- Hits        35140    35139       -1
- Misses       8844     8845       +1
a045f73 to 89a8172 (compare)
danielhollas
left a comment
Thanks for a thorough analysis! Makes sense to me.
@pytest.fixture(autouse=True)
def _reset_runner(request):
    yield
    get_manager().reset_runner()
I guess deleting this fixture is only safe if we modify the manager fixture in the previous PR?
Yes, exactly. Commit 46b6f66 fixes the bad fixture-usage requirement of resetting the runner inside the test when using the manager fixture. There was, however, a subsequent issue with test_shutdown_worker that did not allow us to remove it in PR #7268. Because it is quite complicated, requiring an understanding of implementation details of the Python event loop, a separate PR for discussion also made sense.
    assert runner.is_closed()
finally:
    # Reset the runner of the manager, because once closed it cannot be reused by other tests.
    manager._runner = None
Why is this no longer needed?
I mean the line `manager._runner = None`.
89a8172 to e69634e (compare)
danielhollas
left a comment
I can't say I went through this in all the detail, but it looks good to me as long as CI passes :-) (It might be worth re-running CI a couple of times, or running the presto test suite locally in a loop for a while, to verify that there are no other side effects of the changes here.)
Root cause:
test_shutdown_worker was an async test (@pytest.mark.asyncio), so
pytest-asyncio ran it via loop.run_until_complete(). Inside the test,
shutdown_worker() calls runner.close() -> loop.stop(). Calling
loop.stop() from within loop.run_until_complete() leaves a stale
_run_until_complete_cb in the event loop's ready queue. This callback
calls loop.stop() when it fires, which poisons the next
run_until_complete() call on the same loop with:
RuntimeError: Event loop stopped before Future completed
Why the next test is affected:
The event_loop fixture in conftest.py returns manager.get_runner().loop.
After test_shutdown_worker, this loop still has the stale callback.
When a later test (e.g. test_calc_job_node_get_builder_restart) calls
run_until_complete() on the same loop, the stale callback fires
loop.stop() prematurely.
Why the _reset_runner autouse fixture masked the issue:
The _reset_runner fixture called manager.reset_runner() after every
test, which called runner.close() -> loop.close(). This closed the
loop entirely, so get_or_create_event_loop() created a fresh loop
for the next test without the stale callback. Removing this fixture
exposed the underlying bug.
Why run_forever fixes it:
In the real daemon, start_daemon_worker() uses loop.run_forever(),
not loop.run_until_complete(). run_forever() does not add a
_run_until_complete_cb, so loop.stop() cleanly exits the loop with
no stale callbacks. The fix changes the test to use the same pattern:
schedule shutdown_worker as a task, then call run_forever(). This
mirrors production behavior and avoids the stale callback.
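The fixed pattern can be sketched with plain asyncio (hypothetical stand-in for the test, not the actual aiida-core code): schedule the shutdown coroutine as a task, then drive the loop with run_forever(), which registers no _run_until_complete_cb.

```python
import asyncio

loop = asyncio.new_event_loop()

async def shutdown_worker():
    # Stand-in for the real shutdown_worker(): release resources, then stop.
    loop.stop()

# run_forever() adds no internal completion callback, so loop.stop() exits
# the loop cleanly and leaves nothing stale in the ready queue.
loop.create_task(shutdown_worker())
loop.run_forever()

# The loop remains usable afterwards: run_until_complete works as expected.
result = loop.run_until_complete(asyncio.sleep(0, result='ok'))
print(result)  # ok
loop.close()
```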
e69634e to 41dac39 (compare)
Summary
So the CI errored out after the rebase. The problem was that we have tests that share a runner. I think it is again a flaky test, and we were lucky that it was reproduced after the rebase.
Problem description
So these run_get_node functions all secretly use a manager in the tests. They do not use the manager fixture; they use get_manager() from aiida. Further, the manager is a singleton, and manager.get_runner() returns the same runner once it is initialized. Lastly, as long as the event loop is open, all runners get the same event loop through plumpy.get_or_create_event_loop, because we want nested AiiDA Process calls (e.g. a calcfunction inside a WorkChain) to put their tasks into the same event loop. The runner fixture, being a well-behaved fixture, resets its event loop on teardown.
So tests/engine/test_process_function.py::test_plugin_version creates the runner with get_manager().get_runner() implicitly through run_get_node. The next test, tests/engine/test_runners.py::test_call_on_process_finish, uses the runner fixture, which gets the same event loop through plumpy.get_or_create_event_loop; on teardown the fixture closes the event loop. Now the event loop from get_manager().get_runner() is invalid. The next test, tests/engine/test_runners.py::test_run_return_value_cached, gets the same runner that test_plugin_version produced through get_manager().get_runner(). The event loop is now closed, however, and the check whether the event loop is closed is only done when get_runner() initializes the runner, but this is the second call, which returns the cached runner. The whole thing as an ASCII graphic:
`tests/engine/test_process_function.py::test_plugin_version`
|
| creates / populates
v
Manager singleton
└── _runner = Runner A
└── loop = EventLoop L
`tests/engine/test_runners.py::test_call_on_process_finish`
|
| creates local fixture runner
v
Runner B
└── loop = EventLoop L
|
| fixture teardown calls `runner.close()`
v
closes EventLoop L
`tests/engine/test_runners.py::test_run_return_value_cached`
|
| later reuses Runner A via `get_manager().get_runner()`
v
uses EventLoop L (already closed)
Fix
The easiest fix was to make the runner fixture depend on the manager fixture. That way the runner fixture also becomes singleton-ish throughout the test run, and we do not have two runner instances conflicting with each other.
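The conflict and the fix can be illustrated with toy stand-ins for Manager and Runner (hypothetical classes, not the aiida-core API): a second Runner built on the shared loop poisons the cached runner on close, whereas reusing the manager's runner leaves only one Runner and one loop to manage.

```python
import asyncio

class Runner:
    def __init__(self, loop):
        self.loop = loop

    def close(self):
        self.loop.close()

class Manager:
    def __init__(self):
        self._runner = None

    def get_runner(self):
        if self._runner is None:  # cached: singleton-ish per manager
            self._runner = Runner(asyncio.new_event_loop())
        return self._runner

manager = Manager()

# Before the fix: a second Runner shared the manager's loop, so closing it
# in fixture teardown also broke the cached manager runner.
shared_loop = manager.get_runner().loop
second = Runner(shared_loop)
second.close()
print(manager.get_runner().loop.is_closed())  # True: cached runner poisoned

# After the fix: the "runner fixture" simply hands back the manager's runner,
# so there is no second Runner instance to conflict with.
manager2 = Manager()
runner_fixture = manager2.get_runner()
print(runner_fixture is manager2.get_runner())  # True
```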
Future work
There is a bigger follow-up question: should Runner not also be a singleton, given that two Runner instances can share the same event loop? Maybe the design decision to simplify nested calling of AiiDA processes by reusing the same event loop through get_or_create_event_loop was not a good call, because it makes Runners share state. But I do not know how to solve this in a simple way. In the serialization of the process state we do not persist the event loop, so on recovery, assuming that there is only one event loop across all runners simplified this problem. On the other hand, we have the runner singletonized by manager.get_runner(), so this should be a fine solution. It is just that the manager only exists in aiida-core and not in plumpy. So maybe we can improve this when we move plumpy into aiida-core.
Prevent tests from closing the shared manager runner directly. The original CI failure was caused by tests creating a custom runner that could still share the event loop of the cached global manager runner. When the custom runner was closed in teardown, it closed that shared loop as well. A later test then reused the cached manager runner and failed because it still pointed to the closed loop. This commit makes that contract explicit in the test fixtures. Tests are no longer allowed to close the global manager runner. Instead, tests that need shutdown semantics must use isolated runner fixtures with their own event loop. For tests that need to exercise manager-based shutdown, an isolated runner can be temporarily installed as the manager runner. The CalcJob caching test is also adjusted to clear the plugin version cache directly instead of resetting the global runner, which would now violate the stronger isolation policy.
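The "temporarily install an isolated runner as the manager runner" pattern mentioned above can be sketched with stub objects (hypothetical names, not the aiida-core API): swap in a runner with its own loop, exercise shutdown, then restore the shared runner untouched.

```python
import asyncio

class StubRunner:
    def __init__(self):
        self.loop = asyncio.new_event_loop()

    def close(self):
        self.loop.close()

class StubManager:
    def __init__(self, runner):
        self._runner = runner

shared = StubRunner()
manager = StubManager(shared)

isolated = StubRunner()  # has its own loop, independent of the shared one
original = manager._runner
manager._runner = isolated
try:
    pass  # ... exercise manager-based shutdown against the isolated runner ...
finally:
    manager._runner = original  # restore the shared runner
    isolated.close()            # only the isolated loop is closed

print(manager._runner is shared, shared.loop.is_closed())  # True False
```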
200cabb to 4087193 (compare)
4087193 to 8ca8906 (compare)
So in the end it is really a minor issue with a lot of complexity. It only affects the tests. This fix requires #7268 to be merged first, because it relies on some fixes that release resources in the tests (the manager fixture).
Minimal example to reproduce
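A self-contained reproduction using plain asyncio (no AiiDA code involved), following the root-cause analysis above:

```python
import asyncio

loop = asyncio.new_event_loop()

async def stops_the_loop():
    loop.stop()  # mimics shutdown_worker() -> runner.close() -> loop.stop()

# The task completes, so this call returns normally, but the internal
# _run_until_complete_cb scheduled on completion is never consumed:
# loop.stop() halted the loop first, so the handle stays in the ready queue.
loop.run_until_complete(stops_the_loop())

async def innocent():
    await asyncio.sleep(0)
    return 42

# On the next iteration the stale callback fires first and stops the loop
# again, poisoning this otherwise harmless call.
error = None
try:
    loop.run_until_complete(innocent())
except RuntimeError as exc:
    error = str(exc)
print(error)  # Event loop stopped before Future completed.

loop.close()
```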