
Fix test_shutdown_worker leaving stale callback that breaks later tests#7279

Merged
agoscinski merged 2 commits into aiidateam:main from agoscinski:fix-stale-task-in-tests
Apr 21, 2026

Conversation

@agoscinski
Collaborator

@agoscinski agoscinski commented Mar 11, 2026

So in the end it is really a minor issue with a lot of complexity. It only affects the tests. This fix first requires #7268 to be merged, because it relies on some fixes there that release resources in the tests (the manager fixture).

Root cause

test_shutdown_worker was an async test (@pytest.mark.asyncio), so pytest-asyncio ran it via loop.run_until_complete(). Inside the test, shutdown_worker() calls runner.close() -> loop.stop(). Calling loop.stop() from within loop.run_until_complete() leaves a stale _run_until_complete_cb in the event loop's ready queue. This callback calls loop.stop() when it fires, which poisons the next run_until_complete() call on the same loop with: RuntimeError: Event loop stopped before Future completed

Why the next test is affected

The event_loop fixture in conftest.py returns manager.get_runner().loop. After test_shutdown_worker, this loop still has the stale callback. When a later test (e.g. test_calc_job_node_get_builder_restart) calls run_until_complete() on the same loop, the stale callback fires loop.stop() prematurely.

Why the _reset_runner autouse fixture masked the issue

The _reset_runner fixture called manager.reset_runner() after every test, which called runner.close() -> loop.close(). This closed the loop entirely, so get_or_create_event_loop() created a fresh loop for the next test without the stale callback. Removing this fixture exposed the underlying bug.
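The masking effect can be illustrated with plain asyncio, using the same mechanism as the minimal example further down (note that `loop._ready` is a CPython implementation detail, not public API):

```python
import asyncio

# Poison a loop the way the old async test did: loop.stop() from inside
# run_until_complete() leaves a stale _run_until_complete_cb in the ready queue.
loop = asyncio.new_event_loop()

async def stopper():
    await asyncio.sleep(0)
    loop.stop()

loop.run_until_complete(stopper())
assert len(loop._ready) >= 1  # the stale callback (CPython internal)

# What _reset_runner effectively did after every test: close the loop entirely.
loop.close()

# The next test then got a brand-new loop with an empty ready queue, so the
# stale callback never fired and the bug stayed hidden.
fresh = asyncio.new_event_loop()
assert fresh.run_until_complete(asyncio.sleep(0, result="ok")) == "ok"
fresh.close()
print("fresh loop unaffected")
```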

Why run_forever fixes it

In the real daemon, start_daemon_worker() uses loop.run_forever(), not loop.run_until_complete(). run_forever() does not add a _run_until_complete_cb, so loop.stop() cleanly exits the loop with no stale callbacks. The fix changes the test to use the same pattern: schedule shutdown_worker as a task, then call run_forever(). This mirrors production behavior and avoids the stale callback.
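The rewritten test follows roughly this shape (a simplified stand-in: `fake_shutdown_worker` here only stops the loop, whereas the real `shutdown_worker` closes the runner, whose `close()` calls `loop.stop()`):

```python
import asyncio

def test_shutdown_worker_pattern():
    """Sync test mirroring the daemon: schedule shutdown as a task, then run_forever()."""
    loop = asyncio.new_event_loop()
    done = []

    async def fake_shutdown_worker():
        # Stand-in for the real shutdown_worker().
        done.append(True)
        loop.stop()

    loop.create_task(fake_shutdown_worker())
    loop.run_forever()  # exits cleanly once loop.stop() fires

    assert done == [True]
    # No _run_until_complete_cb was left behind, so the loop is reusable.
    assert loop.run_until_complete(asyncio.sleep(0, result=42)) == 42
    loop.close()

test_shutdown_worker_pattern()
print("pattern ok")
```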

Minimal example to reproduce

"""Minimal example demonstrating how loop.stop() inside run_until_complete()
leaves a stale _run_until_complete_cb that breaks subsequent calls.

This is the root cause of the intermittent CI failure:
  RuntimeError: Event loop stopped before Future completed
"""
import asyncio


# === Working: loop.stop() inside run_forever() ===
# This is how the real daemon works. No stale callback.

loop = asyncio.new_event_loop()

async def shutdown_via_run_forever():
    await asyncio.sleep(0)
    loop.stop()  # cleanly exits run_forever()

loop.create_task(shutdown_via_run_forever())
loop.run_forever()
print(f"After run_forever + loop.stop(): _ready has {len(loop._ready)} stale items")
loop.close()


# === Broken: loop.stop() inside run_until_complete() ===
# This is what the async test did via pytest-asyncio.
# run_until_complete() adds _run_until_complete_cb to the task.
# When loop.stop() exits the loop early, that callback is never
# processed and stays in _ready.

loop = asyncio.new_event_loop()

async def shutdown_via_run_until_complete():
    await asyncio.sleep(0)
    loop.stop()  # exits run_until_complete() early

loop.run_until_complete(shutdown_via_run_until_complete())

n_stale = len(loop._ready)
print(f"After run_until_complete + loop.stop(): _ready has {n_stale} stale items")
for h in loop._ready:
    cb = getattr(h, '_callback', None)
    print(f"  stale callback: {getattr(cb, '__name__', repr(cb))}")


# === The stale callback poisons the next run_until_complete() ===

async def innocent_coroutine():
    await asyncio.sleep(0)  # yield so the stale callback fires first
    return 42

try:
    result = loop.run_until_complete(innocent_coroutine())
    print(f"Next call succeeded: {result}")
except RuntimeError as e:
    print(f"Next call FAILED: {e}")

loop.close()

@codecov

codecov Bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.90%. Comparing base (57378d6) to head (8ca8906).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7279      +/-   ##
==========================================
- Coverage   79.90%   79.90%   -0.00%     
==========================================
  Files         568      568              
  Lines       43984    43984              
==========================================
- Hits        35140    35139       -1     
- Misses       8844     8845       +1     


@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch 2 times, most recently from a045f73 to 89a8172 Compare March 11, 2026 09:00
@agoscinski agoscinski changed the base branch from main to fix-leak March 11, 2026 09:02
Collaborator

@danielhollas danielhollas left a comment


Thanks for a thorough analysis! Makes sense to me.

Comment thread tests/conftest.py
```python
@pytest.fixture(autouse=True)
def _reset_runner(request):
    yield
    get_manager().reset_runner()
```
Collaborator


I guess deleting this fixture is only safe if we modify the manager fixture in the previous PR?

Collaborator Author

@agoscinski agoscinski Apr 20, 2026


Yes, exactly. In commit 46b6f66 we fix the bad fixture usage that required resetting the runner inside the test when using the manager fixture. There was, however, the subsequent issue with test_shutdown_worker that did not allow us to remove it in PR #7268. Because it is quite complicated, requiring an understanding of implementation details of the Python event loop, a separate PR for discussion also made sense.

```python
    assert runner.is_closed()
finally:
    # Reset the runner of the manager, because once closed it cannot be reused by other tests.
    manager._runner = None
```
Collaborator


Why is this no longer needed?

Collaborator


I mean the line manager._runner = None.

Collaborator Author

@agoscinski agoscinski Apr 20, 2026


Because the manager fixture was properly fixed in 46b6f66. In principle we could have removed this part already in commit 46b6f66, but I just did not touch this test because I did not understand it then.

@agoscinski agoscinski added the pr/blocked PR is blocked by another PR that should be merged first label Mar 12, 2026
@agoscinski agoscinski changed the base branch from fix-leak to main April 14, 2026 21:02
@agoscinski agoscinski removed pr/blocked PR is blocked by another PR that should be merged first labels Apr 15, 2026
@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch from 89a8172 to e69634e Compare April 20, 2026 09:01
@danielhollas danielhollas self-requested a review April 20, 2026 09:05
Collaborator

@danielhollas danielhollas left a comment


I can't say I went through this in all the detail, but it looks good to me as long as CI passes :-) (Might be worth re-running CI a couple of times, or running the presto test suite locally in a loop for a while, to verify that there are no other side effects of the changes here.)

Root cause:
  test_shutdown_worker was an async test (@pytest.mark.asyncio), so
  pytest-asyncio ran it via loop.run_until_complete(). Inside the test,
  shutdown_worker() calls runner.close() -> loop.stop(). Calling
  loop.stop() from within loop.run_until_complete() leaves a stale
  _run_until_complete_cb in the event loop's ready queue. This callback
  calls loop.stop() when it fires, which poisons the next
  run_until_complete() call on the same loop with:
    RuntimeError: Event loop stopped before Future completed

Why the next test is affected:
  The event_loop fixture in conftest.py returns manager.get_runner().loop.
  After test_shutdown_worker, this loop still has the stale callback.
  When a later test (e.g. test_calc_job_node_get_builder_restart) calls
  run_until_complete() on the same loop, the stale callback fires
  loop.stop() prematurely.

Why the _reset_runner autouse fixture masked the issue:
  The _reset_runner fixture called manager.reset_runner() after every
  test, which called runner.close() -> loop.close(). This closed the
  loop entirely, so get_or_create_event_loop() created a fresh loop
  for the next test without the stale callback. Removing this fixture
  exposed the underlying bug.

Why run_forever fixes it:
  In the real daemon, start_daemon_worker() uses loop.run_forever(),
  not loop.run_until_complete(). run_forever() does not add a
  _run_until_complete_cb, so loop.stop() cleanly exits the loop with
  no stale callbacks. The fix changes the test to use the same pattern:
  schedule shutdown_worker as a task, then call run_forever(). This
  mirrors production behavior and avoids the stale callback.
@agoscinski agoscinski marked this pull request as draft April 20, 2026 12:35
@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch from e69634e to 41dac39 Compare April 20, 2026 12:37
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Apr 20, 2026
Collaborator Author

@agoscinski agoscinski left a comment


Summary

So the CI errored out after the rebase. The problem is that we have tests that share a runner. I think it is again a flaky test, and we were lucky that it was reproduced after the rebase.

Problem description

So these run_get_node functions all secretly use a manager in the tests. They do not use the manager fixture; they use get_manager() from aiida. Further, the manager is a singleton, and manager.get_runner() returns the same runner once initialized. Lastly, as long as the event loop is open, all runners get the same event loop via plumpy.get_or_create_event_loop, because we want nested aiida Process calls (e.g. a calcfunction in a WorkChain) to fill their tasks into the same event loop. The runner fixture, being a well-behaved fixture, resets its event loop on teardown.

So tests/engine/test_process_function.py::test_plugin_version creates the runner with get_manager().get_runner() implicitly through run_get_node. Then the next test, tests/engine/test_runners.py::test_call_on_process_finish, uses the runner fixture, which gets the same event loop through plumpy.get_or_create_event_loop; the fixture then tears down and closes the event loop. Now the event loop from get_manager().get_runner() is invalid. The next test, tests/engine/test_runners.py::test_run_return_value_cached, gets the same runner that tests/engine/test_process_function.py::test_plugin_version produced through get_manager().get_runner(). The event loop is now closed, however, and the check whether the event loop is closed is only done on initialization in get_runner(), while this is the second call, which uses the cached runner. The whole thing as an ASCII graphic:

`tests/engine/test_process_function.py::test_plugin_version`
    |
    | creates / populates
    v
Manager singleton
└── _runner = Runner A
    └── loop = EventLoop L

`tests/engine/test_runners.py::test_call_on_process_finish`
    |
    | creates local fixture runner
    v
Runner B
└── loop = EventLoop L
    |
    | fixture teardown calls `runner.close()`
    v
closes EventLoop L


`tests/engine/test_runners.py::test_run_return_value_cached`
    |
    | later reuses Runner A via `get_manager().get_runner()`
    v
uses EventLoop L (already closed)
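The sequence above can be reproduced with plain asyncio objects standing in for Runner A and Runner B (FakeRunner is a simplified sketch, assuming only that both runners share one loop, as plumpy.get_or_create_event_loop arranges):

```python
import asyncio

class FakeRunner:
    """Stand-in for aiida's Runner: holds a loop reference, closes it on close()."""
    def __init__(self, loop):
        self.loop = loop

    def run(self, coro):
        return self.loop.run_until_complete(coro)

    def close(self):
        self.loop.close()

# EventLoop L, shared the way plumpy.get_or_create_event_loop shares it.
shared_loop = asyncio.new_event_loop()

runner_a = FakeRunner(shared_loop)  # cached on the Manager singleton
runner_b = FakeRunner(shared_loop)  # local runner-fixture instance, same loop

assert runner_a.run(asyncio.sleep(0, result=1)) == 1  # first test passes

runner_b.close()  # fixture teardown closes EventLoop L

# A later test reuses the cached runner_a, whose loop is now closed.
try:
    runner_a.run(asyncio.sleep(0, result=3))
except RuntimeError as exc:
    print(f"cached runner fails: {exc}")
```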

Fix

Well, the easiest fix was to make the runner fixture depend on the manager fixture. That way the runner fixture also becomes singleton-ish throughout the tests, and we do not have two runner instances conflicting with each other.

Future work

There is a bigger subsequent question: should Runner not also be a singleton, given the fact that two Runner instances can share the same event loop? Maybe the design decision to simplify the nested calling of aiida processes by reusing the same event loop through get_or_create_event_loop was not a good call, because it makes Runners share state. But I don't know how to solve this in a simple way. In the serialization of the process state we do not persist the event loop, so on recovery, assuming that there is only one event loop across all runners simplified this problem. On the other hand, we have the runner singletonized by manager.get_runner, so this should be a fine solution. It is just that the manager only exists in aiida-core and not in plumpy. So maybe we improve this when we move plumpy into aiida-core.

agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Apr 20, 2026
Prevent tests from closing the shared manager runner directly.

The original CI failure was caused by tests creating a custom
runner that could still share the event loop of the cached global
manager runner. When the custom runner was closed in teardown, it
closed that shared loop as well. A later test then reused the
cached manager runner and failed because it still pointed to the
closed loop.

This commit makes that contract explicit in the test fixtures.
Tests are no longer allowed to close the global manager runner.
Instead, tests that need shutdown semantics must use isolated
runner fixtures with their own event loop. For tests that need to
exercise manager-based shutdown, an isolated runner can be
temporarily installed as the manager runner.

The CalcJob caching test is also adjusted to clear the plugin
version cache directly instead of resetting the global runner,
which would now violate the stronger isolation policy.
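The "isolated runner with its own event loop" idea can be sketched without any aiida machinery (isolated_loop is a hypothetical helper, not the fixture actually added by the commit):

```python
import asyncio
from contextlib import contextmanager

@contextmanager
def isolated_loop():
    """Give a test its own event loop, so closing it cannot break the shared one."""
    loop = asyncio.new_event_loop()
    try:
        yield loop
    finally:
        if not loop.is_closed():
            loop.close()

# A shutdown-style test may freely stop and close this loop...
with isolated_loop() as loop:
    loop.call_soon(loop.stop)
    loop.run_forever()

# ...while the shared loop other tests rely on stays untouched.
shared = asyncio.new_event_loop()
assert shared.run_until_complete(asyncio.sleep(0, result="ok")) == "ok"
shared.close()
print("shared loop unaffected")
```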
@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch from 200cabb to 4087193 Compare April 20, 2026 17:19
@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch from 4087193 to 8ca8906 Compare April 20, 2026 17:30
@agoscinski agoscinski marked this pull request as ready for review April 20, 2026 17:30
Collaborator

@khsrali khsrali left a comment


LGTM!

@agoscinski agoscinski merged commit d894b9a into aiidateam:main Apr 21, 2026
15 checks passed
agoscinski added a commit that referenced this pull request Apr 21, 2026
@agoscinski agoscinski deleted the fix-stale-task-in-tests branch April 21, 2026 08:58
