
Fix test_shutdown_worker leaving stale callback that breaks later tests#7279

Merged
agoscinski merged 2 commits into aiidateam:main from agoscinski:fix-stale-task-in-tests
Apr 21, 2026

Conversation

@agoscinski
Collaborator

@agoscinski agoscinski commented Mar 11, 2026

So in the end it is really a minor issue with a lot of complexity. It only affects the tests. This fix first requires #7268 to be merged, because it relies on some fixes there that release resources in the tests (the manager fixture).

Root cause

test_shutdown_worker was an async test (@pytest.mark.asyncio), so pytest-asyncio ran it via loop.run_until_complete(). Inside the test, shutdown_worker() calls runner.close() -> loop.stop(). Calling loop.stop() from within loop.run_until_complete() leaves a stale _run_until_complete_cb in the event loop's ready queue. This callback calls loop.stop() when it fires, which poisons the next run_until_complete() call on the same loop with: RuntimeError: Event loop stopped before Future completed

Why the next test is affected

The event_loop fixture in conftest.py returns manager.get_runner().loop. After test_shutdown_worker, this loop still has the stale callback. When a later test (e.g. test_calc_job_node_get_builder_restart) calls run_until_complete() on the same loop, the stale callback fires loop.stop() prematurely.

Why the _reset_runner autouse fixture masked the issue

The _reset_runner fixture called manager.reset_runner() after every test, which called runner.close() -> loop.close(). This closed the loop entirely, so get_or_create_event_loop() created a fresh loop for the next test without the stale callback. Removing this fixture exposed the underlying bug.
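The masking effect can be illustrated with plain asyncio, using the same mechanism as the minimal example further down (note that `loop._ready` is a CPython implementation detail, not public API):

```python
import asyncio

# Poison a loop the way the old async test did: loop.stop() from inside
# run_until_complete() leaves a stale _run_until_complete_cb in the ready queue.
loop = asyncio.new_event_loop()

async def stopper():
    await asyncio.sleep(0)
    loop.stop()

loop.run_until_complete(stopper())
assert len(loop._ready) >= 1  # the stale callback (CPython internal)

# What _reset_runner effectively did after every test: close the loop entirely.
loop.close()

# The next test then got a brand-new loop with an empty ready queue, so the
# stale callback never fired and the bug stayed hidden.
fresh = asyncio.new_event_loop()
assert fresh.run_until_complete(asyncio.sleep(0, result="ok")) == "ok"
fresh.close()
print("fresh loop unaffected")
```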

Why run_forever fixes it

In the real daemon, start_daemon_worker() uses loop.run_forever(), not loop.run_until_complete(). run_forever() does not add a _run_until_complete_cb, so loop.stop() cleanly exits the loop with no stale callbacks. The fix changes the test to use the same pattern: schedule shutdown_worker as a task, then call run_forever(). This mirrors production behavior and avoids the stale callback.
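The rewritten test follows roughly this shape (a simplified stand-in: `fake_shutdown_worker` here only stops the loop, whereas the real `shutdown_worker` closes the runner, whose `close()` calls `loop.stop()`):

```python
import asyncio

def test_shutdown_worker_pattern():
    """Sync test mirroring the daemon: schedule shutdown as a task, then run_forever()."""
    loop = asyncio.new_event_loop()
    done = []

    async def fake_shutdown_worker():
        # Stand-in for the real shutdown_worker().
        done.append(True)
        loop.stop()

    loop.create_task(fake_shutdown_worker())
    loop.run_forever()  # exits cleanly once loop.stop() fires

    assert done == [True]
    # No _run_until_complete_cb was left behind, so the loop is reusable.
    assert loop.run_until_complete(asyncio.sleep(0, result=42)) == 42
    loop.close()

test_shutdown_worker_pattern()
print("pattern ok")
```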

Minimal example to reproduce

"""Minimal example demonstrating how loop.stop() inside run_until_complete()
leaves a stale _run_until_complete_cb that breaks subsequent calls.

This is the root cause of the intermittent CI failure:
  RuntimeError: Event loop stopped before Future completed
"""
import asyncio


# === Working: loop.stop() inside run_forever() ===
# This is how the real daemon works. No stale callback.

loop = asyncio.new_event_loop()

async def shutdown_via_run_forever():
    await asyncio.sleep(0)
    loop.stop()  # cleanly exits run_forever()

loop.create_task(shutdown_via_run_forever())
loop.run_forever()
print(f"After run_forever + loop.stop(): _ready has {len(loop._ready)} stale items")
loop.close()


# === Broken: loop.stop() inside run_until_complete() ===
# This is what the async test did via pytest-asyncio.
# run_until_complete() adds _run_until_complete_cb to the task.
# When loop.stop() exits the loop early, that callback is never
# processed and stays in _ready.

loop = asyncio.new_event_loop()

async def shutdown_via_run_until_complete():
    await asyncio.sleep(0)
    loop.stop()  # exits run_until_complete() early

loop.run_until_complete(shutdown_via_run_until_complete())

n_stale = len(loop._ready)
print(f"After run_until_complete + loop.stop(): _ready has {n_stale} stale items")
for h in loop._ready:
    cb = getattr(h, '_callback', None)
    print(f"  stale callback: {getattr(cb, '__name__', repr(cb))}")


# === The stale callback poisons the next run_until_complete() ===

async def innocent_coroutine():
    await asyncio.sleep(0)  # yield so the stale callback fires first
    return 42

try:
    result = loop.run_until_complete(innocent_coroutine())
    print(f"Next call succeeded: {result}")
except RuntimeError as e:
    print(f"Next call FAILED: {e}")

loop.close()

@codecov

codecov Bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.90%. Comparing base (57378d6) to head (8ca8906).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7279      +/-   ##
==========================================
- Coverage   79.90%   79.90%   -0.00%     
==========================================
  Files         568      568              
  Lines       43984    43984              
==========================================
- Hits        35140    35139       -1     
- Misses       8844     8845       +1     


@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch 2 times, most recently from a045f73 to 89a8172 Compare March 11, 2026 09:00
@agoscinski agoscinski changed the base branch from main to fix-leak March 11, 2026 09:02
Collaborator

@danielhollas danielhollas left a comment


Thanks for a thorough analysis! Makes sense to me.

Comment thread tests/conftest.py
```python
@pytest.fixture(autouse=True)
def _reset_runner(request):
    yield
    get_manager().reset_runner()
```
Collaborator


I guess deleting this fixture is only safe if we modify the manager fixture in the previous PR?

Collaborator Author

@agoscinski agoscinski Apr 20, 2026


Yes, exactly. In commit 46b6f66 we fix the bad fixture usage that required resetting the runner inside the test when using the manager fixture. There was, however, the subsequent issue with test_shutdown_worker that did not allow us to remove it in PR #7268. Because it is quite complicated, requiring an understanding of implementation details of the Python event loop, a separate PR for discussion also made sense.

```python
    assert runner.is_closed()
finally:
    # Reset the runner of the manager, because once closed it cannot be reused by other tests.
    manager._runner = None
```
Collaborator


Why is this no longer needed?

Collaborator


I mean the line manager._runner = None.

Collaborator Author

@agoscinski agoscinski Apr 20, 2026


Because the manager fixture was properly fixed in 46b6f66. In principle we could have removed this part already in commit 46b6f66, but I just did not touch this test because I did not understand it then.

@agoscinski agoscinski added the pr/blocked PR is blocked by another PR that should be merged first label Mar 12, 2026
@agoscinski agoscinski changed the base branch from fix-leak to main April 14, 2026 21:02
@agoscinski agoscinski removed pr/blocked PR is blocked by another PR that should be merged first labels Apr 15, 2026
@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch from 89a8172 to e69634e Compare April 20, 2026 09:01
@danielhollas danielhollas self-requested a review April 20, 2026 09:05
Collaborator

@danielhollas danielhollas left a comment


I can't say I went through this in all the detail, but it looks good to me as long as CI passes :-) (Might be worth re-running CI a couple of times, or running the presto test suite locally in a loop for a while, to verify that there are no other side effects of the changes here.)

Root cause:
  test_shutdown_worker was an async test (@pytest.mark.asyncio), so
  pytest-asyncio ran it via loop.run_until_complete(). Inside the test,
  shutdown_worker() calls runner.close() -> loop.stop(). Calling
  loop.stop() from within loop.run_until_complete() leaves a stale
  _run_until_complete_cb in the event loop's ready queue. This callback
  calls loop.stop() when it fires, which poisons the next
  run_until_complete() call on the same loop with:
    RuntimeError: Event loop stopped before Future completed

Why the next test is affected:
  The event_loop fixture in conftest.py returns manager.get_runner().loop.
  After test_shutdown_worker, this loop still has the stale callback.
  When a later test (e.g. test_calc_job_node_get_builder_restart) calls
  run_until_complete() on the same loop, the stale callback fires
  loop.stop() prematurely.

Why the _reset_runner autouse fixture masked the issue:
  The _reset_runner fixture called manager.reset_runner() after every
  test, which called runner.close() -> loop.close(). This closed the
  loop entirely, so get_or_create_event_loop() created a fresh loop
  for the next test without the stale callback. Removing this fixture
  exposed the underlying bug.

Why run_forever fixes it:
  In the real daemon, start_daemon_worker() uses loop.run_forever(),
  not loop.run_until_complete(). run_forever() does not add a
  _run_until_complete_cb, so loop.stop() cleanly exits the loop with
  no stale callbacks. The fix changes the test to use the same pattern:
  schedule shutdown_worker as a task, then call run_forever(). This
  mirrors production behavior and avoids the stale callback.
@agoscinski agoscinski marked this pull request as draft April 20, 2026 12:35
@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch from e69634e to 41dac39 Compare April 20, 2026 12:37
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Apr 20, 2026
Collaborator Author

@agoscinski agoscinski left a comment


Summary

So the CI errored out after the rebase. The problem is that we have tests that share a runner. I think it is again a flaky test, and we were lucky that it was reproduced after the rebase.

Problem description

So these run_get_node functions all secretly use a manager in the tests. They do not use the manager fixture; they use get_manager() from aiida. Further, the manager is a singleton, and manager.get_runner() returns the same runner once initialized. Lastly, as long as the event loop is open, all runners get the same event loop via plumpy.get_or_create_event_loop, because we want nested aiida Process calls (e.g. a calcfunction in a WorkChain) to fill their tasks into the same event loop. The runner fixture, being a well-behaved fixture, resets its event loop on teardown.

So tests/engine/test_process_function.py::test_plugin_version creates the runner with get_manager().get_runner() implicitly through run_get_node. Then the next test, tests/engine/test_runners.py::test_call_on_process_finish, uses the runner fixture, which gets the same event loop through plumpy.get_or_create_event_loop; the fixture then tears down and closes the event loop. Now the event loop from get_manager().get_runner() is invalid. The next test, tests/engine/test_runners.py::test_run_return_value_cached, gets the same runner that tests/engine/test_process_function.py::test_plugin_version produced through get_manager().get_runner(). The event loop is now closed, however, and the check whether the event loop is closed is only done on initialization in get_runner(), while this is the second call, which uses the cached runner. The whole thing as an ASCII graphic:

`tests/engine/test_process_function.py::test_plugin_version`
    |
    | creates / populates
    v
Manager singleton
└── _runner = Runner A
    └── loop = EventLoop L

`tests/engine/test_runners.py::test_call_on_process_finish`
    |
    | creates local fixture runner
    v
Runner B
└── loop = EventLoop L
    |
    | fixture teardown calls `runner.close()`
    v
closes EventLoop L


`tests/engine/test_runners.py::test_run_return_value_cached`
    |
    | later reuses Runner A via `get_manager().get_runner()`
    v
uses EventLoop L (already closed)
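The sequence above can be reproduced with plain asyncio objects standing in for Runner A and Runner B (FakeRunner is a simplified sketch, assuming only that both runners share one loop, as plumpy.get_or_create_event_loop arranges):

```python
import asyncio

class FakeRunner:
    """Stand-in for aiida's Runner: holds a loop reference, closes it on close()."""
    def __init__(self, loop):
        self.loop = loop

    def run(self, coro):
        return self.loop.run_until_complete(coro)

    def close(self):
        self.loop.close()

# EventLoop L, shared the way plumpy.get_or_create_event_loop shares it.
shared_loop = asyncio.new_event_loop()

runner_a = FakeRunner(shared_loop)  # cached on the Manager singleton
runner_b = FakeRunner(shared_loop)  # local runner-fixture instance, same loop

assert runner_a.run(asyncio.sleep(0, result=1)) == 1  # first test passes

runner_b.close()  # fixture teardown closes EventLoop L

# A later test reuses the cached runner_a, whose loop is now closed.
try:
    runner_a.run(asyncio.sleep(0, result=3))
except RuntimeError as exc:
    print(f"cached runner fails: {exc}")
```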

Fix

Well, the easiest fix was to make the runner fixture depend on the manager fixture. That way the runner fixture also becomes singleton-ish throughout the tests, and we do not have two runner instances conflicting with each other.

Future work

There is a bigger subsequent question: should Runner not also be a singleton, given the fact that two Runner instances can share the same event loop? Maybe the design decision to simplify the nested calling of aiida processes by reusing the same event loop through get_or_create_event_loop was not a good call, because it makes Runners share state. But I don't know how to solve this in a simple way. In the serialization of the process state we do not persist the event loop, so on recovery, assuming that there is only one event loop across all runners simplified this problem. On the other hand, we have the runner singletonized by manager.get_runner, so this should be a fine solution. It is just that the manager only exists in aiida-core and not in plumpy. So maybe we improve this when we move plumpy into aiida-core.

agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Apr 20, 2026
Prevent tests from closing the shared manager runner directly.

The original CI failure was caused by tests creating a custom
runner that could still share the event loop of the cached global
manager runner. When the custom runner was closed in teardown, it
closed that shared loop as well. A later test then reused the
cached manager runner and failed because it still pointed to the
closed loop.

This commit makes that contract explicit in the test fixtures.
Tests are no longer allowed to close the global manager runner.
Instead, tests that need shutdown semantics must use isolated
runner fixtures with their own event loop. For tests that need to
exercise manager-based shutdown, an isolated runner can be
temporarily installed as the manager runner.

The CalcJob caching test is also adjusted to clear the plugin
version cache directly instead of resetting the global runner,
which would now violate the stronger isolation policy.
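The "isolated runner with its own event loop" idea can be sketched without any aiida machinery (isolated_loop is a hypothetical helper, not the fixture actually added by the commit):

```python
import asyncio
from contextlib import contextmanager

@contextmanager
def isolated_loop():
    """Give a test its own event loop, so closing it cannot break the shared one."""
    loop = asyncio.new_event_loop()
    try:
        yield loop
    finally:
        if not loop.is_closed():
            loop.close()

# A shutdown-style test may freely stop and close this loop...
with isolated_loop() as loop:
    loop.call_soon(loop.stop)
    loop.run_forever()

# ...while the shared loop other tests rely on stays untouched.
shared = asyncio.new_event_loop()
assert shared.run_until_complete(asyncio.sleep(0, result="ok")) == "ok"
shared.close()
print("shared loop unaffected")
```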
@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch from 200cabb to 4087193 Compare April 20, 2026 17:19
@agoscinski agoscinski force-pushed the fix-stale-task-in-tests branch from 4087193 to 8ca8906 Compare April 20, 2026 17:30
@agoscinski agoscinski marked this pull request as ready for review April 20, 2026 17:30
Collaborator

@khsrali khsrali left a comment


LGTM!

@agoscinski agoscinski merged commit d894b9a into aiidateam:main Apr 21, 2026
15 checks passed
agoscinski added a commit that referenced this pull request Apr 21, 2026
@agoscinski agoscinski deleted the fix-stale-task-in-tests branch April 21, 2026 08:58
