Skip to content

Fix flaky test_unload_profile by properly removing scoped session#7281

Merged
agoscinski merged 3 commits into
aiidateam:mainfrom
agoscinski:fix-weakref
Mar 26, 2026
Merged

Fix flaky test_unload_profile by properly removing scoped session#7281
agoscinski merged 3 commits into
aiidateam:mainfrom
agoscinski:fix-weakref

Conversation

@agoscinski
Copy link
Copy Markdown
Collaborator

@agoscinski agoscinski commented Mar 11, 2026

Fixes flaky test, e.g. in https://github.com/agoscinski/aiida-core/actions/runs/22924322496/job/66530719671

FAILED tests/storage/psql_dos/test_backend.py::test_unload_profile - AssertionError: <WeakValueDictionary at 0x7f74710caea0>
assert 1 == (1 - 1)
 +  where 1 = len(<WeakValueDictionary at 0x7f74710caea0>)
= 1 failed, 3422 passed, 54 skipped, 1 xfailed, 42 warnings in 1338.81s (0:22:18) =

Commit 1: Missing remove in storage backend

The test_unload_profile test checks that SQLAlchemy's internal _sessions WeakValueDictionary shrinks after unloading the profile. This test was flaky for two reasons:

  1. The backend close() method called scoped_session.close() but not scoped_session.remove(). close() releases the database connection and resets the session state, but the session object remains registered in the scoped_session's thread-local registry. Since _sessions is a WeakValueDictionary, entries are only removed when all strong references to the session are gone. The thread-local registry held a strong reference, preventing cleanup. Adding remove() explicitly clears the session from the thread-local registry, which drops the last strong reference and lets the WeakValueDictionary entry be collected immediately.

    This removes the gc.collect() workaround in close() that was previously needed to trigger cleanup of the weak reference. With remove(), the reference is dropped deterministically.

  2. The test assumed that previous tests had already created at least one session (assert current_sessions != 0). When run in isolation, no sessions exist and the test fails at the first assertion. Fix by explicitly creating a session via get_session() before checking counts.

    The gc.collect() in the test is also removed. It was needed to clean up sessions leaked by earlier tests whose close() didn't call remove(), so sessions lingered in the WeakValueDictionary until GC collected them. Now that close() calls remove(), sessions from previous tests are already gone deterministically.

Script reproducing the error, showing that remove really removes it

"""Reproduce the scoped_session leak in SQLAlchemy's _sessions WeakValueDictionary.

Demonstrates that scoped_session.close() does NOT remove the session from
SQLAlchemy's global _sessions registry, while scoped_session.remove() does.

This is the root cause of the flaky test_unload_profile test.
"""
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.orm.session import _sessions


engine = create_engine('sqlite://')

# === Without remove(): session lingers in _sessions ===
print('--- close() without remove() ---')
before = len(_sessions)
factory = scoped_session(sessionmaker(bind=engine))
factory()  # creates and registers session in _sessions
print(f'After creating session: {len(_sessions) - before} new entries in _sessions')

factory.expunge_all()
factory.close()
after_close = len(_sessions)
print(f'After close(): {after_close - before} entries remain (expected 0, session leaked!)')

# === With remove(): session is cleaned up ===
print('\n--- close() + remove() ---')
before = len(_sessions)
factory2 = scoped_session(sessionmaker(bind=engine))
factory2()
print(f'After creating session: {len(_sessions) - before} new entries in _sessions')

factory2.expunge_all()
factory2.close()
factory2.remove()
after_remove = len(_sessions)
print(f'After close() + remove(): {after_remove - before} entries remain (cleaned up!)')

engine.dispose()

Commit 2: Cyclic ownership between SqlaQueryBuilder and SqlaJoiner

Read second commit message.

Commit 3: Remove unnecessary monkeypatching of the trajectory

Covers what is described in the comment #7281 (comment)

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.86%. Comparing base (46b6f66) to head (7ee8e51).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7281      +/-   ##
==========================================
- Coverage   79.87%   79.86%   -0.01%     
==========================================
  Files         566      566              
  Lines       43938    43936       -2     
==========================================
- Hits        35091    35084       -7     
- Misses       8847     8852       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@agoscinski agoscinski changed the base branch from main to fix-leak March 11, 2026 09:02
@agoscinski agoscinski changed the base branch from fix-leak to main March 11, 2026 09:02
@agoscinski agoscinski changed the base branch from main to fix-leak March 11, 2026 10:19
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Mar 12, 2026
…idateam#7281)

The test_unload_profile test checks that SQLAlchemy's internal
_sessions WeakValueDictionary shrinks after unloading the profile.
This test was flaky for two reasons:

1. The backend close() method called scoped_session.close() but not
   scoped_session.remove(). close() releases the database connection
   and resets the session state, but the session object remains
   registered in the scoped_session's thread-local registry. Since
   _sessions is a WeakValueDictionary, entries are only removed when
   all strong references to the session are gone. The thread-local
   registry held a strong reference, preventing cleanup. Adding
   remove() explicitly clears the session from the thread-local
   registry, which drops the last strong reference and lets the
   WeakValueDictionary entry be collected immediately.

   This removes the gc.collect() workaround in close() that was
   previously needed to trigger cleanup of the weak reference. With
   remove(), the reference is dropped deterministically.

2. The test assumed that previous tests had already created at least
   one session (assert current_sessions != 0). When run in isolation,
   no sessions exist and the test fails at the first assertion. Fix
   by explicitly creating a session via get_session() before checking
   counts.

   The gc.collect() in the test is also removed. It was needed to
   clean up sessions leaked by earlier tests whose close() didn't
   call remove(), so sessions lingered in the WeakValueDictionary
   until GC collected them. Now that close() calls remove(),
   sessions from previous tests are already gone deterministically.
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Mar 12, 2026
…SqlaJoiner (aiidateam#7281)

When `StorageBackend.default_user` is accessed, it creates a `QueryBuilder`
which internally creates a `SqlaQueryBuilder` and a `SqlaJoiner`. These two
objects formed a reference cycle:

  SqlaQueryBuilder._joiner → SqlaJoiner
  SqlaJoiner._entities     → SqlaQueryBuilder
  SqlaJoiner._build_filters → bound method → SqlaQueryBuilder

Because of this cycle, when the QueryBuilder went out of scope (e.g. after
`default_user` returned), CPython's reference counting could not reclaim
the objects immediately — they had to wait for the cyclic garbage collector.

The problem is that SqlaQueryBuilder._query_cache holds a BuiltQuery, which
contains an SQLAlchemy Query object that holds a strong reference to the
Session. So the ownership chain was:

  SqlaQueryBuilder (prevented from collection by cycle)
    → _query_cache → BuiltQuery → Query → Session

This meant the Session object stayed alive after `storage.close()` called
`session_factory.remove()`, because the weak reference in SQLAlchemy's
`_sessions` WeakValueDictionary still had a live referent. The session
would only be cleaned up when the cyclic GC eventually ran, making the
`test_unload_profile` test flaky depending on test ordering (it failed
when `test_default_user` ran first and populated the `_default_user` cache).

The fix breaks the cycle by:

1. Using `weakref.proxy(entity_mapper)` in SqlaJoiner instead of a strong
   reference. The joiner only uses entity_mapper for class-level attribute
   access (Node, User, Link, etc.) and calling build_filters(), so a proxy
   is safe — the SqlaQueryBuilder always outlives its SqlaJoiner.

2. Removing the separate `_build_filters` callback (which was a bound
   method creating a second strong back-reference). SqlaJoiner now calls
   `self._entities.build_filters()` through the weakref proxy instead.
@agoscinski agoscinski marked this pull request as ready for review March 12, 2026 07:01
@agoscinski agoscinski requested a review from GeigerJ2 March 12, 2026 07:03
@agoscinski agoscinski added the pr/blocked PR is blocked by another PR that should be merged first label Mar 12, 2026
@GeigerJ2
Copy link
Copy Markdown
Collaborator

GeigerJ2 commented Mar 23, 2026

Hm, with the fixes of the PR in place, I believe the mpl_pos workaround in tests/cmdline/commands/test_data.py can also be removed (found this randomly, by accident), see here:

if fmt in ['mpl_pos']:
# This has to be mocked because ``plot_positions_xyz`` imports ``matplotlib.pyplot`` and for some completely
# unknown reason, causes ``tests/storage/psql_dos/test_backend.py::test_unload_profile`` to fail. For some
# reason, merely importing ``matplotlib`` (even here directly in the test) will cause that test to claim
# that there still is something holding on to a reference of an sqlalchemy session that it keeps track of
# in the ``sqlalchemy.orm.session._sessions`` weak ref dictionary. Since it is impossible to figure out why
# the hell importing matplotlib would interact with sqlalchemy sessions, the function that does the import
# is simply mocked out for now.
from aiida.orm.nodes.data.array import trajectory
monkeypatch.setattr(trajectory, 'plot_positions_XYZ', lambda *args, **kwargs: None)
run_cli_command(cmd_trajectory.trajectory_show, options, use_subprocess=False)

This mock existed because importing matplotlib kept an extra reference alive in _sessions, which made test_unload_profile fail when it relied on gc.collect() for cleanup. Now that session cleanup is deterministic (remove() + broken reference cycle), it shouldn't matter what matplotlib does behind the scenes. Might be worth removing it in this PR and add a test? (adding it already, before posting my pending review comments)

Copy link
Copy Markdown
Collaborator

@GeigerJ2 GeigerJ2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, what a journey this PR review was. Great work, thanks, @agoscinski! Just one minor nitpick, but overall, let's get this baby in 🚀

Comment thread src/aiida/storage/psql_dos/backend.py Outdated
Copy link
Copy Markdown
Collaborator

@danielhollas danielhollas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, who knew we had gc.collect so deep in the storage innards. I feel like I should block some time to study this PR more in depth 😁

…idateam#7281)

The test_unload_profile test checks that SQLAlchemy's internal
_sessions WeakValueDictionary shrinks after unloading the profile.
This test was flaky for two reasons:

1. The backend close() method called scoped_session.close() but not
   scoped_session.remove(). close() releases the database connection
   and resets the session state, but the session object remains
   registered in the scoped_session's thread-local registry. Since
   _sessions is a WeakValueDictionary, entries are only removed when
   all strong references to the session are gone. The thread-local
   registry held a strong reference, preventing cleanup. Adding
   remove() explicitly clears the session from the thread-local
   registry, which drops the last strong reference and lets the
   WeakValueDictionary entry be collected immediately.

   This removes the gc.collect() workaround in close() that was
   previously needed to trigger cleanup of the weak reference. With
   remove(), the reference is dropped deterministically.

2. The test assumed that previous tests had already created at least
   one session (assert current_sessions != 0). When run in isolation,
   no sessions exist and the test fails at the first assertion. Fix
   by explicitly creating a session via get_session() before checking
   counts.

   The gc.collect() in the test is also removed. It was needed to
   clean up sessions leaked by earlier tests whose close() didn't
   call remove(), so sessions lingered in the WeakValueDictionary
   until GC collected them. Now that close() calls remove(),
   sessions from previous tests are already gone deterministically.
…SqlaJoiner (aiidateam#7281)

When `StorageBackend.default_user` is accessed, it creates a `QueryBuilder`
which internally creates a `SqlaQueryBuilder` and a `SqlaJoiner`. These two
objects formed a reference cycle:

  SqlaQueryBuilder._joiner → SqlaJoiner
  SqlaJoiner._entities     → SqlaQueryBuilder
  SqlaJoiner._build_filters → bound method → SqlaQueryBuilder

Because of this cycle, when the QueryBuilder went out of scope (e.g. after
`default_user` returned), CPython's reference counting could not reclaim
the objects immediately — they had to wait for the cyclic garbage collector.

The problem is that SqlaQueryBuilder._query_cache holds a BuiltQuery, which
contains an SQLAlchemy Query object that holds a strong reference to the
Session. So the ownership chain was:

  SqlaQueryBuilder (prevented from collection by cycle)
    → _query_cache → BuiltQuery → Query → Session

This meant the Session object stayed alive after `storage.close()` called
`session_factory.remove()`, because the weak reference in SQLAlchemy's
`_sessions` WeakValueDictionary still had a live referent. The session
would only be cleaned up when the cyclic GC eventually ran, making the
`test_unload_profile` test flaky depending on test ordering (it failed
when `test_default_user` ran first and populated the `_default_user` cache).

The fix breaks the cycle by:

1. Using `weakref.proxy(entity_mapper)` in SqlaJoiner instead of a strong
   reference. The joiner only uses entity_mapper for class-level attribute
   access (Node, User, Link, etc.) and calling build_filters(), so a proxy
   is safe — the SqlaQueryBuilder always outlives its SqlaJoiner.

2. Removing the separate `_build_filters` callback (which was a bound
   method creating a second strong back-reference). SqlaJoiner now calls
   `self._entities.build_filters()` through the weakref proxy instead.
@agoscinski agoscinski changed the base branch from fix-leak to main March 26, 2026 13:22
@agoscinski agoscinski removed the pr/blocked PR is blocked by another PR that should be merged first label Mar 26, 2026
The mock existed because importing matplotlib kept an extra reference
alive in sqlalchemy's _sessions weakref dict, causing test_unload_profile
to fail. With deterministic session cleanup (remove() + broken reference
cycle), this workaround is no longer needed.

Co-Authored-By: Julian Geiger <julian.geiger@psi.ch>
@agoscinski agoscinski enabled auto-merge (rebase) March 26, 2026 13:43
@agoscinski agoscinski merged commit 82d5ca7 into aiidateam:main Mar 26, 2026
16 checks passed
agoscinski added a commit that referenced this pull request Mar 26, 2026
)

The test_unload_profile test checks that SQLAlchemy's internal
_sessions WeakValueDictionary shrinks after unloading the profile.
This test was flaky for two reasons:

1. The backend close() method called scoped_session.close() but not
   scoped_session.remove(). close() releases the database connection
   and resets the session state, but the session object remains
   registered in the scoped_session's thread-local registry. Since
   _sessions is a WeakValueDictionary, entries are only removed when
   all strong references to the session are gone. The thread-local
   registry held a strong reference, preventing cleanup. Adding
   remove() explicitly clears the session from the thread-local
   registry, which drops the last strong reference and lets the
   WeakValueDictionary entry be collected immediately.

   This removes the gc.collect() workaround in close() that was
   previously needed to trigger cleanup of the weak reference. With
   remove(), the reference is dropped deterministically.

2. The test assumed that previous tests had already created at least
   one session (assert current_sessions != 0). When run in isolation,
   no sessions exist and the test fails at the first assertion. Fix
   by explicitly creating a session via get_session() before checking
   counts.

   The gc.collect() in the test is also removed. It was needed to
   clean up sessions leaked by earlier tests whose close() didn't
   call remove(), so sessions lingered in the WeakValueDictionary
   until GC collected them. Now that close() calls remove(),
   sessions from previous tests are already gone deterministically.
agoscinski added a commit that referenced this pull request Mar 26, 2026
…SqlaJoiner (#7281)

When `StorageBackend.default_user` is accessed, it creates a `QueryBuilder`
which internally creates a `SqlaQueryBuilder` and a `SqlaJoiner`. These two
objects formed a reference cycle:

  SqlaQueryBuilder._joiner → SqlaJoiner
  SqlaJoiner._entities     → SqlaQueryBuilder
  SqlaJoiner._build_filters → bound method → SqlaQueryBuilder

Because of this cycle, when the QueryBuilder went out of scope (e.g. after
`default_user` returned), CPython's reference counting could not reclaim
the objects immediately — they had to wait for the cyclic garbage collector.

The problem is that SqlaQueryBuilder._query_cache holds a BuiltQuery, which
contains an SQLAlchemy Query object that holds a strong reference to the
Session. So the ownership chain was:

  SqlaQueryBuilder (prevented from collection by cycle)
    → _query_cache → BuiltQuery → Query → Session

This meant the Session object stayed alive after `storage.close()` called
`session_factory.remove()`, because the weak reference in SQLAlchemy's
`_sessions` WeakValueDictionary still had a live referent. The session
would only be cleaned up when the cyclic GC eventually ran, making the
`test_unload_profile` test flaky depending on test ordering (it failed
when `test_default_user` ran first and populated the `_default_user` cache).

The fix breaks the cycle by:

1. Using `weakref.proxy(entity_mapper)` in SqlaJoiner instead of a strong
   reference. The joiner only uses entity_mapper for class-level attribute
   access (Node, User, Link, etc.) and calling build_filters(), so a proxy
   is safe — the SqlaQueryBuilder always outlives its SqlaJoiner.

2. Removing the separate `_build_filters` callback (which was a bound
   method creating a second strong back-reference). SqlaJoiner now calls
   `self._entities.build_filters()` through the weakref proxy instead.
@agoscinski agoscinski deleted the fix-weakref branch March 26, 2026 13:46
danielhollas pushed a commit to danielhollas/aiida-core that referenced this pull request Mar 27, 2026
…idateam#7281)

The test_unload_profile test checks that SQLAlchemy's internal
_sessions WeakValueDictionary shrinks after unloading the profile.
This test was flaky for two reasons:

1. The backend close() method called scoped_session.close() but not
   scoped_session.remove(). close() releases the database connection
   and resets the session state, but the session object remains
   registered in the scoped_session's thread-local registry. Since
   _sessions is a WeakValueDictionary, entries are only removed when
   all strong references to the session are gone. The thread-local
   registry held a strong reference, preventing cleanup. Adding
   remove() explicitly clears the session from the thread-local
   registry, which drops the last strong reference and lets the
   WeakValueDictionary entry be collected immediately.

   This removes the gc.collect() workaround in close() that was
   previously needed to trigger cleanup of the weak reference. With
   remove(), the reference is dropped deterministically.

2. The test assumed that previous tests had already created at least
   one session (assert current_sessions != 0). When run in isolation,
   no sessions exist and the test fails at the first assertion. Fix
   by explicitly creating a session via get_session() before checking
   counts.

   The gc.collect() in the test is also removed. It was needed to
   clean up sessions leaked by earlier tests whose close() didn't
   call remove(), so sessions lingered in the WeakValueDictionary
   until GC collected them. Now that close() calls remove(),
   sessions from previous tests are already gone deterministically.
danielhollas pushed a commit to danielhollas/aiida-core that referenced this pull request Mar 27, 2026
…SqlaJoiner (aiidateam#7281)

When `StorageBackend.default_user` is accessed, it creates a `QueryBuilder`
which internally creates a `SqlaQueryBuilder` and a `SqlaJoiner`. These two
objects formed a reference cycle:

  SqlaQueryBuilder._joiner → SqlaJoiner
  SqlaJoiner._entities     → SqlaQueryBuilder
  SqlaJoiner._build_filters → bound method → SqlaQueryBuilder

Because of this cycle, when the QueryBuilder went out of scope (e.g. after
`default_user` returned), CPython's reference counting could not reclaim
the objects immediately — they had to wait for the cyclic garbage collector.

The problem is that SqlaQueryBuilder._query_cache holds a BuiltQuery, which
contains an SQLAlchemy Query object that holds a strong reference to the
Session. So the ownership chain was:

  SqlaQueryBuilder (prevented from collection by cycle)
    → _query_cache → BuiltQuery → Query → Session

This meant the Session object stayed alive after `storage.close()` called
`session_factory.remove()`, because the weak reference in SQLAlchemy's
`_sessions` WeakValueDictionary still had a live referent. The session
would only be cleaned up when the cyclic GC eventually ran, making the
`test_unload_profile` test flaky depending on test ordering (it failed
when `test_default_user` ran first and populated the `_default_user` cache).

The fix breaks the cycle by:

1. Using `weakref.proxy(entity_mapper)` in SqlaJoiner instead of a strong
   reference. The joiner only uses entity_mapper for class-level attribute
   access (Node, User, Link, etc.) and calling build_filters(), so a proxy
   is safe — the SqlaQueryBuilder always outlives its SqlaJoiner.

2. Removing the separate `_build_filters` callback (which was a bound
   method creating a second strong back-reference). SqlaJoiner now calls
   `self._entities.build_filters()` through the weakref proxy instead.
danielhollas pushed a commit to danielhollas/aiida-core that referenced this pull request Mar 27, 2026
The mock existed because importing matplotlib kept an extra reference
alive in sqlalchemy's _sessions weakref dict, causing test_unload_profile
to fail. With deterministic session cleanup (remove() + broken reference
cycle), this workaround is no longer needed.

Co-Authored-By: Julian Geiger <julian.geiger@psi.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants