Fix flaky test_unload_profile by properly removing scoped session #7281
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #7281      +/-   ##
==========================================
- Coverage   79.87%   79.86%   -0.01%
==========================================
  Files         566      566
  Lines       43938    43936       -2
==========================================
- Hits        35091    35084       -7
- Misses       8847     8852       +5
…idateam#7281)

The test_unload_profile test checks that SQLAlchemy's internal _sessions WeakValueDictionary shrinks after unloading the profile. This test was flaky for two reasons:

1. The backend close() method called scoped_session.close() but not scoped_session.remove(). close() releases the database connection and resets the session state, but the session object remains registered in the scoped_session's thread-local registry. Since _sessions is a WeakValueDictionary, entries are only removed when all strong references to the session are gone. The thread-local registry held a strong reference, preventing cleanup. Adding remove() explicitly clears the session from the thread-local registry, which drops the last strong reference and lets the WeakValueDictionary entry be collected immediately.

   This removes the gc.collect() workaround in close() that was previously needed to trigger cleanup of the weak reference. With remove(), the reference is dropped deterministically.

2. The test assumed that previous tests had already created at least one session (assert current_sessions != 0). When run in isolation, no sessions exist and the test fails at the first assertion. Fix by explicitly creating a session via get_session() before checking counts.

   The gc.collect() in the test is also removed. It was needed to clean up sessions leaked by earlier tests whose close() didn't call remove(), so sessions lingered in the WeakValueDictionary until GC collected them. Now that close() calls remove(), sessions from previous tests are already gone deterministically.
…SqlaJoiner (aiidateam#7281)

When `StorageBackend.default_user` is accessed, it creates a `QueryBuilder` which internally creates a `SqlaQueryBuilder` and a `SqlaJoiner`. These two objects formed a reference cycle:

    SqlaQueryBuilder._joiner → SqlaJoiner
    SqlaJoiner._entities → SqlaQueryBuilder
    SqlaJoiner._build_filters → bound method → SqlaQueryBuilder

Because of this cycle, when the QueryBuilder went out of scope (e.g. after `default_user` returned), CPython's reference counting could not reclaim the objects immediately; they had to wait for the cyclic garbage collector.

The problem is that SqlaQueryBuilder._query_cache holds a BuiltQuery, which contains an SQLAlchemy Query object that holds a strong reference to the Session. So the ownership chain was:

    SqlaQueryBuilder (prevented from collection by cycle)
      → _query_cache → BuiltQuery → Query → Session

This meant the Session object stayed alive after `storage.close()` called `session_factory.remove()`, because the weak reference in SQLAlchemy's `_sessions` WeakValueDictionary still had a live referent. The session would only be cleaned up when the cyclic GC eventually ran, making the `test_unload_profile` test flaky depending on test ordering (it failed when `test_default_user` ran first and populated the `_default_user` cache).

The fix breaks the cycle by:

1. Using `weakref.proxy(entity_mapper)` in SqlaJoiner instead of a strong reference. The joiner only uses entity_mapper for class-level attribute access (Node, User, Link, etc.) and calling build_filters(), so a proxy is safe: the SqlaQueryBuilder always outlives its SqlaJoiner.

2. Removing the separate `_build_filters` callback (which was a bound method creating a second strong back-reference). SqlaJoiner now calls `self._entities.build_filters()` through the weakref proxy instead.
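The cycle described in this commit message can be illustrated with a minimal standard-library sketch. The classes `Session`, `Builder` and `Joiner` below are hypothetical stand-ins, not AiiDA's or SQLAlchemy's classes, and the behaviour shown relies on CPython's reference counting. With a strong back-reference the session outlives the builder until the cyclic collector runs; with `weakref.proxy` it is released as soon as the builder goes out of scope.

```python
import gc
import weakref


class Session:
    """Hypothetical stand-in for the resource that should be released promptly."""


class Joiner:
    def __init__(self, entities, use_proxy):
        # A strong back-reference closes a cycle; a weak proxy does not.
        self._entities = weakref.proxy(entities) if use_proxy else entities

    def filters(self):
        # Reach the builder's method through the (possibly weak) reference.
        return self._entities.build_filters()


class Builder:
    def __init__(self, session, use_proxy):
        self._session = session
        self._joiner = Joiner(self, use_proxy)

    def build_filters(self):
        return "filters"


def session_survives(use_proxy):
    """Return True if the session is still alive after the builder is dropped."""
    session = Session()
    alive = weakref.ref(session)
    builder = Builder(session, use_proxy)
    assert builder._joiner.filters() == "filters"
    del builder, session
    return alive() is not None


gc.disable()  # keep the cyclic collector from running during the demonstration
try:
    with_cycle = session_survives(use_proxy=False)
    without_cycle = session_survives(use_proxy=True)
finally:
    gc.enable()
gc.collect()  # reclaim the cycle left behind by the first call
print(with_cycle, without_cycle)
```

Storing a bound method such as `builder.build_filters` on the joiner would have the same effect as the strong back-reference, since a bound method keeps its instance alive through `__self__`.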
Hm, with the fixes of the PR in place, I believe the aiida-core/tests/cmdline/commands/test_data.py Lines 580 to 592 in d17d09e This mock existed because importing matplotlib kept an extra reference alive in |
GeigerJ2
left a comment
Wow, what a journey this PR review was. Great work, thanks, @agoscinski! Just one minor nitpick, but overall, let's get this baby in 🚀
danielhollas
left a comment
Amazing, who knew we had gc.collect so deep in the storage innards. I feel like I should block some time to study this PR more in depth 😁
The mock existed because importing matplotlib kept an extra reference alive in sqlalchemy's _sessions weakref dict, causing test_unload_profile to fail. With deterministic session cleanup (remove() + broken reference cycle), this workaround is no longer needed.

Co-Authored-By: Julian Geiger <julian.geiger@psi.ch>
Fixes flaky test, e.g. in https://github.com/agoscinski/aiida-core/actions/runs/22924322496/job/66530719671
Commit 1: Missing remove in storage backend
The test_unload_profile test checks that SQLAlchemy's internal _sessions WeakValueDictionary shrinks after unloading the profile. This test was flaky for two reasons:

1. The backend close() method called scoped_session.close() but not scoped_session.remove(). close() releases the database connection and resets the session state, but the session object remains registered in the scoped_session's thread-local registry. Since _sessions is a WeakValueDictionary, entries are only removed when all strong references to the session are gone. The thread-local registry held a strong reference, preventing cleanup. Adding remove() explicitly clears the session from the thread-local registry, which drops the last strong reference and lets the WeakValueDictionary entry be collected immediately.

   This removes the gc.collect() workaround in close() that was previously needed to trigger cleanup of the weak reference. With remove(), the reference is dropped deterministically.

2. The test assumed that previous tests had already created at least one session (assert current_sessions != 0). When run in isolation, no sessions exist and the test fails at the first assertion. Fix by explicitly creating a session via get_session() before checking counts.

   The gc.collect() in the test is also removed. It was needed to clean up sessions leaked by earlier tests whose close() didn't call remove(), so sessions lingered in the WeakValueDictionary until GC collected them. Now that close() calls remove(), sessions from previous tests are already gone deterministically.
Script reproducing the error, showing that remove really removes it
Commit 2: Cyclic ownership between SqlaQueryBuilder and SqlaJoiner
See the message of the second commit.
Commit 3: Remove unnecessary monkeypatching of the trajectory
Covers what is described in the comment #7281 (comment)