🐛 aiida_localhost: suffix label with xdist-worker id #7363
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```diff
@@           Coverage Diff            @@
##             main    #7363    +/-  ##
========================================
- Coverage   80.28%   80.28%   -0.00%
========================================
  Files         577      577
  Lines       45543    45545       +2
========================================
+ Hits        36560    36561       +1
- Misses       8983     8984       +1
```

☔ View full report in Codecov by Sentry.
Force-pushed from cab2415 to 4c514a4.

Force-pushed from 268b5a2 to f270550.
```python
@pytest.fixture(autouse=True)
def _setup(self, aiida_localhost):
    """Setup for methods."""
```
Switched from a unittest-style `setup_method` to a pytest autouse fixture because we need to inject `aiida_localhost`. `setup_method` can't receive fixtures as parameters, and the previous body was creating a literal-`'localhost'` `Computer` directly via `Computer.collection.get_or_create(...)`, which is exactly the kind of fixture-bypass the rest of this PR is auditing away. With the worker-suffixed `aiida_localhost`, that direct creation would leave a stray literal-`'localhost'` row alongside the fixture's `localhost-{worker_id}` row.

Renamed to `_setup` (rather than keeping `setup_method`) because `setup_method` is pytest's xunit-style class setup hook: it's invoked directly with signature `(self, method)`. If we kept that name and added `@pytest.fixture(autouse=True)`, pytest's xunit dispatch would still try to call it as `setup_method(self, method)`, which mismatches our `(self, aiida_localhost)` signature. The leading underscore is purely a convention to flag it as an internal autouse fixture, not meant to be referenced by name.
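For future readers, a minimal before/after sketch of the pattern change; the old body is paraphrased from the description above and the fixture body here is illustrative, not the exact code in the PR:

```python
import pytest

class TestNode:
    # Before (paraphrased): pytest's xunit hook, invoked as
    # setup_method(self, method), cannot request fixtures, so the old body
    # built a literal-'localhost' Computer via
    # Computer.collection.get_or_create(...) directly.
    #
    # After: an autouse fixture, which *can* request aiida_localhost.
    @pytest.fixture(autouse=True)
    def _setup(self, aiida_localhost):
        """Attach the worker-suffixed localhost computer before each test."""
        self.computer = aiida_localhost
```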
```diff
 label='dummy_code',
 default_calc_job_plugin='core.stash',
-computer=orm.load_computer(label='localhost'),
+computer=aiida_localhost,
```
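A hedged usage sketch mirroring that diff: `'dummy_code'` and `'core.stash'` come from the snippet above, while the `InstalledCode` class, executable path, and test name are illustrative assumptions:

```python
from aiida import orm

def test_dummy_code(aiida_localhost):
    # The computer comes from the fixture; never orm.load_computer('localhost'),
    # which would miss the worker-suffixed row.
    code = orm.InstalledCode(
        label='dummy_code',
        default_calc_job_plugin='core.stash',
        computer=aiida_localhost,
        filepath_executable='/bin/true',
    ).store()
    assert code.computer.label == aiida_localhost.label
```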
As a note here, also relevant to the other test changes in this PR:

The audit was deliberately scoped to test code only. Production sites that create or load a literal-`'localhost'` `Computer` are intentionally untouched: `src/aiida/cmdline/commands/cmd_presto.py`, `src/aiida/cmdline/commands/cmd_devel.py` (`prepare_localhost`), `src/aiida/calculations/stash.py`, and `src/aiida/calculations/unstash.py` all stay as-is. Those define the user-facing contract that "the localhost computer's label is `'localhost'`", which is what users see after running `verdi presto`. Routing them through `aiida_localhost` would either change that contract or require a fixture-aware code path in production, neither of which we want. The fix here lives entirely in the test fixture, so production behaviour is unchanged.

(Side note unrelated to this PR: `cmd_devel.py:271` has a stale docstring referencing `aiida.engine.launch_shell_job`, which doesn't exist in the codebase; `prepare_localhost` is the actual function. Worth a one-line follow-up but out of scope here.)
```diff
@@ -114,10 +123,23 @@ def _build() -> 'Computer':
     try:
         computer.store()
     except IntegrityError:
```
This is still literal Claude output that I have not verified, but it seems sensible:
While putting this fix together I traced the test fixture wiring end-to-end to verify the framing in #7347 (and in @agoscinski's #7349). Two things worth knowing:
- **Workers don't share databases in the test setup.** `aiida_config` uses `tmp_path_factory.mktemp(secrets.token_hex(16))`; under xdist each worker has its own `basetemp` (`<basetemp>/popen-<worker_id>`, set by `xdist/workermanage.py:339-341`) and its own per-process random state, so each worker's SQLite file lives at a unique path. For PostgreSQL, `postgres_cluster` is session-scoped and uses `pgtest.PGTest()`, which spawns a fresh ephemeral cluster per worker on a random port. The CI's `services: postgres` is not what tests connect to; `pgtest` spawns its own (see the sketch after this list).
- So the "another xdist worker resets the shared DB" mechanism in #7347's body ("`aiida_profile_clean` causes flaky test failures under pytest-xdist") and in #7349's commit message ("🐛 Handle `IntegrityError` in `aiida_computer` fixture") can't be the literal cause of the `IntegrityError` we observe. The trace in #7347 shows the existing row's scheduler/transport matched the fixture's defaults exactly, so the four-field `Computer.collection.get` should have hit it but raised `NotExistent` anyway. Within a single sequential worker process, the only mechanism that fits is a SQLAlchemy session-cache stale read across two adjacent statements (or some weird in-flight transaction state). The label suffix in this PR sidesteps the question entirely by making the fixture's row impossible to collide with.
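A minimal sketch of that per-worker PostgreSQL isolation, assuming the `pgtest` package's context-manager API; illustrative, not the fixture's exact code:

```python
from pgtest.pgtest import PGTest

# Each PGTest() spawns its own throwaway cluster on a random port, so two
# xdist workers doing this concurrently never touch the same database.
with PGTest() as cluster:
    print(cluster.dsn)  # connection parameters for this worker's private cluster
```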
I kept the `IntegrityError` catch and its `NotExistent` fallback (originally added by @agoscinski in #7349) as defensive belt-and-braces: the suffix should make this branch unreachable from `aiida_localhost`, but direct callers of the lower-level `aiida_computer` factory could still hit it. A comment in `orm.py` documents the no-retry-loop reasoning and why a lock or `INSERT ... ON CONFLICT` would not be a better fit here.
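A minimal sketch of that get-or-create shape, assuming the `IntegrityError` comes from `sqlalchemy.exc`; the real `aiida_computer` factory differs in detail (it matches on four fields, and its storage backend manages session state):

```python
from sqlalchemy.exc import IntegrityError

from aiida.common.exceptions import NotExistent
from aiida.orm import Computer

def _get_or_create(label: str) -> Computer:
    try:
        # Fast path: the row already exists.
        return Computer.collection.get(label=label)
    except NotExistent:
        pass
    computer = Computer(
        label=label,
        hostname='localhost',
        transport_type='core.local',
        scheduler_type='core.direct',
    )
    try:
        computer.store()
    except IntegrityError:
        # A row with this label appeared between the two statements (or the
        # session showed stale state): fetch the existing row instead.
        # Deliberately no retry loop; a second failure would be a real bug,
        # and the NotExistent it raises should surface loudly.
        return Computer.collection.get(label=label)
    return computer
```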
This PR addresses Category 1 of #7347 only (UNIQUE collisions). Category 2 (worker crashes from reset_storage() mid-test) is a separate concern that needs per-worker profile isolation rather than a label fix; left for follow-up.
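A minimal sketch of the suffix mechanism itself, using xdist's standard `PYTEST_XDIST_WORKER` environment variable and the assumed factory-fixture name `aiida_computer_local`; the PR's actual fixture may differ:

```python
import os

import pytest

@pytest.fixture
def aiida_localhost(aiida_computer_local):
    # 'gw0', 'gw1', ... under xdist; unset without xdist, hence 'master',
    # matching xdist's own worker_id fixture convention.
    worker_id = os.environ.get('PYTEST_XDIST_WORKER', 'master')
    return aiida_computer_local(label=f'localhost-{worker_id}')
```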
Force-pushed from f270550 to c2a6590.
Commit message:

Fixes the `UNIQUE constraint failed: db_dbcomputer.label` flake on `tests-presto` and friends (aiidateam#7347, Category 1). The `aiida_localhost` fixture's row and a literal-`'localhost'` row created by another code path (notably `verdi presto`) ended up in the same profile, and one of two adjacent statements in the fixture's get-or-create (the four-field `Computer.collection.get` followed by `Computer(...).store()`) saw inconsistent state and tripped the single-column UNIQUE on `Computer.label`. Tracing didn't pin the exact within-worker mechanism (likely a SQLAlchemy session-cache stale read after `aiida_profile_clean` reset state); xdist surfaces it more often by widening the test ordering space.

Fix: suffix the fixture's label with the pytest-xdist worker id (`localhost-gw0`, ..., or `localhost-master` outside xdist). The fixture row and any literal-label row coexist under different names, so UNIQUE cannot fire regardless of the timing or cache state. The literal `'localhost'` label remains the production contract for `verdi presto` and is untouched. `'master'` matches xdist's own `worker_id` fixture convention.

Eight tests asserted against the literal `'localhost'` form and broke; they are parameterized on `aiida_localhost.label` so the comparison is exact and stable across workers. `test_code.py`'s `_normalize_code_show_output` strips the worker-id suffix from the fixture's label before the hostname substitution (the suffixed label contains the hostname as a substring), so the four affected golden files keep the user-facing literal `localhost`. The three graphviz tests in `test_graph.py` build their expected strings off `self.computer.label`.

Audit other test sites that bypassed the fixture and created or loaded a literal-`'localhost'` Computer directly, switching them to depend on `aiida_localhost`: `tests/calculations/test_stash.py`, `tests/orm/nodes/data/test_remote.py`'s factory, and the `TestNode` class in `tests/orm/nodes/test_node.py` (`setup_method` becomes a pytest autouse fixture). Literal-`'localhost'` sites in `test_presto.py` and the fixture's own non-collision regression test are intentional. Production code paths that own the literal-`'localhost'` contract (`verdi presto`, `prepare_localhost`, `stash`/`unstash`) are deliberately untouched.

Drive-by: document `aiida_computer`'s `IntegrityError` recovery: why catch-and-retry-`get()` rather than loop-with-backoff, why a defensive `NotExistent` fallback is kept, and why a user-space lock or `INSERT ... ON CONFLICT` would not be a better fit here.

Addresses Category 1 of aiidateam#7347 only. Category 2 (worker crashes from `reset_storage()` mid-test) needs per-worker profile isolation, not a label fix, and is left for follow-up.
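A hypothetical sketch of the `_normalize_code_show_output` adjustment described above; the helper name comes from the commit message, but the signature and body here are illustrative only:

```python
import re

def _normalize_code_show_output(output: str, label: str) -> str:
    """Strip the worker-id suffix before any hostname substitution."""
    # 'localhost-gw0' -> 'localhost'. Do this first: the suffixed label
    # contains the hostname as a substring, so substituting the hostname
    # first would mangle the label and break the golden-file comparison.
    base = re.sub(r'-(gw\d+|master)$', '', label)
    return output.replace(label, base)
```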
edan-bainglass left a comment:
Thanks @GeigerJ2. Tested this on #6990, which had failed a few tests with IntegrityError. Those tests now pass. As stated elsewhere, due to the flaky nature of these tests, we can't with absolute certainty say that this fixed it, but at least from what I can see, the solution here is quite sensible and in theory should resolve the problem consistently.
One comment: perhaps before merging, have a pass over the added inline comments and docs. I found them a bit convoluted. It would be great if we can simplify the language, or at least aim them at our future selves (or AiiDA's future developers). This would be helpful 🙏
Of course, also triple-check anything authored by your agent if you haven't already.
Skimmed through it. Really complex issue. Is it due to some SQLAlchemy cache that is shared between different xdist workers? I am missing, in these long explanations, the resource that the xdist workers share. I thought we were acting on different databases for each worker, so this should not happen. And if we are not acting on different databases for each worker, wouldn't the easier fix be to fix this at the database level rather than at the computer level, so it cannot occur for anything else? I am fine with removing the previous fix; it was only a temporary fix because we did not know the exact cause. If the reason for the bug is that we actually share the database between workers, then this PR should be the real fix and the previous temporary fix can be removed. If we share the database between workers, we should fix that instead. If it is SQLAlchemy weirdly sharing a cache, creating shared state behind the scenes without letting us know, then I am fine with the currently implemented fix.
@agoscinski thanks for pushing on this. I (Claude ^^) went deep into the fixture wiring. tl;dr: workers don't share databases (see below), so the cause is somewhere within a single worker process. My best guess is a SQLAlchemy session-cache stale read. The chain, end to end: `aiida_config` creates a per-worker temp directory under xdist's per-worker `basetemp`, each worker's SQLite file (or ephemeral `pgtest` cluster) lives inside it, and nothing in the fixture stack points two workers at the same storage.
I confirmed this empirically; see the snippet at the bottom of this comment. Drop it into a `conftest.py` and run under xdist.
Distinct profile names, distinct filepaths, all under their own per-worker `basetemp`. That leaves a within-worker mechanism, and this is where I get stuck. For the trace to make sense, the four-field `Computer.collection.get` must have raised `NotExistent` even though a matching row already existed, with the very next statement's `store()` then tripping the UNIQUE constraint. I also went through the surrounding fixture and session code without pinning it down. My best-guess ranking of plausible culprits, none of which I can tie to a specific line of code: a stale read from the SQLAlchemy session cache across those two adjacent statements, or some weird in-flight transaction state left behind by an earlier `aiida_profile_clean` reset.
The label-suffix is still the right fix at the test-fixture level regardless of which of these is firing. It makes the conflict structurally impossible by construction.

To your "fix it at the DB level so it can't occur for anything else": I think that's a reasonable instinct, but it's worth noting that "anything else" is in practice an empty set here. The only fixture in the suite that creates a Computer with a fixed literal label is `aiida_localhost` itself, and the DB-level alternatives would be a user-space lock around the get-or-create, an `INSERT ... ON CONFLICT`, or a schema change. The label-suffix is cheaper than all of those: it doesn't touch production code, doesn't touch the schema, doesn't introduce dialect-specific SQL, and makes the conflict structurally impossible by ensuring the fixture's row and any literal-`'localhost'` row coexist under different names.

On whether to remove #7349: I'd keep it (i.e. not revert it in this PR) as defence-in-depth. The `IntegrityError` catch should be unreachable from `aiida_localhost` after this change, but direct callers of the lower-level `aiida_computer` factory could still hit it.
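The snippet referenced above wasn't preserved in this thread; here is a reconstruction of that kind of check, assuming only stock pytest and pytest-xdist APIs:

```python
import os

import pytest

@pytest.fixture(autouse=True)
def _report_worker_paths(tmp_path_factory):
    # 'gw0', 'gw1', ... under xdist; absent when running without xdist.
    worker = os.environ.get('PYTEST_XDIST_WORKER', 'master')
    # Under xdist each worker gets its own basetemp (<root>/popen-<worker_id>),
    # so any SQLite file created beneath it is private to that worker.
    print(f'{worker}: basetemp={tmp_path_factory.getbasetemp()}')
```

Run with `pytest -n 2 -s` and the printed paths come out distinct per worker.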