🐛 aiida_localhost: suffix label with xdist-worker id#7363

Merged
GeigerJ2 merged 1 commit into aiidateam:main from GeigerJ2:fix/7347/aiida-localhost-fixture-isolation on May 11, 2026

Conversation

Collaborator

@GeigerJ2 GeigerJ2 commented May 5, 2026

commit msg:

🐛 `aiida_localhost`: suffix label with xdist-worker id

Fixes the `UNIQUE constraint failed: db_dbcomputer.label` flake on
`tests-presto` and friends (#7347, Category 1). The `aiida_localhost`
fixture's row and a literal-`'localhost'` row created by another code
path (notably `verdi presto`) ended up in the same profile, and one
of two adjacent statements in the fixture's get-or-create — the four-
field `Computer.collection.get` followed by `Computer(...).store()` —
saw inconsistent state and tripped the single-column UNIQUE on
`Computer.label`. Tracing didn't pin the exact within-worker mechanism
(likely a SQLAlchemy session-cache stale read after `aiida_profile_clean`
reset state); xdist surfaces it more often by widening the test
ordering space.

Fix: suffix the fixture's label with the pytest-xdist worker id
(`localhost-gw0`, ..., or `localhost-master` outside xdist). The
fixture row and any literal-label row coexist under different names,
so UNIQUE cannot fire regardless of the timing or cache state. The
literal `'localhost'` label remains the production contract for
`verdi presto` and is untouched. `'master'` matches xdist's own
`worker_id` fixture convention.

Eight tests asserted against the literal `'localhost'` form and broke;
they are parameterized on `aiida_localhost.label` so the comparison is
exact and stable across workers. `test_code.py`'s
`_normalize_code_show_output` strips the worker-id suffix from the
fixture's label before the hostname substitution (the suffixed label
contains the hostname as a substring), so the four affected golden
files keep the user-facing literal `localhost`. The three graphviz
tests in `test_graph.py` build their expected strings off
`self.computer.label`.

Audit other test sites that bypassed the fixture and created or loaded
a literal-`'localhost'` Computer directly, switching them to depend on
`aiida_localhost`: `tests/calculations/test_stash.py`,
`tests/orm/nodes/data/test_remote.py`'s factory, and the `TestNode`
class in `tests/orm/nodes/test_node.py` (`setup_method` becomes a
pytest autouse fixture). Literal-`'localhost'` sites in
`test_presto.py` and the fixture's own non-collision regression test
are intentional. Production code paths that own the literal-
`'localhost'` contract (`verdi presto`, `prepare_localhost`,
`stash`/`unstash`) are deliberately untouched.

Drive-by: document `aiida_computer`'s `IntegrityError` recovery —
why catch-and-retry-`get()` rather than loop-with-backoff, why a
defensive `NotExistent` fallback is kept, and why a user-space lock
or `INSERT ... ON CONFLICT` would not be a better fit here.

Addresses Category 1 of #7347 only. Category 2 (worker crashes from
`reset_storage()` mid-test) needs per-worker profile isolation, not a
label fix, and is left for follow-up.
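
For orientation, a minimal sketch of the fixture change described above. It assumes, as in aiida-core's pytest fixtures, that aiida_localhost wraps the aiida_computer_local factory, and that pytest-xdist is installed so its worker_id fixture is available; the merged implementation may differ in detail.

import pytest


@pytest.fixture
def aiida_localhost(aiida_computer_local, worker_id):
    """Return a localhost Computer whose label carries the xdist worker id.

    worker_id is 'gw0', 'gw1', ... under xdist and 'master' otherwise, so the
    fixture's row can never collide with a literal-'localhost' row.
    """
    return aiida_computer_local(label=f'localhost-{worker_id}')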

@GeigerJ2 GeigerJ2 linked an issue May 5, 2026 that may be closed by this pull request: aiida_profile_clean causes flaky test failures under pytest-xdist (#7347)

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.28%. Comparing base (6a6f66a) to head (c2a6590).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7363      +/-   ##
==========================================
- Coverage   80.28%   80.28%   -0.00%     
==========================================
  Files         577      577              
  Lines       45543    45545       +2     
==========================================
+ Hits        36560    36561       +1     
- Misses       8983     8984       +1     

☔ View full report in Codecov by Sentry.

@GeigerJ2 GeigerJ2 force-pushed the fix/7347/aiida-localhost-fixture-isolation branch from cab2415 to 4c514a4 on May 6, 2026 13:15
@GeigerJ2 GeigerJ2 marked this pull request as ready for review May 6, 2026 15:27
@GeigerJ2 GeigerJ2 force-pushed the fix/7347/aiida-localhost-fixture-isolation branch 2 times, most recently from 268b5a2 to f270550 on May 7, 2026 11:47
Comment thread: tests/orm/nodes/test_node.py (Outdated)
Comment on lines 27 to 29
@pytest.fixture(autouse=True)
def _setup(self, aiida_localhost):
"""Setup for methods."""
Collaborator Author

Switched from a unittest-style setup_method to a pytest autouse fixture because we need to inject aiida_localhost. setup_method can't receive fixtures as parameters, and the previous body was creating a literal-'localhost' Computer directly via Computer.collection.get_or_create(...), which is exactly the kind of fixture-bypass the rest of this PR is auditing away. With the worker-suffixed aiida_localhost, that direct creation would leave a stray literal-'localhost' row alongside the fixture's localhost-{worker_id} row.

Renamed to _setup (rather than keeping setup_method) because setup_method is pytest's xunit-style class setup hook — it's invoked directly with signature (self, method). If we kept that name and added @pytest.fixture(autouse=True), pytest's xunit dispatch would still try to call it as setup_method(self, method), which mismatches our (self, aiida_localhost) signature. The leading underscore is purely a convention to flag it as an internal autouse fixture, not meant to be referenced by name.
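
Since the setup_method-vs-fixture distinction trips people up, here is the shape of the pattern as a sketch only; the attribute name is illustrative, not the real body:

import pytest


class TestNode:
    @pytest.fixture(autouse=True)
    def _setup(self, aiida_localhost):
        """Run before every test in the class; unlike setup_method, it can receive fixtures."""
        # Illustrative attribute: the real body keeps whatever state
        # setup_method used to build via Computer.collection.get_or_create(...).
        self.computer = aiida_localhost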

  label='dummy_code',
  default_calc_job_plugin='core.stash',
- computer=orm.load_computer(label='localhost'),
+ computer=aiida_localhost,
Collaborator Author

As a note here, also relevant to the other changes to tests here:
The audit was deliberately scoped to test code only. Production sites that create or load a literal-'localhost' Computer are intentionally untouched: src/aiida/cmdline/commands/cmd_presto.py, src/aiida/cmdline/commands/cmd_devel.py (prepare_localhost), src/aiida/calculations/stash.py, and src/aiida/calculations/unstash.py all stay as-is. Those define the user-facing contract that "the localhost computer's label is 'localhost'", which is what users see after running verdi presto. Routing them through aiida_localhost would either change that contract or require a fixture-aware code path in production, neither of which we want. The fix here lives entirely in the test fixture so production behaviour is unchanged.

(Side note unrelated to this PR: cmd_devel.py:271 has a stale docstring referencing aiida.engine.launch_shell_job, which doesn't exist in the codebase — prepare_localhost is the actual function. Worth a one-line follow-up but out of scope here.)

@@ -114,10 +123,23 @@ def _build() -> 'Computer':
try:
computer.store()
except IntegrityError:
Collaborator Author

This is still literal Claude output that I have not verified, but it seems sensible:

While putting this fix together I traced the test fixture wiring end-to-end to verify the framing in #7347 (and in @agoscinski's #7349). Two things worth knowing:

  1. Workers don't share databases in the test setup. aiida_config uses tmp_path_factory.mktemp(secrets.token_hex(16)); under xdist each worker has its own basetemp (<basetemp>/popen-<worker_id>, set by xdist/workermanage.py:339-341) and its own per-process random state, so each worker's SQLite file lives at a unique path. For PostgreSQL, postgres_cluster is session-scoped and uses pgtest.PGTest(), which spawns a fresh ephemeral cluster per worker on a random port. The CI's services: postgres is not what tests connect to — pgtest spawns its own.

  2. So the "another xdist worker resets the shared DB" mechanism in #7347's body and in #7349's commit message can't be the literal cause of the IntegrityError we observe. The trace in #7347 shows the existing row's scheduler/transport matched the fixture's defaults exactly, so the four-field Computer.collection.get should have hit it but raised NotExistent anyway. Within a single sequential worker process, the only mechanism that fits is a SQLAlchemy session-cache stale read across two adjacent statements (or some weird in-flight transaction state). The label suffix in this PR sidesteps the question entirely by making the fixture's row impossible to collide with.

I kept the IntegrityError catch and its NotExistent fallback (originally added by @agoscinski in #7349) as defensive belt-and-braces — the suffix should make this branch unreachable from aiida_localhost, but direct callers of the lower-level aiida_computer factory could still hit it. Comment in orm.py documents the no-retry-loop reasoning and why a lock or INSERT ... ON CONFLICT would not be a better fit here.
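
A condensed sketch of the recovery shape being described; label, computer, and the surrounding factory scaffolding are assumed, and the real code lives in the aiida_computer factory:

from sqlalchemy.exc import IntegrityError

from aiida.common.exceptions import NotExistent
from aiida.orm import Computer

try:
    computer.store()
except IntegrityError:
    # Another code path stored a computer with this label between our get()
    # and store(): load the existing row once rather than retrying in a loop.
    try:
        computer = Computer.collection.get(label=label)
    except NotExistent:
        # Defensive: the row vanished again; surface the failure rather than spin.
        raise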

This PR addresses Category 1 of #7347 only (UNIQUE collisions). Category 2 (worker crashes from reset_storage() mid-test) is a separate concern that needs per-worker profile isolation rather than a label fix; left for follow-up.

@GeigerJ2 GeigerJ2 force-pushed the fix/7347/aiida-localhost-fixture-isolation branch from f270550 to c2a6590 on May 7, 2026 19:52
@GeigerJ2 GeigerJ2 requested a review from edan-bainglass May 7, 2026 19:53
Member

@edan-bainglass edan-bainglass left a comment

Thanks @GeigerJ2. Tested this on #6990, which had failed a few tests with IntegrityError. Those tests now pass. As stated elsewhere, due to the flaky nature of these tests, we can't with absolute certainty say that this fixed it, but at least from what I can see, the solution here is quite sensible and in theory should resolve the problem consistently.

One comment: perhaps before merging, have a pass over the added inline comments and docs. I found them a bit convoluted. It would be great if we can simplify the language, or at least try to aim them at our future selves (or AiiDA's future developers). This would be helpful 🙏

Of course, also triple-check anything authored by your agent if you haven't already.

Collaborator

agoscinski commented May 8, 2026

Skimmed through it. Really complex issue. Is it due to some SQLAlchemy cache that is shared between different xdist workers? I am missing in this long explanation the resource that the xdist workers share. I thought we were acting on different databases for each worker, so this should not happen? And if we are not acting on different databases for each worker, wouldn't the easier fix be to fix this at the database level rather than the computer level, so it cannot occur for anything else?

I am fine with removing the previous fix; it was only a temporary fix because we did not know the exact cause. If the reason for the bug is that we actually share the database between workers, then this PR should be the real fix and the previous temporary fix can be removed. But if we do share the database between workers, we should fix that instead. If it is SQLAlchemy sharing a cache and creating shared state behind the scenes without letting us know, then I am fine with the currently implemented fix.

Collaborator Author

GeigerJ2 commented May 11, 2026

@agoscinski thanks for pushing on this. I (Claude ^^) went deep into the SQLAlchemy and pytest-xdist sources to try to actually nail the mechanism, and the honest answer is I (Claude ^^) can't pin it down from static analysis alone. Let me lay out what I'm now confident about vs. what I had to give up on.

tl;dr: workers don't share databases (see below), so the cause is somewhere within a single worker process. My best guess is SQLite/fsync timing on busy CI runners or a session-state edge case in the _clear() + reset_profile() + lazy-reinit boundary, but static analysis didn't pin it. The label-suffix avoids the question rather than answering it, but I think that's acceptable because a deterministic DB-level fix wouldn't give us materially more than what we already have.

The chain, end to end:

  • pytest-xdist's workermanage.py:337-341 launches each worker with basetemp = <main_basetemp>/popen-<gateway.id> (so popen-gw0, popen-gw1, ...). Each worker is a separate Python process running its own pytest session.
  • aiida-core's tests/conftest.py:279 overrides the upstream aiida_profile fixture, building storage_config via config_sqlite_dos() (aiida/tools/pytest_fixtures/storage.py:120-121), which just returns {'filepath': str(tmp_path_factory.mktemp('test_sqlite_dos_storage'))}.
  • tmp_path_factory is itself rooted at the per-worker basetemp set by xdist, so each worker's mktemp(...) returns a directory under popen-<gateway.id>/.
  • The session-scoped aiida_profile and aiida_config fixtures fire once per worker process (each worker has its own session), so each worker generates its own profile name and its own storage dir.

I confirmed this empirically — see the snippet at the bottom of this comment. Drop it into tests/ and run with -n 4 --db-backend sqlite. The test body is parametrized 4 times so xdist hands one item to each of the 4 workers; each invocation appends a line to /tmp/aiida_xdist_demo.log (a file, because xdist captures stdout per worker by default, even with -s). After the run, sort /tmp/aiida_xdist_demo.log shows one row per worker, like so:

worker  pid     profile_name (random hex)         storage filepath
gw0     410608  01f3437cc7de7f9903e2a501f8b3a891  /tmp/pytest-of-geiger_j/pytest-24/popen-gw0/test_sqlite_dos_storage0
gw1     410611  8ada8cf42f2ebfc71caa7384adfcb310  /tmp/pytest-of-geiger_j/pytest-24/popen-gw1/test_sqlite_dos_storage0
gw2     410617  46ad443996ced887666d42b3eb7ff999  /tmp/pytest-of-geiger_j/pytest-24/popen-gw2/test_sqlite_dos_storage0
gw3     410620  5da390af9972fd6bb6615b3c52e258d1  /tmp/pytest-of-geiger_j/pytest-24/popen-gw3/test_sqlite_dos_storage0

Distinct profile names, distinct filepaths, all under their own popen-gwN/. For PostgreSQL the postgres_cluster fixture spawns a fresh pgtest.PGTest() per worker on a random port, so the CI's services: postgres is not what tests connect to, it's a per-worker ephemeral cluster. The failing trace in #7347 is consistent with this: workdir is /tmp/pytest-of-runner/pytest-0/popen-gw0/test_get_by_id0, i.e. gw0's own basetemp. So the "all xdist workers share the same database" framing in the issue body is wrong, and the multi-worker TOCTOU mechanism in #7349's commit message can't be the literal cause.

That leaves a within-worker mechanism, and this is where I get stuck. For the trace to make sense (the get(label='localhost', hostname='localhost', scheduler='core.direct', transport='core.local') raises NotExistent, then the immediately-following Computer(...).store() fails UNIQUE on label), there has to be a row with label='localhost' at INSERT time but not at SELECT time. Within a single sequential pytest worker process there's no concurrency in user-space code. I checked all the literal-'localhost' Computer creators in the suite (test_node.py::TestNode::setup_method pre-PR, test_remote.py's factory pre-PR, verdi presto, prepare_localhost in cmd_devel.py) and they all use the exact same scheduler/transport defaults, so if the row exists at INSERT, the four-field get() should also hit it. There's no filter-mismatch path I can find that would explain the inconsistency.

I also went through SQLAlchemy's session model carefully to rule out the obvious cache-state suspects. Session.close() doesn't actually remove the session from the scoped_session registry (session.py:2546-2553 is explicit: "the Session itself does not actually have a distinct closed state"), so the close-then-expunge sequence in _clear() is fine. _clear() does DELETE FROM <table> per reflected table, not DROP TABLE (migrator.py:325-346), so the schema stays intact and aiida_profile_clean leaves a clean empty schema. scoped_session is thread-local via ThreadLocalRegistry (util/_collections.py:630-645), but unless something is spawning threads I'm not seeing, that shouldn't matter for these tests. No smoking gun in the session lifecycle.
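
A minimal illustration of that registry point, in plain SQLAlchemy (nothing aiida-specific):

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Session = scoped_session(sessionmaker(create_engine('sqlite://')))
s1 = Session()
s1.close()       # no distinct "closed" state; the session object stays usable
s2 = Session()
assert s1 is s2  # close() did not evict it from the scoped registry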

My best-guess ranking of plausible culprits, none of which I can pin to a specific line of code:

  1. SQLite write-visibility timing under busy CI I/O. GitHub runners' shared disks have unpredictable fsync latencies. A SELECT issued just after a COMMIT could plausibly miss the row if the COMMIT hasn't fully landed in the SQLite reader's snapshot view. We don't set WAL mode, so the default rollback-journal isolation is in play with its known edge cases around BEGIN/COMMIT timing (see the sketch after this list).
  2. Session-state edge case at the _clear() → reset_profile() → lazy-reinit boundary. This is where the trace points (the get() raises NotExistent immediately after an aiida_profile_clean-using test ran), and it's also where the most session/engine state is being thrown away and rebuilt. Static analysis doesn't surface an obvious bug, but narrow-window bugs can hide in handoff code like this.
  3. An xdist load-distribution test ordering specific to CI. The default load distribution is non-deterministic across runs and across runners, so a specific sequence might tickle a state we wouldn't naturally hit otherwise.
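
The WAL sketch referenced in point 1, purely illustrative of a possible mitigation rather than something this PR does; the engine URL is made up:

from sqlalchemy import create_engine, event

engine = create_engine('sqlite:////tmp/example_storage.db')  # illustrative path


@event.listens_for(engine, 'connect')
def _enable_wal(dbapi_connection, connection_record):
    # WAL gives readers a consistent snapshot view without the rollback-journal
    # BEGIN/COMMIT timing edges mentioned above.
    cursor = dbapi_connection.cursor()
    cursor.execute('PRAGMA journal_mode=WAL')
    cursor.close()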

The label-suffix is still the right fix at the test-fixture level regardless of which of these is firing. It makes the conflict structurally impossible by construction.

To your "fix it at the DB level so it can't occur for anything else": I think that's a reasonable instinct, but it's worth noting that "anything else" is in practice an empty set here. The only fixture in the suite that uses a literal label is aiida_localhost itself; every other call into aiida_computer* defaults to a random uuid.uuid4().hex, so by construction those can't collide on UNIQUE(label). So the "computer-level fix" objection is mostly hypothetical: nothing else in the suite is at risk in the first place. Of the DB-level options, INSERT ... ON CONFLICT DO NOTHING RETURNING would work and is what I'd reach for if I were rewriting this from scratch, but it bypasses Computer(...).store(), so we'd lose the AiiDA-side defaults and validation. Removing the UNIQUE constraint on label is a non-starter (users rely on labels being unique identifiers for verdi / load_computer). A filelock per profile would serialise but is heavy machinery for a test fixture.
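
For concreteness, the ON CONFLICT alternative might look roughly like this in SQLAlchemy; DbComputer and session are assumed names, and note that it bypasses Computer(...).store() and AiiDA's validation entirely:

from sqlalchemy import select
from sqlalchemy.dialects.sqlite import insert  # PostgreSQL offers the same construct

stmt = (
    insert(DbComputer)  # DbComputer: the mapped computer table (assumed name)
    .values(label='localhost', hostname='localhost')
    .on_conflict_do_nothing(index_elements=['label'])
    .returning(DbComputer.id)
)
pk = session.execute(stmt).scalar()  # None when the row already existed
if pk is None:
    # Lost the race (or the row pre-existed): fetch the surviving row instead.
    pk = session.execute(
        select(DbComputer.id).where(DbComputer.label == 'localhost')
    ).scalar_one()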

The label-suffix is cheaper than all of those: doesn't touch production code, doesn't touch the schema, doesn't introduce dialect-specific SQL, makes the conflict structurally impossible by ensuring the fixture's row and any literal-'localhost' row coexist under different names.

On whether to remove #7349: I'd keep it (i.e. not revert it in this PR) as defence-in-depth. The IntegrityError + NotExistent fallback is dead code for any aiida-core test using aiida_localhost (the suffix makes that branch unreachable), but it still catches the case where a direct caller of the lower-level aiida_computer factory passes a literal label, e.g. a third-party plugin test doing aiida_computer_local(label='localhost') — which can happen in third-party plugin test suites we don't control. Reverting now buys us nothing concrete and re-opens that hole; keeping costs zero (it's already merged and tested in tests/tools/pytest_fixtures/test_orm.py).
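
The third-party shape that keeps the fallback relevant, as a hypothetical plugin test:

def test_plugin_with_literal_label(aiida_computer_local):
    # A plugin suite we don't control can still pass a literal label, which
    # reintroduces exactly the collision surface the suffix removes for
    # aiida_localhost, so the IntegrityError branch stays useful.
    computer = aiida_computer_local(label='localhost')
    assert computer.label == 'localhost'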

test_xdist_isolation_demo.py — drop into tests/ and run with -n 4 --db-backend sqlite
"""Show that pytest-xdist workers each get their own AiiDA profile and SQLite DB.

Drop into tests/ (so the AiiDA pytest plugin is active) and run:

    .venv/bin/python -m pytest tests/test_xdist_isolation_demo.py \
        -n 4 -p no:cacheprovider --db-backend sqlite

Each test invocation appends a line to /tmp/aiida_xdist_demo.log (we write to a
file because xdist captures stdout per worker by default).

After the run, eyeball with:

    sort /tmp/aiida_xdist_demo.log

You should see one line per worker_id (gw0, gw1, ...), each with a distinct
profile_name and a distinct filepath. If you see all workers reporting the
same path, the isolation argument is wrong.
"""

from __future__ import annotations

import os
from pathlib import Path

import pytest

LOG_PATH = Path('/tmp/aiida_xdist_demo.log')


@pytest.fixture(scope='session', autouse=True)
def _ensure_log_exists():
    LOG_PATH.touch(exist_ok=True)
    yield


@pytest.mark.parametrize('iteration', range(4))
def test_per_worker_storage_isolation(aiida_profile, worker_id, iteration):
    """Log this worker's profile + storage path. Different workers should differ."""
    line = (
        f'[{worker_id} pid={os.getpid():>6}] '
        f'profile_name={aiida_profile.name} '
        f'storage_backend={aiida_profile.storage_backend} '
        f'filepath={aiida_profile.storage_config.get("filepath", "<n/a>")}\n'
    )
    with LOG_PATH.open('a') as f:
        f.write(line)

    if aiida_profile.storage_backend == 'core.sqlite_dos':
        storage_dir = Path(aiida_profile.storage_config['filepath'])
        assert storage_dir.is_dir(), f'storage dir missing for {worker_id}: {storage_dir}'

@GeigerJ2 GeigerJ2 merged commit 2068a7d into aiidateam:main May 11, 2026
18 checks passed
@GeigerJ2 GeigerJ2 deleted the fix/7347/aiida-localhost-fixture-isolation branch May 11, 2026 09:33