aiida_profile_clean causes flaky test failures under pytest-xdist #7347

@agoscinski

Might be slop but I am just documenting all flaky CI behavior. This one does not have an easy fix.

https://github.com/aiidateam/aiida-core/actions/runs/25106933128/job/73570476035?pr=7284

_______________________ ERROR at setup of test_get_by_id _______________________
[gw0] linux -- Python 3.10.20 /home/runner/work/aiida-core/aiida-core/.venv/bin/python3
src/aiida/tools/pytest_fixtures/orm.py:100: in factory
    computer = Computer.collection.get(
src/aiida/orm/entities.py:143: in get
    return res.one()[0]
src/aiida/orm/querybuilder.py:1152: in one
    raise NotExistent('No result was found')
E   aiida.common.exceptions.NotExistent: No result was found

During handling of the above exception, another exception occurred:
.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1965: in _exec_single_context
    self.dialect.do_execute(
.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py:921: in do_execute
    cursor.execute(statement, parameters)
E   sqlite3.IntegrityError: UNIQUE constraint failed: db_dbcomputer.label

The above exception was the direct cause of the following exception:
src/aiida/storage/psql_dos/orm/utils.py:119: in save
    self.session.commit()
.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:1923: in commit
    trans.commit(_to_root=True)
<string>:2: in commit
    ???
.venv/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py:139: in _go
    ret_value = fn(self, *arg, **kw)
.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:1239: in commit
    self._prepare_impl()
<string>:2: in _prepare_impl
    ???
.venv/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py:139: in _go
    ret_value = fn(self, *arg, **kw)
.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:1214: in _prepare_impl
    self.session.flush()
.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:4179: in flush
    self._flush(objects)
.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:4314: in _flush
    with util.safe_reraise():
.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py:147: in __exit__
    raise exc_value.with_traceback(exc_tb)
.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:4275: in _flush
    flush_context.execute()
.venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py:466: in execute
    rec.execute(self)
.venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py:642: in execute
    util.preloaded.orm_persistence.save_obj(
.venv/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py:93: in save_obj
    _emit_insert_statements(
.venv/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py:1226: in _emit_insert_statements
    result = connection.execute(
.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1412: in execute
    return meth(
.venv/lib/python3.10/site-packages/sqlalchemy/sql/elements.py:515: in _execute_on_connection
    return connection._execute_clauseelement(
.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1635: in _execute_clauseelement
    ret = self._execute_context(
.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1844: in _execute_context
    return self._exec_single_context(
.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1984: in _exec_single_context
    self._handle_dbapi_exception(
.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:2339: in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1965: in _exec_single_context
    self.dialect.do_execute(
.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py:921: in do_execute
    cursor.execute(statement, parameters)
E   sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: db_dbcomputer.label
E   [SQL: INSERT INTO db_dbcomputer (uuid, label, hostname, description, scheduler_type, transport_type, metadata) VALUES (?, ?, ?, ?, ?, ?, ?)]
E   [parameters: ('aa6d37d0-edd5-4661-a4ed-97e5bd90271a', 'localhost', 'localhost', '', 'core.direct', 'core.local', '{"workdir": "/tmp/pytest-of-runner/pytest-0/popen-gw0/test_get_by_id0"}')]
E   (Background on this error at: https://sqlalche.me/e/20/gkpj)

and here
https://github.com/aiidateam/aiida-core/actions/runs/25108262868/job/73586466642?pr=7284

...................[gw0] node down: Not properly terminated
F
_____________________ tests/storage/psql_dos/test_query.py _____________________
[gw0] linux -- Python 3.10.20 /home/runner/work/aiida-core/aiida-core/.venv/bin/python3
worker 'gw0' crashed while running 'tests/storage/psql_dos/test_query.py::test_qb_ordering_limits_offsets_sqla'

replacing crashed worker gw0

and also the REST API error might be related to this
https://github.com/aiidateam/aiida-core/actions/runs/25117114340/job/73607205380

F
_________________________ tests/restapi/test_routes.py _________________________
[gw0] linux -- Python 3.14.4 /home/runner/work/aiida-core/aiida-core/.venv/bin/python3
worker 'gw0' crashed while running 'tests/restapi/test_routes.py::TestRestApi::test_computers_orderby_schedulertype_desc'

replacing crashed worker gw0

init_profile is an autouse=True fixture that uses aiida_profile_clean, so every test method in TestRestApi triggers reset_storage() on the shared database. Under xdist, this crashes workers.

Summary

Several tests fail intermittently under pytest-xdist due to a race condition when aiida_profile_clean resets the shared database while other workers are actively using it. This manifests as both IntegrityError failures and outright worker crashes.

Observed in: https://github.com/aiidateam/aiida-core/actions/runs/25106933128/job/73570476092?pr=7284

Failure Category 1: UNIQUE constraint violation (7 tests)

Affected tests (all in tests/cmdline/params/types/test_code.py):

  • test_get_by_id
  • test_get_by_uuid
  • test_get_by_label
  • test_get_by_fullname
  • test_ambiguous_label_pk
  • test_ambiguous_label_uuid
  • test_entry_point_validation

Affected jobs: test minimum reqs (3.10, sqlite), tests-presto

Error:

src/aiida/tools/pytest_fixtures/orm.py:100: in factory
    computer = Computer.collection.get(
E   aiida.common.exceptions.NotExistent: No result was found

During handling of the above exception, another exception occurred:

E   aiida.common.exceptions.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: db_dbcomputer.label
E   [SQL: INSERT INTO db_dbcomputer ... VALUES (..., 'localhost', ...)]

Root cause:

The aiida_profile fixture is session-scoped, so all xdist workers share the same database. test_shell_complete (line 127) uses aiida_profile_clean, which calls reset_storage() and wipes the entire database. This creates a race condition in the aiida_computer fixture (src/aiida/tools/pytest_fixtures/orm.py:100-111):

  1. Worker A resets the database via aiida_profile_clean
  2. Worker A calls Computer.collection.get(label='localhost', ...) -> NotExistent
  3. Worker B (or another fixture in the same worker) also calls .get(...) -> NotExistent
  4. Worker A creates and stores the computer -> succeeds
  5. Worker B tries to create the same computer -> IntegrityError on the label unique constraint
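
The interleaving above can be reproduced outside AiiDA with a minimal sketch. It assumes a throwaway SQLite file and a hypothetical `computer` table with a unique `label` column standing in for db_dbcomputer. The two connections play the roles of Worker A and Worker B, and the steps run sequentially to force the problematic ordering. This only illustrates the check-then-insert pattern; it is not the actual fixture code.

```python
import os
import sqlite3
import tempfile


def is_missing(conn: sqlite3.Connection, label: str) -> bool:
    """Return True if no row with this label exists (the '.get() -> NotExistent' step)."""
    row = conn.execute("SELECT 1 FROM computer WHERE label = ?", (label,)).fetchone()
    return row is None


def simulate_race() -> str:
    """Replay steps 1-5 with two connections to one shared SQLite file."""
    db_path = os.path.join(tempfile.mkdtemp(), "shared.sqlite")
    worker_a = sqlite3.connect(db_path)
    worker_b = sqlite3.connect(db_path)
    try:
        # Step 1: an empty table stands in for the freshly reset database.
        worker_a.execute("CREATE TABLE computer (label TEXT UNIQUE)")
        worker_a.commit()

        # Steps 2 and 3: both workers check before either one has inserted.
        a_missing = is_missing(worker_a, "localhost")
        b_missing = is_missing(worker_b, "localhost")

        # Step 4: Worker A creates and stores the computer.
        if a_missing:
            worker_a.execute("INSERT INTO computer (label) VALUES (?)", ("localhost",))
            worker_a.commit()

        # Step 5: Worker B acts on its stale check and hits the unique constraint.
        if b_missing:
            try:
                worker_b.execute("INSERT INTO computer (label) VALUES (?)", ("localhost",))
                worker_b.commit()
            except sqlite3.IntegrityError as exc:
                return str(exc)
        return "no conflict"
    finally:
        worker_a.close()
        worker_b.close()


if __name__ == "__main__":
    print(simulate_race())
```

The key point is that the existence check and the insert are separate statements, so any writer that commits between them invalidates the check.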

Failure Category 2: Worker crashes (5 tests)

In the same CI run, gw0 crashed ("node down: Not properly terminated") during five different tests across five jobs, with no Python traceback:

Job                      Test running when worker crashed
tests-presto             tests/tools/archive/orm/test_comments.py::test_exclude_comments_flag
tests (3.14, psql, rmq)  tests/restapi/test_routes.py::TestRestApi::test_computers_orderby_schedulertype_desc
tests (3.10, psql, zmq)  tests/storage/psql_dos/test_backend.py::test_get_info
tests (3.14, psql, zmq)  tests/orm/implementation/test_nodes.py::TestBackendNode::test_clear_attributes
tests (3.10, psql, rmq)  tests/restapi/test_identifiers.py::test_full_type_unregistered[WorkFunctionNode]

These are likely caused by the same reset_storage() race corrupting database state mid-operation in another worker, leading to segfaults or fatal errors in the storage/ORM layer.

Possible fixes

  1. Give each xdist worker its own profile/database -- eliminates all shared-state races but increases resource usage.
  2. Avoid aiida_profile_clean in xdist runs -- use unique labels/data per test instead of wiping the database.
  3. Handle IntegrityError in aiida_computer fixture -- catch the error and retry with a .get() call:
    try:
        computer = Computer(label=label, ...).store()
    except IntegrityError:
        computer = Computer.collection.get(label=label)
    This only addresses Category 1 and doesn't fix the worker crashes.
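
The insert-then-fall-back idea in option 3 can be shown in isolation with plain sqlite3. This is a rough sketch under assumptions: the `computer` table and the `get_or_create` helper are hypothetical stand-ins, and the real fixture would go through the AiiDA ORM instead of raw SQL.

```python
import sqlite3


def get_or_create(conn: sqlite3.Connection, label: str) -> tuple:
    """Try the insert first; if the label already exists, fall back to a lookup."""
    try:
        # The connection context manager commits on success and rolls back on error.
        with conn:
            conn.execute("INSERT INTO computer (label) VALUES (?)", (label,))
    except sqlite3.IntegrityError:
        # Another writer created the row first; reuse it instead of failing.
        pass
    return conn.execute(
        "SELECT rowid, label FROM computer WHERE label = ?", (label,)
    ).fetchone()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE computer (label TEXT UNIQUE)")
    print(get_or_create(conn, "localhost"))
    print(get_or_create(conn, "localhost"))  # same row, no IntegrityError raised
```

Attempting the insert first and treating the constraint violation as "already exists" avoids the gap between check and insert. It still assumes that nobody wipes the table between the failed insert and the follow-up lookup.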

Option 1 is the most robust since it eliminates the root cause for both failure categories.
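
One possible building block for option 1 is deriving a distinct name per worker. pytest-xdist sets the PYTEST_XDIST_WORKER environment variable in each worker process (for example gw0), so a helper could append it to a base name. The function name and the base name below are hypothetical, and how the result would be wired into AiiDA's profile or database creation is not shown here.

```python
import os
from typing import Mapping, Optional


def per_worker_name(base: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a name that is unique per xdist worker.

    Falls back to the suffix 'master' when the variable is unset, i.e. when
    the tests are not distributed by pytest-xdist.
    """
    env = os.environ if env is None else env
    worker = env.get("PYTEST_XDIST_WORKER", "master")
    return f"{base}_{worker}"


if __name__ == "__main__":
    # Hypothetical base name; each worker would get its own database or profile.
    print(per_worker_name("test_profile"))
```

With names like test_profile_gw0 and test_profile_gw1, a reset in one worker would no longer touch the data another worker is using.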

Labels: maintainer only, type/bug
