🐛 Handle IntegrityError in aiida_computer fixture #7349
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #7349      +/-   ##
==========================================
+ Coverage   80.25%   80.27%    +0.02%
==========================================
  Files         577      577
  Lines       45498    45507        +9
==========================================
+ Hits        36512    36525       +13
+ Misses       8986     8982        -4
```

☔ View full report in Codecov by Sentry.
Force-pushed from 55a300e to 7c29c05 (Compare)
Force-pushed from 4055a54 to 6ada13b (Compare)
This fix still seems to be needed to solve some of the flaky behavior I observed in #7323; see CI run https://github.com/aiidateam/aiida-core/actions/runs/25172162526/job/73794412601?pr=7323 (the flakiness was fixed when adding this commit and reappeared when removing it).
Force-pushed from 6ada13b to e0a6e48 (Compare)
Force-pushed from e0a6e48 to eb24f4b (Compare)
When `aiida_profile_clean` resets the database mid-session, subsequent calls to the `aiida_computer` fixture can hit a TOCTOU race: `get()` finds no matching computer, but `store()` fails with an IntegrityError because another fixture call already recreated it. Catch the error and fall back to `get()` by label.
Drop the redundant `SqlaIntegrityError` catch. The storage layer (`ModelWrapper.save()`) already converts SQLAlchemy's `IntegrityError` to `aiida.common.exceptions.IntegrityError` before it reaches caller code, so catching the raw SQLAlchemy exception adds a needless dependency on storage internals. Also match the fallback `get()` on all four fields (label, hostname, scheduler_type, transport_type) instead of label alone, consistent with the original `get()` call.
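A minimal sketch of the resulting get-or-create flow, assuming the fixture resolves lookups through `Computer.collection.get` and that the four identifying fields are passed in as shown (names and the surrounding fixture plumbing are simplified, not the actual fixture code):

```python
from aiida import orm
from aiida.common import exceptions


def _get_or_create_computer(label, hostname, scheduler_type, transport_type, workdir='/tmp'):
    """Return a matching ``Computer``, tolerating the TOCTOU race described above."""
    filters = {
        'label': label,
        'hostname': hostname,
        'scheduler_type': scheduler_type,
        'transport_type': transport_type,
    }
    try:
        return orm.Computer.collection.get(**filters)
    except exceptions.NotExistent:
        try:
            return orm.Computer(workdir=workdir, **filters).store()
        except exceptions.IntegrityError:
            # Another fixture call stored an identical computer between our
            # ``get()`` and ``store()``; reuse its row, matching on all four fields.
            return orm.Computer.collection.get(**filters)
```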
Force-pushed from fddf7b6 to 9e1d96f (Compare)
Add a test that exercises the except-IntegrityError branch in the `aiida_computer` fixture. It mocks `get()` to miss and `store()` to raise `IntegrityError`, verifying that the fallback `get()` returns the correct computer.
The previous IntegrityError fallback had two latent issues:

1. ``set_minimum_job_poll_interval`` and ``set_default_mpiprocs_per_machine`` lived in an ``else:`` branch and were skipped on the fallback path, silently dropping caller-provided values when a race fired.
2. The fallback ``get()`` could itself raise ``NotExistent`` if the racing worker's row was already gone, and that propagated as a confusing exception rather than triggering a fresh ``store()``.

Move the setters out of ``else:`` so they always run, and add a build-and-store fallback for the rare inner ``NotExistent``. Factor the ``Computer(...)`` construction into a small ``_build()`` closure to avoid duplication.
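Roughly, the refactored control flow could look like the following sketch (variable names and the fixture plumbing are simplified; only the branch structure mirrors the description above):

```python
from aiida import orm
from aiida.common import exceptions


def _get_or_create_computer(label, hostname, scheduler_type, transport_type, workdir,
                            minimum_job_poll_interval, default_mpiprocs_per_machine):
    filters = {
        'label': label,
        'hostname': hostname,
        'scheduler_type': scheduler_type,
        'transport_type': transport_type,
    }

    def _build():
        # Single place that constructs the node, shared by both store paths.
        return orm.Computer(workdir=workdir, **filters)

    try:
        computer = orm.Computer.collection.get(**filters)
    except exceptions.NotExistent:
        try:
            computer = _build().store()
        except exceptions.IntegrityError:
            try:
                # Common race: a concurrent fixture call stored the row first; reuse it.
                computer = orm.Computer.collection.get(**filters)
            except exceptions.NotExistent:
                # Rare double miss: the racing row is gone again; rebuild and store fresh.
                computer = _build().store()

    # The setters run on every path, so caller-provided values are never dropped.
    computer.set_minimum_job_poll_interval(minimum_job_poll_interval)
    computer.set_default_mpiprocs_per_machine(default_mpiprocs_per_machine)

    return computer
```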
Collapse the recovery test into a single test parametrized over both branches of the inner ``except IntegrityError``:

- ``fallback_get_succeeds``: the fallback ``get()`` finds the racing worker's row and returns it (the common case).
- ``fallback_get_misses_rebuilds``: the fallback ``get()`` also raises ``NotExistent`` because the racing worker's row vanished; the fixture rebuilds and stores fresh.

The second branch was previously uncovered. Adding it closes the patch-coverage gap from the recent ``orm.py`` change and exercises the inner ``except NotExistent`` block that drives the rebuild path.
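A sketch of how such a parametrized test could be structured, assuming the fixture resolves lookups through `Computer.collection.get` and stores through `Computer.store` (the exact patch targets and the `aiida_computer(label=...)` factory signature depend on the fixture internals, so treat this as an illustration rather than the actual test):

```python
import pytest
from aiida import orm
from aiida.common import exceptions


@pytest.mark.parametrize(
    'fallback_get_succeeds', (True, False), ids=('fallback_get_succeeds', 'fallback_get_misses_rebuilds')
)
def test_aiida_computer_recovers_from_integrity_error(aiida_computer, monkeypatch, fallback_get_succeeds):
    """Exercise both branches of the inner ``except IntegrityError`` in the fixture."""
    label = 'computer-race-test'

    if fallback_get_succeeds:
        # Pre-create the row the "racing worker" would have stored, so the fallback ``get()`` can find it.
        aiida_computer(label=label)

    original_get = orm.Computer.collection.get
    original_store = orm.Computer.store
    get_calls = {'count': 0}

    def fake_get(*args, **kwargs):
        get_calls['count'] += 1
        if get_calls['count'] == 1:
            raise exceptions.NotExistent('simulated initial miss')
        if get_calls['count'] == 2 and not fallback_get_succeeds:
            raise exceptions.NotExistent('simulated fallback miss')
        return original_get(*args, **kwargs)

    def fake_store(self, *args, **kwargs):
        if get_calls['count'] == 1:
            # First ``store()`` after the initial miss: pretend a concurrent worker won the race.
            raise exceptions.IntegrityError('simulated duplicate computer')
        return original_store(self, *args, **kwargs)

    monkeypatch.setattr(orm.Computer.collection, 'get', fake_get)
    monkeypatch.setattr(orm.Computer, 'store', fake_store)

    computer = aiida_computer(label=label)
    assert computer.label == label
```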
GeigerJ2
left a comment
LGTM! Will squash-merge now.
I added some commits on top so that the `IntegrityError` fallback now handles both the common race (the fallback `get()` finds the racing worker's row) and the rarer double-miss case (rebuild and store fresh). In addition, the caller-provided `minimum_job_poll_interval` / `default_mpiprocs_per_machine` now propagate through all recovery paths, so they are never silently dropped. Tests cover both branches.
Tried to fix issue #7347 here, but this is just a temporary helper fix. It seems to address the category 1 problem, but it's more of a band-aid than a real fix.