Profile data dumping #6723
…an`. Exclude `_dumping` from codecov
for more information, see https://pre-commit.ci
…cmethod of `DumpPaths`.
Data dumping for groups and profiles (#6723)

This PR adds functionality to incrementally dump profile and group data.

**Public API**: The data dumping feature can be used from the CLI via the … The internal implementation of the feature is contained in the private …

**Configuration of dumping**: To organize the extensive options, a data class for the config options is …

**State of dumping folder**: The dumping functionality tracks the state of the dumped folder wrt. the …

**Execution of the dump**: The …

Finally, code that was previously contained in the …
agoscinski left a comment
Thanks for the huge work. Let's merge this giant.
Congratulations 🎊 Hard to imagine how hard it is to bring > 7000 lines of changes into aiida-core, nice work.
Don't forget to open an issue if you still think it is a good idea 😉
Cheers dude! It's probably easier than a fundamental 5-line change bc nobody wants or can review it in detail 😆 (apart from @agoscinski ofc 🫶)
Yeah, I have a meta-issue to track improvements: #6816
So, while I finalize things here, I'd say this is ready for a first review. At least the user-facing CLI in `cmd_process.py`, `cmd_group.py`, and `cmd_profile.py`, as well as the user-facing Python API in `src/aiida/tools/dumping/facades.py`, which defines the `ProcessDumper`, `GroupDumper`, and `ProfileDumper` facades (simple, user-facing entry points). For anybody reviewing, start from there.

I'd say we release the feature as a kind of beta, testing version. Testing all the edge cases and the internal implementation probably won't be possible before the planned release. I made almost all methods private, and the user doesn't interact with the internal implementation at all anyway, so we can still modify things there without worrying about backwards compatibility. Maybe someone with more experience can chime in on what a common approach is here. One could also prepend the internal classes with a leading underscore, or move the files to a subdirectory like `_internal`, but I haven't seen that anywhere else in the code base, so maybe it's fine as it is.

The main components of the implementation are the following:
dumping/
├── config.py: `DumpConfig` pydantic model that holds the various configuration options (as well as some helper Enum classes)
├── detect.py: `DumpChangeDetector` class to detect changes between dump operations (based on information from the JSON log file), and `DumpNodeQuery` class to query nodes from AiiDA's DB with filters (e.g., time-based, code, computer, already dumped nodes, etc.)
├── engine.py: Top-level orchestrator of the dumping operation (setup and teardown operations, e.g., initialize classes such as the `DumpLogger`, prepare the output directory, perform the dump, save the log)
├── facades.py: User-facing `Dumper` classes with `from_config`, `dump`, and methods to verify the passed AiiDA entities
├── logger.py: Mainly the `DumpLogger` class that keeps track of dumped entities and their paths, the dump time, and the groups-to-nodes and nodes-to-groups mappings
├── managers/ (classes responsible for actually executing the various dump operations)
│   ├── deletion.py: `DeletionManager` class that takes care of deleting directories and log entries for entities deleted from AiiDA's DB between dump operations (as detected by the `DumpChangeDetector`)
│   ├── process.py: `ProcessDumpManager` to orchestrate dumping of processes, and helper classes (in the previously released `verdi process dump` feature, all functionality was contained in the `ProcessDumper` class instead; this is now more modular). For each encountered process, there are various possible actions: `skip` (node already dumped and not necessary to dump again), `dump_primary` (first, "normal" node dump), `dump_duplicate` (node already dumped elsewhere, but dump again, e.g., for a duplicated group or a node contained in two groups), and `symlink` (node dump directory already exists, so make a symlink)
│   └── profile.py: `ProfileDumpManager` class that orchestrates all necessary operations when dumping a profile (group and node deletions, group updates (relabel, node removal/addition), dumping of new nodes and groups)
├── mapping.py: `GroupNodeMapping` class that holds the group-to-nodes and nodes-to-groups mappings, as well as functionality to get the mappings from AiiDA's DB and calculate the diff between two mappings (used by the `DumpChangeDetector` and stored via the `DumpLogger`)
└── utils/
    ├── helpers.py: Various helper classes (mainly dataclasses), e.g., `DumpTimes` to track last and current dump time, containers for group and node changes, store classes to hold entities to be dumped/deleted
    └── paths.py: `DumpPaths` class to track the top-level dump path and sub-paths during the dumping, and that compiles various staticmethods for path modifications during the dumping

@unkcpz mentioned using a tree data structure to represent the dumping directory/relationships, rather than the "flat" log I have now. This aligns well with the data organization via groups and nested workflows, and it could also allow for quick diffs, so I think it's a good idea. However, if I completely modify the implementation now, the feature will never make it into v2.7, so I'd rather make an issue and spend some time investigating that approach later on.
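The four per-process actions (`skip`, `dump_primary`, `dump_duplicate`, `symlink`) can be illustrated with a small, self-contained sketch. This is hypothetical decision logic, not the actual `ProcessDumpManager` code; the predicate arguments (`already_logged`, `target_exists`, `prefer_symlink`) are invented for illustration.

```python
from enum import Enum


class DumpAction(Enum):
    SKIP = 'skip'                      # already dumped, nothing to do
    DUMP_PRIMARY = 'dump_primary'      # first, "normal" dump of the node
    DUMP_DUPLICATE = 'dump_duplicate'  # dump again, e.g., node in two groups
    SYMLINK = 'symlink'                # directory exists elsewhere, link to it


def decide_action(already_logged: bool, target_exists: bool, prefer_symlink: bool) -> DumpAction:
    """Pick the dump action for one process node (hypothetical logic)."""
    if target_exists:
        # A dump directory for this node already exists elsewhere:
        # either link to it, or write a duplicate copy.
        return DumpAction.SYMLINK if prefer_symlink else DumpAction.DUMP_DUPLICATE
    if already_logged:
        # Logged in a previous dump run and nothing changed: skip it.
        return DumpAction.SKIP
    return DumpAction.DUMP_PRIMARY
```

The real manager would derive these flags from the `DumpLogger` state and the filesystem before dispatching to the corresponding dump routine.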
@edan-bainglass mentioned adding `.dump` methods to the ORM classes. I also think this is interesting, and it could possibly be added already now, at least for processes, groups, and profiles. One can do that by just instantiating the facade classes, passing any configuration options via `kwargs`, and calling `dump` on the facade, similar to how the functionality is exposed via the CLI. In addition, we discussed removing the (quite lengthy) Python dictionary representations of the dump output directory structures I currently use for the integration tests of the top-level dumper classes, in favor of regression tests or some other representation such as YAML.

@mikibonacci, I changed the default process dump path such that it uses the node label, if one is available. Could you please check if it now works for your AiiDAlab QE use case?
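To sketch how such an ORM-level `.dump` could delegate to the facade: the snippet below uses stand-in stub classes (`ProcessDumper`, `ProcessNode` here are simplified stand-ins, not the real aiida-core classes, and the `overwrite` kwarg is invented), just to show the delegation pattern.

```python
from pathlib import Path


class ProcessDumper:
    """Stub facade: the real one validates the entity and orchestrates the dump."""

    def __init__(self, process, **config_kwargs):
        self.process = process
        self.config = config_kwargs  # forwarded configuration options

    def dump(self, output_path: Path) -> Path:
        # The real facade would walk the process and write its files here;
        # the stub only computes the (label-based) target directory.
        return output_path / f'{self.process.label}-{self.process.pk}'


class ProcessNode:
    """Stub ORM node carrying just the attributes the sketch needs."""

    def __init__(self, pk: int, label: str):
        self.pk, self.label = pk, label

    def dump(self, output_path: Path, **kwargs) -> Path:
        # ORM convenience method: instantiate the facade and forward kwargs.
        return ProcessDumper(self, **kwargs).dump(output_path)


target = ProcessNode(pk=42, label='relax').dump(Path('dumps'), overwrite=True)
```

The point is that the ORM method stays a thin wrapper, so the facade remains the single entry point for configuration and validation.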
Also pinging other people in the team for notification and dog-fooding: @mbercx, @agoscinski, @khsrali, @superstar54.
Other notes:
- `--delete-missing` option?
- `graph_traversal_rules` when updating directories after a node was deleted.
- `graph_traversal_rules` and add `get_nodes_dump` to `src/aiida/tools/graph/graph_traversers.py`, as well as `AiidaEntitySet` from `src/aiida/tools/graph/age_entities.py`, etc., to first obtain the nodes, and then run the dumping.
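The "first obtain the nodes, then run the dumping" split suggested in the last note can be sketched in isolation. This is a plain BFS over a toy adjacency map, not aiida-core's actual graph traversers; the `rule` predicate stands in for `graph_traversal_rules`, and the pk/link-type data are invented.

```python
from collections import deque


def collect_nodes(start_pks, edges, rule):
    """BFS over a ``pk -> [(neighbor_pk, link_type)]`` map.

    Follows only links accepted by ``rule`` and returns the set of
    reachable pks; dumping would then run over this fixed node set.
    """
    seen = set(start_pks)
    queue = deque(start_pks)
    while queue:
        pk = queue.popleft()
        for neighbor, link_type in edges.get(pk, []):
            if rule(link_type) and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Separating collection from dumping this way would let the same traversal rules drive both the initial dump and later directory updates after deletions.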