Profile data dumping #6723
…an`. Exclude `_dumping` from codecov
for more information, see https://pre-commit.ci
…cmethod of `DumpPaths`.
Data dumping for groups and profiles (#6723)

This PR adds functionality to incrementally dump profile and group data.

**Public API**: The data dumping feature can be used from the CLI via the … The internal implementation of the feature is contained in the private …

**Configuration of dumping**: To organize the extensive options, a data class for the config options is …

**State of dumping folder**: The dumping functionality tracks the state of the dumped folder wrt. the …

**Execution of the dump**: The …

Finally, code that was previously contained in the …
agoscinski left a comment
Thanks for the huge work. Let's merge this giant.
Congratulations 🎊 Hard to imagine how hard it is to bring > 7000 lines of changes into aiida-core, nice work.
Don't forget to open an issue if you still think it is a good idea 😉
Cheers dude! It's probably easier than a fundamental 5-line change bc nobody wants or can review it in detail 😆 (apart from @agoscinski ofc 🫶)
Yeah, I have a meta-issue to track improvements: #6816
So, while I finalize things here, I'd say this is ready for a first review. At least the user-facing CLI in `cmd_process.py`, `cmd_group.py`, and `cmd_profile.py`, as well as the user-facing Python API in `src/aiida/tools/dumping/facades.py`, which defines the `ProcessDumper`, `GroupDumper`, and `ProfileDumper` facades (simple, user-facing entry points). For anybody reviewing, start from there.

I'd say we release the feature as a kind of beta, testing version. Testing all the edge cases and the internal implementation probably won't be possible before the planned release. I made almost all methods private, and the user doesn't interact with the internal implementation at all anyway, so we can still modify things there without worrying about backwards compatibility. Maybe someone with more experience can chime in on what a common approach is here. One could also prepend the internal classes with a leading underscore, or move the files to a subdirectory like `_internal`, but I haven't seen that anywhere else in the code base, so maybe it's fine as it is.

The main components of the implementation are the following:
dumping/
├── config.py: `DumpConfig` pydantic model that holds the various configuration options (as well as some helper Enum classes)
├── detect.py: `DumpChangeDetector` class to detect changes between dump operations (based on information from the JSON log file), and `DumpNodeQuery` class to query nodes from AiiDA's DB with filters (e.g., time-based, code, computer, already dumped nodes, etc.)
├── engine.py: Top-level orchestrator of the dumping operation (setup and teardown operations, e.g., initialize classes such as the `DumpLogger`, prepare the output directory, perform the dump, save the log)
├── facades.py: User-facing `Dumper` classes with `from_config`, `dump`, and methods to verify the passed AiiDA entities
├── logger.py: Mainly the `DumpLogger` class that keeps track of dumped entities and their paths, the dump time, and the groups-to-nodes and nodes-to-groups mappings
├── managers/ (classes responsible for actually executing the various dump operations)
│   ├── deletion.py: `DeletionManager` class that takes care of deleting directories and log entries for entities deleted from AiiDA's DB between dump operations (as detected by the `DumpChangeDetector`)
│   ├── process.py: `ProcessDumpManager` to orchestrate dumping of processes, and helper classes (in the previously released `verdi process dump` feature, all functionality was contained in the `ProcessDumper` class instead; this is now more modular). For each encountered process, there are various possible actions: `skip` (node already dumped and not necessary to dump again), `dump_primary` (first, "normal" node dump), `dump_duplicate` (node already dumped elsewhere, but dump again, e.g., for a duplicated group or a node contained in two groups), and `symlink` (node dump directory already exists, so make a symlink)
│   └── profile.py: `ProfileDumpManager` class that orchestrates all necessary operations when dumping a profile (group and node deletions, group updates (relabel, node removal/addition), dumping of new nodes and groups)
├── mapping.py: `GroupNodeMapping` class that holds the group-to-nodes and nodes-to-groups mappings, as well as functionality to get the mappings from AiiDA's DB and calculate the diff between two mappings (used by the `DumpChangeDetector` and stored via the `DumpLogger`)
└── utils/
    ├── helpers.py: Various helper classes (mainly dataclasses), e.g., `DumpTimes` to track last and current dump time, containers for group and node changes, store classes to hold entities to be dumped/deleted
    └── paths.py: `DumpPaths` class to track the top-level dump path and sub-paths during the dumping, and that compiles various staticmethods for path modifications during the dumping

@unkcpz mentioned using a tree data structure to represent the dumping directory/relationships, rather than the "flat" log I have now. This aligns well with the data organization via groups and nested workflows, and it could also allow for quick diffs, so I think it's a good idea. However, if I completely modify the implementation now, the feature will never make it into v2.7, so I'd rather make an issue and spend some time investigating that approach later on.
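The four per-process actions (`skip`, `dump_primary`, `dump_duplicate`, `symlink`) can be illustrated with a small, self-contained sketch. This is hypothetical decision logic, not the actual `ProcessDumpManager` code; the predicate arguments (`already_logged`, `target_exists`, `prefer_symlink`) are invented for illustration.

```python
from enum import Enum


class DumpAction(Enum):
    SKIP = 'skip'                      # already dumped, nothing to do
    DUMP_PRIMARY = 'dump_primary'      # first, "normal" dump of the node
    DUMP_DUPLICATE = 'dump_duplicate'  # dump again, e.g., node in two groups
    SYMLINK = 'symlink'                # directory exists elsewhere, link to it


def decide_action(already_logged: bool, target_exists: bool, prefer_symlink: bool) -> DumpAction:
    """Pick the dump action for one process node (hypothetical logic)."""
    if target_exists:
        # A dump directory for this node already exists elsewhere:
        # either link to it, or write a duplicate copy.
        return DumpAction.SYMLINK if prefer_symlink else DumpAction.DUMP_DUPLICATE
    if already_logged:
        # Logged in a previous dump run and nothing changed: skip it.
        return DumpAction.SKIP
    return DumpAction.DUMP_PRIMARY
```

The real manager would derive these flags from the `DumpLogger` state and the filesystem before dispatching to the corresponding dump routine.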
@edan-bainglass mentioned adding `.dump` methods to the ORM classes. I also think this is interesting, and it could possibly be added already now, at least for processes, groups, and profiles. One can do that by just instantiating the facade classes, passing any configuration options via `kwargs`, and calling `dump` on the facade, similar to how the functionality is exposed via the CLI. In addition, we discussed removing the (quite lengthy) Python dictionary representations of the dump output directory structures I currently use for the integration tests of the top-level dumper classes, in favor of regression tests or some other representation such as YAML.

@mikibonacci, I changed the default process dump path such that it uses the node label, if one is available. Could you please check if it now works for your AiiDAlab QE use case?
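To sketch how such an ORM-level `.dump` could delegate to the facade: the snippet below uses stand-in stub classes (`ProcessDumper`, `ProcessNode` here are simplified stand-ins, not the real aiida-core classes, and the `overwrite` kwarg is invented), just to show the delegation pattern.

```python
from pathlib import Path


class ProcessDumper:
    """Stub facade: the real one validates the entity and orchestrates the dump."""

    def __init__(self, process, **config_kwargs):
        self.process = process
        self.config = config_kwargs  # forwarded configuration options

    def dump(self, output_path: Path) -> Path:
        # The real facade would walk the process and write its files here;
        # the stub only computes the (label-based) target directory.
        return output_path / f'{self.process.label}-{self.process.pk}'


class ProcessNode:
    """Stub ORM node carrying just the attributes the sketch needs."""

    def __init__(self, pk: int, label: str):
        self.pk, self.label = pk, label

    def dump(self, output_path: Path, **kwargs) -> Path:
        # ORM convenience method: instantiate the facade and forward kwargs.
        return ProcessDumper(self, **kwargs).dump(output_path)


target = ProcessNode(pk=42, label='relax').dump(Path('dumps'), overwrite=True)
```

The point is that the ORM method stays a thin wrapper, so the facade remains the single entry point for configuration and validation.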
Also pinging other people in the team for notification and dog-fooding: @mbercx, @agoscinski, @khsrali, @superstar54.
Other notes:
- `--delete-missing` option?
- `graph_traversal_rules` when updating directories after a node was deleted.
- `graph_traversal_rules` and add `get_nodes_dump` to `src/aiida/tools/graph/graph_traversers.py`, as well as `AiidaEntitySet` from `src/aiida/tools/graph/age_entities.py`, etc., to first obtain the nodes, and then run the dumping.
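The "first obtain the nodes, then run the dumping" split suggested in the last note can be sketched in isolation. This is a plain BFS over a toy adjacency map, not aiida-core's actual graph traversers; the `rule` predicate stands in for `graph_traversal_rules`, and the pk/link-type data are invented.

```python
from collections import deque


def collect_nodes(start_pks, edges, rule):
    """BFS over a ``pk -> [(neighbor_pk, link_type)]`` map.

    Follows only links accepted by ``rule`` and returns the set of
    reachable pks; dumping would then run over this fixed node set.
    """
    seen = set(start_pks)
    queue = deque(start_pks)
    while queue:
        pk = queue.popleft()
        for neighbor, link_type in edges.get(pk, []):
            if rule(link_type) and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Separating collection from dumping this way would let the same traversal rules drive both the initial dump and later directory updates after deletions.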