Skip to content

ORM: Use pydantic to specify a schema for each ORM entity#6255

Merged
edan-bainglass merged 13 commits into
aiidateam:mainfrom
sphuber:feature/orm-pydantic-models
Apr 30, 2025
Merged

ORM: Use pydantic to specify a schema for each ORM entity#6255
edan-bainglass merged 13 commits into
aiidateam:mainfrom
sphuber:feature/orm-pydantic-models

Conversation

@sphuber
Copy link
Copy Markdown
Contributor

@sphuber sphuber commented Jan 18, 2024

This PR addresses AEP 010 by introducing a schema interface for AiiDA's ORM entities using pydantic. The goal is to make the structure and content of ORM classes introspectable and serializable, enabling external applications—such as web APIs or CLIs—to interface with AiiDA in a programmatic, schema-driven way. Each core ORM class now defines a Model subclass that exposes its fields using rich metadata annotations. These models support conversion to and from ORM instances (_from_model, _to_model), as well as a user API yielding JSON-compatible formats (from_serialized, serialize). The serialization mechanism is experimental and is stated to the user as such on use of the serialization methods.

In the process of implementing the original proposal, the MetadataField utility was extended with additional metadata such as exclude_to_orm, exclude_from_cli, orm_class, orm_to_model, and model_to_orm, enabling precise control over how fields behave across layers. A helper function get_metadata was added to facilitate metadata retrieval from FieldInfo objects.

The PR refactors the AuthInfo, Comment, and PortableCode classes to adopt this new schema interface. For AuthInfo and Comment, equality checks are now defined based on the models to improve testability and object comparison. The PortableCode now also supports PurePath for the filepath_executable on initialization, though the type of its model's filepath_files is corrected to str only, reflecting how it is stored in the database. The serialization of repository files is unified for the filepath_files field and yaml-export functionality.

For objects with repository-stored files, on serialization, files are written to a user-specified directory (or a temporary one if not provided) and the paths are stored in the serialized model. The from_serialized method can then be used to reconstruct the object, including its files, from the serialized data.

Furthemore, CLI support was also expanded: dynamic option generation now respects exclude_from_cli and allows prioritizing or customizing CLI arguments using schema metadata. The create_code command and dynamic group listing logic were updated accordingly to use the new models for constructing instances.

Additionally, entry points were registered for core ORM classes to improve plugin discoverability, and a new FilePath typing alias was introduced for shared use across the codebase.

@sphuber sphuber force-pushed the feature/orm-pydantic-models branch 3 times, most recently from 3848afd to c253259 Compare January 22, 2024 12:53
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from c253259 to fc52f41 Compare January 30, 2024 16:11
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch 2 times, most recently from 49379d9 to d1659b4 Compare March 12, 2024 07:00
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch 3 times, most recently from a7a6065 to 069bd1f Compare March 20, 2024 16:45
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch 2 times, most recently from 655fcf0 to d778da8 Compare March 25, 2024 12:35
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from 07f3001 to 054c43c Compare March 27, 2024 21:55
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from 054c43c to 1c65856 Compare April 19, 2024 12:15
@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from 7c7ef64 to 1e7f536 Compare April 19, 2024 12:58
@danielhollas
Copy link
Copy Markdown
Collaborator

@sphuber just wondering, have you checked how this affects performance?
Does it influences import aiida.orm at all?

mikibonacci added a commit to mikibonacci/aiida-atomistic that referenced this pull request May 8, 2024
This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255
def serialize_computer(self, computer: Computer, _info):
return computer.label

# @model_validator(mode='after') # type: ignore[misc]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like some code that was forgotten

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking out this PR @agoscinski . For some background info: I commented this out on purpose for now. Ideally, this check should be there because when storing the code, this condition should be satisfied. Unfortunately, however, this now also prevents just instantiating this model that validates the check, even if we don't plan on storing it. This is problematic for the export behavior, because that would rely on instantiating a Model that violates this uniqueness rules, even though in this case it doesn't apply.

The old CLI verdi code create core.code.installed would also rely on something like this check to immediately return an error as soon as label was specified in interactive mode and that label already existed for the same computer. It would then reprompt the user asking for a different label. I haven't been able to reimplement this feature in this PR so was keeping the comment as a reminder.

Safe to say that this PR is not completely done yet 😅

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still "in progress"?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber I'm looking at this now. Trying to probe the issue you mention in this thread. I did the following:

  1. Create a hello_world@localhost installed code using verdi
  2. Load the code and serialize it
  3. Create an InstalledCode instance from 2
    I expected an issue here, as from_serialized leverages the model, so the validator should've been triggered, raising an exception, as the label + computer already exist! No exception though.

As for your comment regarding the CLI, are you saying a similar validation should be implemented there? Sounds like it, but not in this PR, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summarizing a recent meeting with @sphuber:

  • The currently commented out validator on the Code.Model is intended as a check on the uniqueness of Code label+computer.label tuples
  • However, one use of the pydantic Model of a code, the dynamic creation of the interactive options of verdi code create..., requires the Model only for access to the available properties. Unfortunately, the validator will fire on instantiation from the serialized code, which will conflict with the uniqueness constraint, making it impossible to use the Model in this case
  • The move to pydantic Models for this CLI application regresses the user experience in the sense that it does not fire the validator until after the user finishes the interactive part, even though the tuple is already available within the first three options (note that this is not tied to this PR, as the use of pydantic Models for this case was implemented in an earlier PR)

The CLI matter is not blocking for this PR, as I understand it. A solution can be explored in a follow-up PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Thanks for clarifying this matter 🙏

Copy link
Copy Markdown
Member

@edan-bainglass edan-bainglass Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think maybe there's another potentially critical issue here.

I tried reconstructing an installed code instance. With the validation active on the model, I first need to make the above change (and related changes) to Entity.to_model(). This allows me to serialize the code. However, deserializing it using orm.InstalledCode.from_serialized(**code.serialize()) also raises the validation error. But here we shouldn't assume pre-validated input, so we can't use the model_construct solution in from_serialized.

I'm not sure how to get around this matter and allow the model validator. A REST API receiving a a serialized Code instance cannot assume the data is valid. In the case that the data came from the same AiiDA instance with which the REST API is linked, the validation will surely object.

@sphuber am I overlooking something?

update

Thinking on this some more, "real" use cases of REST API node transfer will likely be between AiiDA instances, so the collision of the same full code label should not happen? But even if this is true, a specialized warning would be better. Not sure where to implement it though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on this, if no strong opinions regarding specialized warnings, I believe this matter is closed.

Copy link
Copy Markdown
Collaborator

@agoscinski agoscinski Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what your conclusion is @edan-bainglass but my understanding is that you would not replace the constructor to model_construct which I agree with. Since this would skip all validation as well as any resolving of nested members. It seems quite clear from the whole conversation that validations related to database relationships (like in this case uniqueness) should not be part of the pydantic model (at least in AiiDA which has so far separated the storage from the initialization) but should be moved to another level. If we agree on this we can remove this validation with referencing an issue about the CLI validation.

EDIT: Note that this PR establishes the serialization of objects through their unique identifier (node --serialization--> pk) which relies on the fact on deserialization that the node is present in the database. Such validation checks would break any of these cases. I think this just enforces the argument that such validations should not be in the pydantic model and rather be implemented on endpoints such as the CLI that are responsible for initialization and storing of nodes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agoscinski getting a chance to look at this once more as part of the REST API work. Regarding the use of model_construct, as I see it now, it seems "safe" to use it in Entity.to_model, as there is no need for validation here. The entity is already in the DB. We SHOULD NOT use it for deserialization (construction from model), which we are not.

Comment thread pyproject.toml Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really necessary to make orm pluggable, what is the use case of it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber correct me if I'm wrong, but this has to do with the use of orm_class, right? Under the implementation section of the AEP, it states that orm_class is a shortcut to model_to_orm used to define a model field that points to an ORM instance. So instead of

class Model(...):
    user = MetadataField(
        ...,
        model_to_orm=lambda pk: User.collection.get(id=pk),
    )

you can do

class Model(...):
    user = MetadataField(
        ...,
        orm_class=User,
    )

The various ORM classes that may be used in orm_class are then defined in project.entry-points.'aiida.orm', so that you can do in Entity.model_to_orm_field_values(...):

if orm_class := get_metadata(field, 'orm_class'):
    if isinstance(orm_class, str):
        orm_class = BaseFactory('aiida.orm', orm_class)
        fields[key] = orm_class.collection.get(id=field_value)

But then I guess my question is, do these not already have entry points? Or is the problem that you can't fetch them all generically using the BaseFactory?

@sphuber sphuber force-pushed the feature/orm-pydantic-models branch from a7c6c13 to 9816200 Compare July 8, 2024 07:27
@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 8, 2024

Codecov Report

❌ Patch coverage is 94.10569% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.44%. Comparing base (eb34b06) to head (863f9ca).
⚠️ Report is 60 commits behind head on main.

Files with missing lines Patch % Lines
src/aiida/orm/entities.py 87.84% 9 Missing ⚠️
src/aiida/orm/comments.py 70.59% 5 Missing ⚠️
src/aiida/orm/authinfos.py 86.96% 3 Missing ⚠️
src/aiida/orm/fields.py 81.25% 3 Missing ⚠️
src/aiida/common/pydantic.py 83.34% 2 Missing ⚠️
src/aiida/orm/nodes/node.py 93.11% 2 Missing ⚠️
src/aiida/cmdline/groups/dynamic.py 87.50% 1 Missing ⚠️
src/aiida/orm/groups.py 94.12% 1 Missing ⚠️
src/aiida/orm/nodes/data/folder.py 0.00% 1 Missing ⚠️
src/aiida/orm/nodes/data/singlefile.py 93.75% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6255      +/-   ##
==========================================
+ Coverage   78.29%   78.44%   +0.15%     
==========================================
  Files         566      567       +1     
  Lines       42766    43063     +297     
==========================================
+ Hits        33481    33776     +295     
- Misses       9285     9287       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@khsrali
Copy link
Copy Markdown
Collaborator

khsrali commented Jul 11, 2024

It would be nice to have some benchmarks,
Just to see if these changes would slow down AiiDA and if, how much. So to evaluate pros and cons.

agoscinski pushed a commit to aiidateam/aiida-atomistic that referenced this pull request Jul 17, 2024
This is implemented following the pydantic PR for aiida-core: aiidateam/aiida-core#6255
edan-bainglass and others added 8 commits April 30, 2025 06:36
try to fix filepath_files by introducing byte storage for (de)serialization

make filepath_files default None also in pydantic model

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

wip different approach including repo export

add tmp_path

update regression test

cleanup code, introduce now path as second variable

put to_model, from_model private, fix missing repository_path

add serialization tests

mark serialize and from_serialized as private

Revert "mark serialize and from_serialized as private"

This reverts commit b09a7b9.

Throw warning when serialization interface is used

Add tests for codes

apply review after talking with geigerj2

apply review after talking with geigerj2

fix regression tests

Add path support for filepath_executable

remove code that was purely for debugging

add conversion of path to str to support path for filepath_executable

use py3.9 compatible replacement for Path.walk
Issues has been solved in aiidateam#6824 and split to other issues since it needs
a solution outside of the pydantic model.
@agoscinski agoscinski force-pushed the feature/orm-pydantic-models branch from aa70388 to eb93bf2 Compare April 30, 2025 04:36
agoscinski and others added 3 commits April 30, 2025 08:08
For now only `FilePath` has been added and replaced everywhere.
This was a misunderstanding so far: while the object initializer
supports path as input, we store the path in the database always as
string. Since the pydantic model is used for query results to the
database we need to specify the type of the object in the database.
@agoscinski
Copy link
Copy Markdown
Collaborator

agoscinski commented Apr 30, 2025

We solved the problems with the CI I mentioned in my last post in #6824 by using the same approach as in the export to yaml functionality in PortableCode. We dump the repo files in a folder that is referenced on serialization. This is why orm_to_model and subsequently serialize now takes a path argument to dump potentially files from the repo to a folder. If no folder is passed we use a temporary folder. This also can be used in the future to handle the issue with the large bytearrays we now store for some orm data types in the model.

While working on this PR I consistently find subtle issues that indicate to me that the feature is not stable yet, but do not prevent users to try it out out. Therefore I put a warning when the serialize function is used that this is an experimental feature and is likely to break in future releases. I created an issue to track the overall progress of the model integration #6842. From my side with these changes this PR is mergable.

@edan-bainglass
Copy link
Copy Markdown
Member

Thanks @agoscinski. This looks great! The new common.typing module will be useful going forward. And thanks for the clear explanation regarding str types in the Model.

@edan-bainglass edan-bainglass merged commit 958bfd0 into aiidateam:main Apr 30, 2025
25 checks passed
@github-project-automation github-project-automation Bot moved this from In progress to Done in aiida-core v2.7.0 Apr 30, 2025
GeigerJ2 pushed a commit to GeigerJ2/aiida-core that referenced this pull request May 1, 2025
…6255)

This PR addresses [AEP
010](https://github.com/aiidateam/AEP/blob/983a645c9285ba65c7cf07fe6064c23e7e994c06/010_orm_schema/readme.md)
by introducing a schema interface for AiiDA's ORM entities using `pydantic`. The
goal is to make the structure and content of ORM classes introspectable and
serializable, enabling external applications—such as web APIs or CLIs—to
interface with AiiDA in a programmatic, schema-driven way. Each core ORM class
now defines a `Model` subclass that exposes its fields using rich metadata
annotations. These models support conversion to and from ORM instances
(`_from_model`, `_to_model`), as well as a user API yielding JSON-compatible
formats (`from_serialized`, `serialize`). The serialization mechanism is
experimental and is stated to the user as such on use of the serialization
methods.

In the process of implementing the original proposal, the `MetadataField`
utility was extended with additional metadata such as `exclude_to_orm`,
`exclude_from_cli`, `orm_class`, `orm_to_model`, and `model_to_orm`, enabling
precise control over how fields behave across layers. A helper function
`get_metadata` was added to facilitate metadata retrieval from `FieldInfo`
objects.

The PR refactors the `AuthInfo`, `Comment`, and `PortableCode` classes to adopt
this new schema interface. For `AuthInfo` and `Comment`, equality checks are now
defined based on the models to improve testability and object comparison. The
`PortableCode` now also supports `PurePath` for the `filepath_executable` on
initialization, though the type of its model's `filepath_files` is corrected to
`str` only, reflecting how it is stored in the database. The serialization of
repository files is unified for the `filepath_files` field and yaml-export
functionality.

For objects with repository-stored files, on serialization, files are written to
a user-specified directory (or a temporary one if not provided) and the paths
are stored in the serialized model. The `from_serialized` method can then be
used to reconstruct the object, including its files, from the serialized data.

Furthemore, CLI support was also expanded: dynamic option generation now
respects `exclude_from_cli` and allows prioritizing or customizing CLI arguments
using schema metadata. The `create_code` command and dynamic group listing logic
were updated accordingly to use the new models for constructing instances.

Additionally, entry points were registered for core ORM classes to improve
plugin discoverability, and a new `FilePath` typing alias was introduced for
shared use across the codebase.

---------

Co-authored-by: Edan Bainglass <edan.bainglass@gmail.com>
Co-authored-by: Alexander Goscinski <alex.goscinski@posteo.de>
Co-authored-by: Alexander Goscinski <alexander.goscinski@posteo.de>
khsrali pushed a commit that referenced this pull request May 9, 2025
…6844)

pydantic models merged with PR #6255 and commit 958bfd0 missed two classes. 

This commit Update/add models for `RemoteData` and `RemoteStashCompressedData`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants