Skip to content

Corrects the Log migrations, improves import/export#2447

Merged
sphuber merged 2 commits into
aiidateam:provenance_redesignfrom
szoupanos:issue_1759_vb3
Feb 7, 2019
Merged

Corrects the Log migrations, improves import/export#2447
sphuber merged 2 commits into
aiidateam:provenance_redesignfrom
szoupanos:issue_1759_vb3

Conversation

@szoupanos
Copy link
Copy Markdown
Contributor

This PR improves the DbLog migrations for Django & SQLA that were created for issues #1102 and
#1759 with PR #2393.

More specifically the changes are described in issue #2423.
Also the objuuid was removed and objpk is used.
Import/export was adapted by @CasperWA and log is exported as Node records (with their pk which is recreated during export/import to avoid pk collisions).

We have to see what happens when a node is re-imported with more log entries - We should import the new log entries based on their UUID. A new ticket should be created for this.

@coveralls
Copy link
Copy Markdown

coveralls commented Feb 3, 2019

Coverage Status

Coverage increased (+0.1%) to 69.71% when pulling f4a5757 on szoupanos:issue_1759_vb3 into 309b2f2 on aiidateam:provenance_redesign.

Copy link
Copy Markdown
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor changes and a question

Comment thread aiida/backends/djsite/db/migrations/0024_dblog_update.py Outdated
Comment thread aiida/backends/djsite/db/migrations/0024_dblog_update.py Outdated
Comment thread aiida/backends/djsite/db/migrations/0024_dblog_update.py Outdated
Comment thread aiida/backends/djsite/db/subtests/migrations.py Outdated
Comment thread aiida/backends/djsite/db/subtests/migrations.py Outdated
Comment thread aiida/backends/sqlalchemy/models/log.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to demand it only be passed in the kwargs or directly as metadata? Who is calling this constructor, only us correct? Think it might be better to make sure of this from the caller side.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: why should one use connect here and begin in the actual tests?

Comment thread aiida/orm/implementation/django/logs.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dbnode_id is required correct? So should not have None as default

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, dbnode_id should not have a default

Comment thread aiida/orm/logs.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove default for dbnode_id

@sphuber sphuber force-pushed the issue_1759_vb3 branch 2 times, most recently from c228045 to 476f850 Compare February 4, 2019 14:19
@sphuber
Copy link
Copy Markdown
Contributor

sphuber commented Feb 4, 2019

@giovannipizzi one thing I just realized is that the report method of the Process class prints the pk of its node in the header of the log message, e.g.:

[122|PwBandsWorkChain|run_relax]: launching PwRelaxWorkChain<124>

In addition, often people (like myself) will use pks in the message itself to reference other nodes. Of course both of these will no longer make sense when the reports are imported into another database. Of course we cannot migrate this since we do not know what part of the string is intended as a pk. However, we probably should make this clear to developers and maybe advice to use UUIDs instead. As for the prefix, that is added by us in the report method. Maybe we should already start using UUIDs there, or make it dynamic in the __str__ of a log record and upon formatting print the pk corresponding to the dbnode_id of the record. I don't think we can/should address these issues in this PR but wanted to have your thoughts

Copy link
Copy Markdown
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes

Comment thread aiida/backends/djsite/db/migrations/0024_dblog_update.py Outdated
Comment thread aiida/backends/djsite/db/migrations/0024_dblog_update.py Outdated
Comment thread aiida/backends/djsite/db/migrations/0024_dblog_update.py Outdated
Comment thread aiida/backends/djsite/db/migrations/0024_dblog_update.py Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I get here, I suggest a different approach:

  • define two utility functions to get the 'Django query' object, and then call .count(), .delete(), .values, ... where appropriate. Now you define that methods for some of them (and you don't always use them) and here you have to write again the query

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do this, even if I don't find it a big duplication of code and I did it like this in order to be very comparable to what happens at the SQLA migrations (where there are SQL queries and we can not perform what you propose).

Comment thread aiida/backends/sqlalchemy/migrations/versions/041a79fc615f_dblog_cleaning.py Outdated
Comment thread aiida/backends/sqlalchemy/models/log.py Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same change as suggested for Django

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread aiida/backends/tests/export_and_import.py Outdated
Comment thread aiida/cmdline/commands/cmd_export.py Outdated
@giovannipizzi
Copy link
Copy Markdown
Member

one thing I just realized is that the report method of the Process class prints the pk of its node in the header of the log message, e.g.:

[122|PwBandsWorkChain|run_relax]: launching PwRelaxWorkChain<124>

In addition, often people (like myself) will use pks in the message itself to reference other nodes. Of course both of these will no longer make sense when the reports are imported into another database. Of course we cannot migrate this since we do not know what part of the string is intended as a pk. However, we probably should make this clear to developers and maybe advice to use UUIDs instead.

Agreed. We can't migrate this and it's up to developers, and good to clarify this.

As for the prefix, that is added by us in the report method. Maybe we should already start using UUIDs there, or make it dynamic in the __str__ of a log record and upon formatting print the pk corresponding to the dbnode_id of the record. I don't think we can/should address these issues in this PR but wanted to have your thoughts

Agreed that we fix this in a later PR. In the DB now things are correct. If this is only about printing in the AiiDA log text file maybe having a UUID is not a big deal, though (and probably more convenient)?

@sphuber
Copy link
Copy Markdown
Contributor

sphuber commented Feb 4, 2019

@szoupanos and @giovannipizzi I have addressed most of the comments of Giovanni.

@szoupanos please see if you can address the comments about deduplicating the code in the migrations and making sure that in the reverse operations, the objpk is populated with the dbnode_id to ensure that we do not lose any data. Also the comment about why a transaction is used in the deletion of the non-node log entries I did not answer nor changed the code, please verify that one as well.

Notice that since we have two commits that we want to keep separate, I had to amend the individual commits to affect the changes requested and could not add another commit. Unfortunately then this will make it more difficult to view incremental changes.

@sphuber sphuber force-pushed the issue_1759_vb3 branch 3 times, most recently from 9b53ab4 to 4fea9ae Compare February 4, 2019 22:11
@sphuber
Copy link
Copy Markdown
Contributor

sphuber commented Feb 4, 2019

@szoupanos if you want to adapt your commit tomorrow, you can use git rebase -i HEAD~2 and then mark your commit as edit and keep Casper's as keep. Then it will stop at your commit for you to make changes. You make the changes, call git add -u and then git commit --amend. When that is successful, finally call git rebase --continue and everything should be good. Then simple force push as usual

@giovannipizzi
Copy link
Copy Markdown
Member

@szoupanos I've discussed with @sphuber yesterday. Feel free not to apply my comments where they apply only to code style, considering these are only migrations and not code that has to be maintained.

Instead for the backward migration: as I commented, now we don't prevent going back. Then, we do NOT want to be able to go back in time (not possible, as we delete data) but that if I go forward (applying the migration), then back and then forward again, the data is not destroyed.
This just requires putting back the data in the objpk column, hopefully it's a simple copy-paste and change of code.

@szoupanos
Copy link
Copy Markdown
Contributor Author

@giovannipizzi I had a short discussion with @sphuber (in order to get the bigger picture because I was a bit lost with the comments)
I will create the backward data migrations. I can also re-create the data of the objname (approximately) by copying the node type which I am going to do.

For code, duplication, since we all agree that it is not a major issue, let's leave the code as it is in order to finish this PR ASAP.

@szoupanos szoupanos force-pushed the issue_1759_vb3 branch 2 times, most recently from 4aae372 to 0c767de Compare February 6, 2019 17:34
szoupanos and others added 2 commits February 7, 2019 15:30
In commit `c0fba2e38557fe1ea535d1dc24076cba99212616`, a migration was
introduced for the `DbLog` table in order to support the export and
import of log entries. However, this migration was flawed and would
lose valuable information.

The old `DbLog` was designed to support adding log messages to various
entities. Because these entities were not restricted to just nodes, and
therefore could live in different database tables, a foreign key could
not be used for the relationship, but instead an `objpk` and `objname`
column were used to allow multiple table references. However, for the
import to know whether an exported log record is already present in the
database and therefore should not be imported again, the object pk could
not be used as these do remain the same between databases. To this end
the migration added a UUID column to the `DbLog` table and dropped the
`objpk` column. However, the data was not migrated before doing so and
so the connection between logs and their entities was lost.

After the faulty commit, we realized that the reason for not using the
foreign key was that logs needed to be able to be associated with both
nodes as well as legacy workflows, but since support for the latter had
already been dropped this was no longer necessary. Given that now only
nodes can have associated logs, we can revert the `objpk` to be a proper
foreign key and the `objname` can be dropped. Since the logs for legacy
workflows (and any other unexpected entities) will be dropped from the
database, they should be exported to a file for archival purposed.

In summary, the migrations perform the following logical actions:

 * Export existing logs for legacy worklows and other entities to file
 * Delete records from legacy workflows and other entities from `dblog`
 * Create a foreign key `dbnode_id`
 * Migrate data from `objpk` to `dbnode_id`
 * Create a new column `uuid` for `DbLog` that is unique
 * Delete the `objpk` and `ojbname` columns from `DbLog`
 * Delete `objpk` and `objname` from the `metadata` of `DbLog` records

Note that the addition of the unique `uuid` column had to be done in
three steps. The problem is that the default value for the UUID is
generated by a python function which cannot therefore be done on a
database level but the values have to be set after the creation of the
column. However, as soon as the column is created the uniqueness
constraint is violated since it will be empty for all the existing
records. For that reason the migration is split in three steps. In the
first, the column is created and set as nullable. The second step will
populate the records using the python function to generate the UUIDs and
finally, the third migration will alter the column to be unique.

Note that for Django this migration could be written in a single file,
but for SqlAlchemy it had to be broken into several migrations. This is
because of Alembic, the framework used to perform SqlAlchemy migrations,
which had trouble with setting the UUIDs after creation of the column in
a single transaction. To circumvent this problem, the migration was split
into three separate migrations and Alembic is now instructed to use a
single transaction per revision, mirroring the migration behavior of
Django.
To make the export of `Log` entities possible, the `QueryBuilder` had to
be extended to support retrieving log records for a given `Node`. The
following changes were applied:

 * Internal `log_model_class` has been removed and replaced by `orm.Log`
 * Added support to join node to log (`with_node`)
 * Added support to join log to node (`with_log`)

The `verdi export create` has a new flag to include or exclude the
export of logs for nodes that are to be exported, defaulting to include:

 `--include-logs/--exclude-logs`

Documentation has been updated to include new `QueryBuilder` join args
in table and the `metadata.json` example has been updated in documentation
to include correct Log info.
Copy link
Copy Markdown
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added a comment, that I think is important but if you agree please open an issue/another PR and it's ok to merge this in the meantime

except Exception:
# Bring back the DB to the correct state if this setup part fails
self._revert_database_schema()
raise
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep this raise (the same is also done in SQLA)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add it in PR #2402

@sphuber sphuber merged commit 93bdb16 into aiidateam:provenance_redesign Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants