Add java migrations for backfilling runs with job uuids and parents #1980
Codecov Report
@@             Coverage Diff              @@
##               main    #1980      +/-   ##
============================================
+ Coverage     78.20%   78.62%   +0.42%
- Complexity      955     1003      +48
============================================
  Files           194      197       +3
  Lines          5303     5459     +156
  Branches        420      424       +4
============================================
+ Hits           4147     4292     +145
- Misses          713      723      +10
- Partials        443      444       +1
08dc5c3 to d7082b0
wslulciuc left a comment
Minor comments, but great to learn about writing migrations as Java classes for Flyway! I think backfills in this format make a lot of sense 💯 💯 🥇
private static final String FIND_AIRFLOW_PARENT_RUNS_SQL =
    """
    SELECT DISTINCT(run_uuid) AS run_uuid,
           e.parent_run_id,
Oh wait, this is the OpenLineage event!
public static final String INSERT_PARENT_RUN_QUERY =
    """
    INSERT INTO runs (uuid, created_at, updated_at, current_run_state, external_id, namespace_name, job_name, job_uuid, location, transitioned_at, started_at, ended_at)
    SELECT :parentRunUuid, created_at, updated_at, current_run_state, :externalRunid, :namespace, :jobName, :parentJobUuid, location, transitioned_at, started_at, ended_at
Typo: externalRunid -> externalRunId (though looks like it still works, so maybe more of a naming convention thing)
The same typo is present here and in the binding on line 147. If the two names didn't match, it would throw an exception.
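To illustrate why a consistently misspelled parameter name still works, here is a minimal sketch in plain Java. It is not JDBI and does not reproduce JDBI's internals or exception types; `NamedParameterSketch`, `checkBindings`, and the shortened SQL string are hypothetical names made up for this illustration. It only shows the general idea that every `:name` placeholder in the SQL must have a binding under exactly the same name.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedParameterSketch {
  // Matches :name placeholders; the lookbehind skips PostgreSQL casts such as ::text.
  private static final Pattern NAMED_PARAM =
      Pattern.compile("(?<!:):([A-Za-z_][A-Za-z0-9_]*)");

  // Hypothetical check: throws if a placeholder in the SQL has no binding of the same name.
  static void checkBindings(String sql, Map<String, Object> bindings) {
    Matcher m = NAMED_PARAM.matcher(sql);
    while (m.find()) {
      String name = m.group(1);
      if (!bindings.containsKey(name)) {
        throw new IllegalStateException(
            "Missing named parameter '" + name + "' in bindings " + bindings.keySet());
      }
    }
  }

  public static void main(String[] args) {
    // Shortened, illustrative statement; only the placeholder spelling matters here.
    String sql = "SELECT :parentRunUuid, :externalRunid";

    // Same (misspelled) name on both sides: the check passes.
    checkBindings(sql, Map.of("parentRunUuid", "p1", "externalRunid", "e1"));
    System.out.println("consistent spelling: ok");

    // Fixing the name on one side only breaks the match.
    try {
      checkBindings(sql, Map.of("parentRunUuid", "p1", "externalRunId", "e1"));
    } catch (IllegalStateException e) {
      System.out.println("mismatch: " + e.getMessage());
    }
  }
}
```

The practical consequence for the rename suggested above is that the placeholder in the SQL and the binding call would have to be changed together.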
import org.jdbi.v3.core.result.ResultProducers;

@Slf4j
public class V44_2_BackfillJobsWithParents implements JavaMigration {
Minor: I'd make it explicit that it's for parent runs: BackfillJobsWithParentRuns
    WHERE job_name=j.name AND namespace_name=j.namespace_name
    ORDER BY transitioned_at DESC
    LIMIT 1
) r ON true
Fancy. Not sure how ON true works though?
See https://medium.com/kkempin/postgresqls-lateral-join-bfd6bd0199df for background. Basically, the lateral join subquery is executed for each record returned from the jobs table, which is why we can reference j.name and j.namespace_name here on line 28. Since each returned run is computed from the jobs record it's supposed to be joined to, there's no need to do anything more to join, like compare job_uuid. Hence ON true.
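The per-row evaluation described above can be sketched in plain Java, without a database. This is only an analogy under stated assumptions: `Job`, `Run`, `JobWithRun`, and the method names are hypothetical stand-ins, not Marquez classes, and the sketch assumes an inner join (the excerpt does not show whether the migration uses JOIN LATERAL or LEFT JOIN LATERAL). The loop plays the role of the outer jobs scan, and `latestRunFor` plays the role of the lateral subquery that can see the current job row.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LateralJoinSketch {
  // Hypothetical, simplified stand-ins for rows of the jobs and runs tables.
  record Job(String namespaceName, String name) {}
  record Run(String namespaceName, String jobName, Instant transitionedAt) {}
  record JobWithRun(Job job, Run run) {}

  // Plays the role of the lateral subquery: it is evaluated once per job row `j`
  // and may reference that row's columns in its filter.
  static Optional<Run> latestRunFor(Job j, List<Run> runs) {
    return runs.stream()
        .filter(r -> r.jobName().equals(j.name())
            && r.namespaceName().equals(j.namespaceName()))
        // Same effect as ORDER BY transitioned_at DESC LIMIT 1.
        .max(Comparator.comparing(Run::transitionedAt));
  }

  static List<JobWithRun> joinLatestRuns(List<Job> jobs, List<Run> runs) {
    List<JobWithRun> result = new ArrayList<>();
    for (Job j : jobs) {
      // "ON true": whatever the subquery produced for this job is already the
      // right match, so no further join condition is applied.
      // Assumed inner-join behavior: jobs without any run produce no output row.
      latestRunFor(j, runs).ifPresent(r -> result.add(new JobWithRun(j, r)));
    }
    return result;
  }

  public static void main(String[] args) {
    List<Job> jobs = List.of(new Job("ns", "etl"), new Job("ns", "never_ran"));
    List<Run> runs = List.of(
        new Run("ns", "etl", Instant.parse("2022-01-01T00:00:00Z")),
        new Run("ns", "etl", Instant.parse("2022-02-01T00:00:00Z")));
    for (JobWithRun row : joinLatestRuns(jobs, runs)) {
      System.out.println(row.job().name() + " -> " + row.run().transitionedAt());
    }
  }
}
```

In PostgreSQL, the ON clause is still syntactically required for JOIN LATERAL and LEFT JOIN LATERAL, which is why the always-true condition is spelled out rather than omitted.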
assertThat(jobByName)
    .isPresent()
    .get()
    .hasFieldOrPropertyWithValue("name", new JobName(parentName + "." + task1Name));
…Airflow runs
Signed-off-by: Michael Collado <collado.mike@gmail.com>
a53fe74 to 55e0c6e
…#1980)
* Add migrations to support job parent relationship storage
* Update all job and run queries to reference jobs_view and runs_view
* Remove references to simple_name as job redirects handle redirecting simple name to fqn; added unit test to verify
* Fix runs migration script
* Add java migrations for backfilling runs with job uuids and backfill Airflow runs
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Problem
A continuation of https://github.com/MarquezProject/marquez/pull/1935/files, this PR adds the migration scripts that populate job parents for Airflow runs and for other runs that have explicit parent run IDs. Airflow tasks get special handling to match the behavior in OpenLineage. Unit tests verify the behavior of the backfill scripts.
Checklist
CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
.sql database schema migration according to Flyway's naming convention (if relevant)